mlflow/mlflow
Voir sur GitHub[FR] Additional built-in LLM judges for safety, coherence, agent planning, ...
Open
#19 061 ouverte le 26 nov. 2025
area/evaluationdomain/genaienhancementhelp wanted
Métriques du dépôt
- Stars
- (17 127 stars)
- Métriques de merge PR
- (Merge moyen 2j 13h) (367 PRs mergées en 30 j)
Description
Willingness to contribute
I cannot contribute this myself, and am requesting help from other contributors
Description
Expand MLflow's built-in judge library.
These judges should be production-ready, well-tested, and work out-of-the-box with minimal configuration.
Proposed Solution
Add the following built-in judges to mlflow.genai.scorers:
1. Conversational Safety (multi-turn)
mlflow.genai.scorers.ConversationSafety()
- Goal: Evaluate safety across the entire conversation
- Outcome: Binary pass/fail with identification of safety concerns
- Rationale: Critical for production deployment of conversational agents
2. Conversational Tool Call Efficiency (multi-turn)
mlflow.genai.scorers.ConversationalToolCallEfficiency()
- Goal: Assess tool usage efficiency across the full conversation session
- Outcome: Binary pass/fail on session-level tool call optimization
- Rationale: Extends single-turn efficiency to full conversation context
3. Conversational Role Adherence (multi-turn)
mlflow.genai.scorers.ConversationalRoleAdherence()
- Goal: Verify agent maintains its assigned role throughout the conversation
- Outcome: Binary pass/fail on role consistency
- Rationale: Ensures agents stay within defined boundaries and personas
5. Conversational Coherence (multi-turn)
mlflow.genai.scorers.ConversationalCoherence()
- Goal: Assess logical flow and consistency across the entire conversation
- Outcome: Binary pass/fail on session-level coherence
- Rationale: Multi-turn extension of single-turn coherence evaluation
6. Agent Plan Quality (multi-turn)
mlflow.genai.scorers.AgentPlanQuality()
- Goal: Evaluate the quality of agent's action planning and reasoning
- Outcome: Binary pass/fail with assessment of planning effectiveness
- Rationale: Important for agentic systems that decompose tasks into steps
Implementation Considerations
- Consistent API: All judges should follow the same calling convention as existing built-in judges
- Documentation: Comprehensive examples showing when to use each judge
Related
- Extends #19056 (additional built-in judges)
- Builds on #19052 (make_judge for conversations)
- Works with #19055 (offline evaluation)
- Integrates with #19058 (judge builder UI)