mlflow/mlflow

[FR] Additional built-in LLM judges for safety, coherence, agent planning, ...

Open

#19 061 ouverte le 26 nov. 2025

Voir sur GitHub
 (3 commentaires) (0 réactions) (1 assigné)Python (3 904 forks)batch import
area/evaluationdomain/genaienhancementhelp wanted

Métriques du dépôt

Stars
 (17 127 stars)
Métriques de merge PR
 (Merge moyen 2j 13h) (367 PRs mergées en 30 j)

Description

Willingness to contribute

I cannot contribute this myself, and am requesting help from other contributors

Description

Expand MLflow's built-in judge library.

These judges should be production-ready, well-tested, and work out-of-the-box with minimal configuration.

Proposed Solution

Add the following built-in judges to mlflow.genai.scorers:

1. Conversational Safety (multi-turn)

mlflow.genai.scorers.ConversationSafety()
  • Goal: Evaluate safety across the entire conversation
  • Outcome: Binary pass/fail with identification of safety concerns
  • Rationale: Critical for production deployment of conversational agents

2. Conversational Tool Call Efficiency (multi-turn)

mlflow.genai.scorers.ConversationalToolCallEfficiency()
  • Goal: Assess tool usage efficiency across the full conversation session
  • Outcome: Binary pass/fail on session-level tool call optimization
  • Rationale: Extends single-turn efficiency to full conversation context

3. Conversational Role Adherence (multi-turn)

mlflow.genai.scorers.ConversationalRoleAdherence()
  • Goal: Verify agent maintains its assigned role throughout the conversation
  • Outcome: Binary pass/fail on role consistency
  • Rationale: Ensures agents stay within defined boundaries and personas

5. Conversational Coherence (multi-turn)

mlflow.genai.scorers.ConversationalCoherence()
  • Goal: Assess logical flow and consistency across the entire conversation
  • Outcome: Binary pass/fail on session-level coherence
  • Rationale: Multi-turn extension of single-turn coherence evaluation

6. Agent Plan Quality (multi-turn)

mlflow.genai.scorers.AgentPlanQuality()
  • Goal: Evaluate the quality of agent's action planning and reasoning
  • Outcome: Binary pass/fail with assessment of planning effectiveness
  • Rationale: Important for agentic systems that decompose tasks into steps

Implementation Considerations

  1. Consistent API: All judges should follow the same calling convention as existing built-in judges
  2. Documentation: Comprehensive examples showing when to use each judge

Related

  • Extends #19056 (additional built-in judges)
  • Builds on #19052 (make_judge for conversations)
  • Works with #19055 (offline evaluation)
  • Integrates with #19058 (judge builder UI)

Guide contributeur