Default agent and judge models are identical — out-of-the-box self-evaluation #336

@harshalmore31

Problem

In the configuration as shipped, the agent under test and the LLM judge end up being the same model — so the judge is grading its own outputs. This is self-preference bias [1] applied directly.

Concrete path:

  • src/agent/claude_agent/runner.py:36 and cli.py:16 default the candidate model to litellm_proxy/aws/claude-opus-4-6.
  • src/agent/deep_agent/runner.py:39 and cli.py:16 default to the same.
  • INSTRUCTIONS.md:244 and docs/evaluation.md:90 use --judge-model litellm_proxy/aws/claude-opus-4-6 as the canonical evaluate example.

A user following the README runs claude-agent (claude-opus-4-6) and evaluates with --judge-model litellm_proxy/aws/claude-opus-4-6 — claude-opus-4-6 judging claude-opus-4-6. No user error required; this is the documented path.

The paper has the same issue in one row of the leaderboard: llama-4-maverick is both a candidate model and the chosen judge model.

Why this is distinct from #281 and #296

Reference

[1] Panickssery, A., Bowman, S. R., Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. arXiv:2404.13076.

Proposed solution

One principle: the judge is held constant across the entire leaderboard, and the judge is not a candidate. Pick a side — the same model cannot be both. Today's defaults make it both.

The concrete changes follow from that:

  1. Decide which side claude-opus-4-6 is on. Either:

    • Keep it as the judge → remove it as the default candidate for claude-agent and deep-agent (swap to a different model, e.g. litellm_proxy/azure/gpt-5.4 like openai-agent), and document that claude-opus-4-6 is reserved as the leaderboard judge and is not benchmarked as a candidate, OR
    • Keep it as a candidate → switch the documented judge in INSTRUCTIONS.md:244 and docs/evaluation.md:90 to a model that is not in the candidate pool.
  2. Add a guard in the eval CLI. When --judge-model resolves to the same identifier as the trajectory's model field for a row, abort that row with a clear error (not a silent warning). Catches the violation regardless of how the user configured things — config drift, future model swaps, custom runners.

  3. Shrink the LLM judge's surface area. Whether the agent called the right tool, in the right order, on the right asset / time window is deterministically checkable against the persisted trajectory and the execution_steps / execution_links ground truth. src/evaluation/scorers/code_based.py currently raises NotImplementedError. Wiring it up lets llm_judge keep only the genuinely subjective parts (reasoning quality, justification) — and any residual self-eval bias rides on a much smaller fraction of the score.
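
To make (3) concrete, here is a rough sketch of what a deterministic tool-order check could look like. It is not the repo's scorer: the `ToolCall` shape, the `score_tool_sequence` name, and the greedy in-order matching rule are all assumptions for illustration. The real trajectory and execution_steps schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical, simplified view of one step: which tool, on which asset."""
    tool: str
    asset: str


def score_tool_sequence(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Return the fraction of expected calls found in `actual` in order.

    Greedy left-to-right match: each expected call must appear after the
    previously matched one. Extra calls in `actual` are ignored.
    """
    if not expected:
        return 1.0  # nothing required, so nothing can be missed
    matched = 0
    position = 0
    for want in expected:
        for i in range(position, len(actual)):
            if actual[i] == want:
                matched += 1
                position = i + 1
                break
    return matched / len(expected)


# Hypothetical tool and asset names, for illustration only.
expected = [ToolCall("list_sensors", "pump-1"), ToolCall("read_history", "pump-1")]
in_order = list(expected)
swapped = list(reversed(expected))

print(score_tool_sequence(in_order, expected))
print(score_tool_sequence(swapped, expected))
```

Because a check like this never calls a model, the part of the score it covers cannot carry self-preference bias, whichever model ends up as the judge.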

(1) and (2) together solve the self-judging problem outright. (3) is independent and compounds — smaller LLM-judge surface means smaller impact from any residual single-model bias, including this one.

Happy to send the PR for (2) — the eval-CLI guard is the smallest patch and prevents regression regardless of which way you go on (1).
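
For reference, a minimal sketch of the guard in (2). The function and exception names are hypothetical, and comparing on the last path segment is my assumption about what "resolves to the same identifier" should mean; the PR would use whatever resolution the eval CLI already has.

```python
class SelfEvaluationError(ValueError):
    """Raised when the judge model is also the model that produced the row."""


def normalize_model_id(model_id: str) -> str:
    """Reduce a routed identifier to its final segment, lowercased.

    Assumption: 'litellm_proxy/aws/claude-opus-4-6' and 'claude-opus-4-6'
    name the same underlying model for the purposes of this check.
    """
    return model_id.strip().lower().rsplit("/", 1)[-1]


def assert_judge_is_not_candidate(judge_model: str, trajectory_model: str) -> None:
    """Abort with a clear error if judge and candidate are the same model."""
    if normalize_model_id(judge_model) == normalize_model_id(trajectory_model):
        raise SelfEvaluationError(
            f"Judge model {judge_model!r} matches the trajectory's model "
            f"{trajectory_model!r}; refusing to self-evaluate this row."
        )


# The documented default path today: both sides are claude-opus-4-6.
try:
    assert_judge_is_not_candidate(
        "litellm_proxy/aws/claude-opus-4-6",
        "litellm_proxy/aws/claude-opus-4-6",
    )
except SelfEvaluationError as exc:
    print(f"aborted: {exc}")
```

Raising instead of warning is deliberate: a warning is easy to miss in a batch run, while an error on the row keeps a self-judged score from reaching the leaderboard.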
