Default agent and judge models are identical — out-of-the-box self-evaluation

### Problem

In the configuration as shipped, the agent under test and the LLM judge end up being the **same model** — so the judge is grading its own outputs. This is self-preference bias [1] applied directly.

Concrete path:

- `src/agent/claude_agent/runner.py:36` and `cli.py:16` default the candidate model to `litellm_proxy/aws/claude-opus-4-6`.
- `src/agent/deep_agent/runner.py:39` and `cli.py:16` default to the same.
- `INSTRUCTIONS.md:244` and `docs/evaluation.md:90` use `--judge-model litellm_proxy/aws/claude-opus-4-6` as the canonical evaluate example.

A user following the README runs `claude-agent` (claude-opus-4-6) and evaluates with `--judge-model litellm_proxy/aws/claude-opus-4-6` — claude-opus-4-6 judging claude-opus-4-6. No user error required; this is the documented path.

The paper has the same issue in one row of its leaderboard: `llama-4-maverick` is both a candidate model and the chosen judge model.

### Why this is distinct from #281 and #296

- **#281** (multi-LLM panel judging) addresses single-judge variance via ensembling.
- **#296** (closed, judge-vs-human calibration) addresses whether the judge agrees with humans.
- This issue is narrower and prior to both: **regardless of which judge or how many, a candidate must not be its own judge.** Exclude-self compounds with either — it isn't a substitute for them.

### Reference

[1] Panickssery, A., Bowman, S. R., Feng, S. (2024). *LLM Evaluators Recognize and Favor Their Own Generations.* arXiv:2404.13076.

### Proposed solution

One principle: **the judge is held constant across the entire leaderboard, and the judge is not a candidate.** Pick a side — the same model cannot be both. Today's defaults make it both.

The concrete changes follow from that:

1. **Decide which side claude-opus-4-6 is on.** Either:
   - Keep it as the judge → remove it as the default candidate for `claude-agent` and `deep-agent` (swap to a different model, e.g. `litellm_proxy/azure/gpt-5.4` like `openai-agent`), and document that claude-opus-4-6 is reserved as the leaderboard judge and is not benchmarked as a candidate, OR
   - Keep it as a candidate → switch the documented judge in `INSTRUCTIONS.md:244` and `docs/evaluation.md:90` to a model that is not in the candidate pool.

2. **Add a guard in the eval CLI.** When `--judge-model` resolves to the same identifier as the trajectory's `model` field for a row, abort that row with a clear error (not a silent warning). This catches the violation however the configuration came about: config drift, future model swaps, custom runners.

3. **Shrink the LLM judge's surface area.** Whether the agent called the right tool, in the right order, on the right asset / time window is deterministically checkable against the persisted trajectory and the `execution_steps` / `execution_links` ground truth. `src/evaluation/scorers/code_based.py` currently raises `NotImplementedError`. Wiring it up lets `llm_judge` keep only the genuinely subjective parts (reasoning quality, justification) — and any residual self-eval bias rides on a much smaller fraction of the score.

(1) and (2) together solve the self-judging problem outright. (3) is independent and compounds — smaller LLM-judge surface means smaller impact from any residual single-model bias, including this one.
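To make (3) concrete, here is a rough sketch of the kind of deterministic check a code-based scorer could run. It is not the contents of `src/evaluation/scorers/code_based.py`; the function name `score_tool_sequence`, the flat list-of-tool-names input, and the greedy in-order matching rule are all assumptions for illustration. The real `execution_steps` / `execution_links` schema may need asset and time-window comparison as well.

```python
from typing import Sequence


def score_tool_sequence(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Hypothetical scorer: fraction of expected tool names found in `actual`
    in the same relative order (greedy left-to-right match)."""
    if not expected:
        # Nothing was required, so there is nothing to get wrong.
        return 1.0

    matched = 0
    search_from = 0  # only look at calls made after the previous match
    for tool in expected:
        try:
            found_at = actual.index(tool, search_from)
        except ValueError:
            # Expected tool is missing, or was called out of order.
            continue
        matched += 1
        search_from = found_at + 1

    return matched / len(expected)


if __name__ == "__main__":
    # Placeholder tool names; a real run would read them from the trajectory.
    ground_truth = ["tool_a", "tool_b", "tool_c"]
    agent_calls = ["tool_a", "tool_c", "tool_b"]
    print(score_tool_sequence(ground_truth, agent_calls))
```

Extra calls between expected ones are tolerated here; whether they should be penalized is a design decision for the maintainers.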

Happy to send the PR for (2) — the eval-CLI guard is the smallest patch and prevents regression regardless of which way you go on (1).
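A minimal sketch of what the guard in (2) might look like, to anchor the discussion. The names `SelfJudgingError`, `normalize_model_id`, and `assert_judge_is_not_candidate` are hypothetical, and I am assuming each trajectory row is available as a dict with a `model` key and that identifiers can be compared after trimming and lowercasing. The actual eval CLI may resolve model names differently.

```python
from typing import Any, Mapping


class SelfJudgingError(ValueError):
    """Hypothetical error: the judge model also produced the trajectory."""


def normalize_model_id(model_id: str) -> str:
    # Assumption: two identifiers name the same model if they are equal
    # after trimming whitespace and lowercasing.
    return model_id.strip().lower()


def assert_judge_is_not_candidate(
    judge_model: str, trajectory: Mapping[str, Any]
) -> None:
    """Raise instead of warning when a row would be judged by its own model."""
    candidate = trajectory.get("model")
    if candidate is None:
        # Assumption: rows without a recorded model are not checked here.
        return
    if normalize_model_id(candidate) == normalize_model_id(judge_model):
        raise SelfJudgingError(
            f"judge model {judge_model!r} is the same as the candidate model "
            f"for this trajectory; pick a different --judge-model"
        )


if __name__ == "__main__":
    row = {"model": "litellm_proxy/aws/claude-opus-4-6"}
    try:
        assert_judge_is_not_candidate("litellm_proxy/aws/claude-opus-4-6", row)
    except SelfJudgingError as err:
        print(f"row aborted: {err}")
```

Exact string comparison will miss the same model reached through two different provider prefixes, so the PR may need a stricter notion of "same model" than this sketch uses.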

