Skip to content

cross-model-eval: two deferred follow-ups from #858 (rename-blame path; trial-in-finding-UID methodology) #864

@jaybna

Description

@jaybna

Follow-up from the Codex review on #858. Two findings were correct but intentionally deferred out of that PR (rather than fixed) because each needs a deliberate decision, not a mechanical fix. Capturing them here so "deferred" doesn't quietly become "dropped."

Both live in scripts/eval/cross_model_review/.


1. Renamed-file blame uses the new-side path → corpus under-sampling

build_corpus.py parse_hunk_ranges stores mg.group(2) — the b/... (new-side) path from diff --git a/old b/new. But blame runs against the fix's parent (fix_sha^), where a renamed file only exists under the old (a/...) path. So for a fix: commit that renames a file, blame_candidates blames a path that doesn't exist in the parent, silently drops those hunks, and the entry is missed.

  • Impact: under-samples renamed-file fixes in the Tier-3 corpus. Not a wrong-results bug — scored entries are still correct — which is why it was deferred despite the P1 badge.
  • Likely fix: use mg.group(1) (pre-image path) for parent blame; for non-renames a == b so nothing changes. Add a rename fixture (diff --git a/old.py b/new.py with an edit hunk) and assert a candidate is produced.
  • Codex finding: "Blame fix-parent using pre-image path for renamed files" (build_corpus.py:82).

2. Finding UID is content-based, not per-trial — methodology decision

run_arms.py _finding_uid hashes (arm, doc_id, fid, text) with no trial component. With the decision-grade methodology now at trials_per_arm: 3, an identical finding surfaced in multiple trials collapses to one UID in the provenance map, so yield-score counts it once.

This is a design decision, not obviously a bug: the content-based UID is deliberate (it dedups and keeps the judge blind). The open question is what a finding repeated across trials should mean for yield:

  • Count once (current): "did this arm ever surface this bug" — a reliability-agnostic view.
  • Count per-trial (Codex's suggestion): add trial to the UID — rewards consistency, but inflates raw yield for an arm that repeats itself.

We should decide explicitly and document it in the README's yield section, then make the code match. Whichever way, add a multi-trial test asserting the chosen behavior.

  • Codex finding: "Include trial in GT finding UID" (run_arms.py, _finding_uid / gt_pool).

(The third deferred item from #858doc_id filename sanitization — is purely defensive and left out deliberately; corpus IDs are slugs today. Not tracked here unless we start accepting path-like IDs.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions