Follow-up from the Codex review on #858. Two findings were correct but intentionally deferred out of that PR (rather than fixed) because each needs a deliberate decision, not a mechanical fix. Capturing them here so "deferred" doesn't quietly become "dropped."
Both live in scripts/eval/cross_model_review/.
1. Renamed-file blame uses the new-side path → corpus under-sampling
build_corpus.py parse_hunk_ranges stores mg.group(2) — the b/... (new-side) path from diff --git a/old b/new. But blame runs against the fix's parent (fix_sha^), where a renamed file only exists under the old (a/...) path. So for a fix: commit that renames a file, blame_candidates blames a path that doesn't exist in the parent, silently drops those hunks, and the entry is missed.
- Impact: under-samples renamed-file fixes in the Tier-3 corpus. Not a wrong-results bug — scored entries are still correct — which is why it was deferred despite the P1 badge.
- Likely fix: use
mg.group(1) (pre-image path) for parent blame; for non-renames a == b so nothing changes. Add a rename fixture (diff --git a/old.py b/new.py with an edit hunk) and assert a candidate is produced.
- Codex finding: "Blame fix-parent using pre-image path for renamed files" (
build_corpus.py:82).
2. Finding UID is content-based, not per-trial — methodology decision
run_arms.py _finding_uid hashes (arm, doc_id, fid, text) with no trial component. With the decision-grade methodology now at trials_per_arm: 3, an identical finding surfaced in multiple trials collapses to one UID in the provenance map, so yield-score counts it once.
This is a design decision, not obviously a bug: the content-based UID is deliberate (it dedups and keeps the judge blind). The open question is what a finding repeated across trials should mean for yield:
- Count once (current): "did this arm ever surface this bug" — a reliability-agnostic view.
- Count per-trial (Codex's suggestion): add trial to the UID — rewards consistency, but inflates raw yield for an arm that repeats itself.
We should decide explicitly and document it in the README's yield section, then make the code match. Whichever way, add a multi-trial test asserting the chosen behavior.
- Codex finding: "Include trial in GT finding UID" (
run_arms.py, _finding_uid / gt_pool).
(The third deferred item from #858 — doc_id filename sanitization — is purely defensive and left out deliberately; corpus IDs are slugs today. Not tracked here unless we start accepting path-like IDs.)
Follow-up from the Codex review on #858. Two findings were correct but intentionally deferred out of that PR (rather than fixed) because each needs a deliberate decision, not a mechanical fix. Capturing them here so "deferred" doesn't quietly become "dropped."
Both live in
scripts/eval/cross_model_review/.1. Renamed-file blame uses the new-side path → corpus under-sampling
build_corpus.pyparse_hunk_rangesstoresmg.group(2)— theb/...(new-side) path fromdiff --git a/old b/new. But blame runs against the fix's parent (fix_sha^), where a renamed file only exists under the old (a/...) path. So for afix:commit that renames a file,blame_candidatesblames a path that doesn't exist in the parent, silently drops those hunks, and the entry is missed.mg.group(1)(pre-image path) for parent blame; for non-renamesa == bso nothing changes. Add a rename fixture (diff --git a/old.py b/new.pywith an edit hunk) and assert a candidate is produced.build_corpus.py:82).2. Finding UID is content-based, not per-trial — methodology decision
run_arms.py_finding_uidhashes(arm, doc_id, fid, text)with no trial component. With the decision-grade methodology now attrials_per_arm: 3, an identical finding surfaced in multiple trials collapses to one UID in theprovenancemap, soyield-scorecounts it once.This is a design decision, not obviously a bug: the content-based UID is deliberate (it dedups and keeps the judge blind). The open question is what a finding repeated across trials should mean for yield:
We should decide explicitly and document it in the README's yield section, then make the code match. Whichever way, add a multi-trial test asserting the chosen behavior.
run_arms.py,_finding_uid/gt_pool).(The third deferred item from #858 —
doc_idfilename sanitization — is purely defensive and left out deliberately; corpus IDs are slugs today. Not tracked here unless we start accepting path-like IDs.)