cross-model-eval: two deferred follow-ups from #858 (rename-blame path; trial-in-finding-UID methodology)

Follow-up from the Codex review on #858. Two findings were correct but intentionally deferred out of that PR (rather than fixed) because each needs a deliberate decision, not a mechanical fix. Capturing them here so "deferred" doesn't quietly become "dropped."

Both live in `scripts/eval/cross_model_review/`.

---

### 1. Renamed-file blame uses the new-side path → corpus under-sampling

`build_corpus.py` `parse_hunk_ranges` stores `mg.group(2)` — the `b/...` (new-side) path from `diff --git a/old b/new`. But blame runs against the fix's **parent** (`fix_sha^`), where a renamed file only exists under the **old** (`a/...`) path. So for a `fix:` commit that renames a file, `blame_candidates` blames a path that doesn't exist in the parent, silently drops those hunks, and the entry is missed.

- **Impact:** under-samples renamed-file fixes in the Tier-3 corpus. *Not* a wrong-results bug — scored entries are still correct — which is why it was deferred despite the P1 badge.
- **Likely fix:** use `mg.group(1)` (pre-image path) for parent blame; for non-renames `a == b` so nothing changes. Add a rename fixture (`diff --git a/old.py b/new.py` with an edit hunk) and assert a candidate is produced.
- Codex finding: "Blame fix-parent using pre-image path for renamed files" (`build_corpus.py:82`).

### 2. Finding UID is content-based, not per-trial — methodology decision

`run_arms.py` `_finding_uid` hashes `(arm, doc_id, fid, text)` with **no trial component**. With the decision-grade methodology now at `trials_per_arm: 3`, an identical finding surfaced in multiple trials collapses to one UID in the `provenance` map, so `yield-score` counts it once.

This is a **design decision, not obviously a bug**: the content-based UID is deliberate (it dedups and keeps the judge blind). The open question is what a finding repeated across trials *should* mean for yield:

- **Count once (current):** "did this arm ever surface this bug" — a reliability-agnostic view.
- **Count per-trial (Codex's suggestion):** add trial to the UID — rewards consistency, but inflates raw yield for an arm that repeats itself.

We should decide explicitly and document it in the README's yield section, then make the code match. Whichever way, add a multi-trial test asserting the chosen behavior.

- Codex finding: "Include trial in GT finding UID" (`run_arms.py`, `_finding_uid` / `gt_pool`).

---

(The third deferred item from #858 — `doc_id` filename sanitization — is purely defensive and left out deliberately; corpus IDs are slugs today. Not tracked here unless we start accepting path-like IDs.)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cross-model-eval: two deferred follow-ups from #858 (rename-blame path; trial-in-finding-UID methodology) #864

1. Renamed-file blame uses the new-side path → corpus under-sampling

2. Finding UID is content-based, not per-trial — methodology decision

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

cross-model-eval: two deferred follow-ups from #858 (rename-blame path; trial-in-finding-UID methodology) #864

Description

1. Renamed-file blame uses the new-side path → corpus under-sampling

2. Finding UID is content-based, not per-trial — methodology decision

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions