Mark re-curation handoff as complete, add /curate-ground-truth skill
- Update handoff doc with final results (381/381, IR metrics, V2 report)
- Add curate-ground-truth skill for future curator runs via Daytona
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
We re-curated ground truth for 311/367 benchmark tasks using a calibrated curator agent (Opus 4.6, phase1 prompt, hybrid backend). The new ground-truth files are `_agent` variants that exist alongside the original manually authored files.
+> **Status**: All tasks complete. This doc is kept for historical reference.
+> **Skill**: Use `/curate-ground-truth` for future curation runs.

-**Commit**: `dd4d62eec3` ("Add calibrated curator ground truth (311/367) and harden Daytona sandbox lifecycle")
+## Final Results

-## What Was Done
-- **Org: 207/207 complete**: all tasks have `oracle_answer_agent.json` in `benchmarks/csb_org_*/*/tests/`
-- **SDLC: 104/160 complete**: tasks have `ground_truth_agent.json` in `benchmarks/csb_sdlc_*/*/tests/`
-- **56 SDLC tasks still missing**: blocked by OAuth rate limits (Accounts 2+3 limited until Mar 6 3am UTC, Account 1 available)
-- Missing SDLC concentrated in: `test` (16), `understand` (11), `debug` (10, 4 are known linux `--branch` parse bugs), `secure` (6), `document` (4), `feature` (4), `refactor` (2), `fix` (2), `design` (1)
+- **381/381 tasks** re-curated and promoted to canonical (160 SDLC + 221 Org)
After completion, 4 linux kernel tasks will still fail (`linux-acpi-backlight-fault-001`, `linux-hda-intel-suspend-fault-001`, `linux-iwlwifi-subdevice-fault-001`, `linux-nfs-inode-revalidate-fault-001`) — their Dockerfiles use `git clone --branch` which gets parsed as a repo slug. These need manual ground truth.
+| Metric | Value |
+|--------|-------|
+| Computable tasks | 1,921 |
+| File recall (mean) | 0.394 |
+| MRR | 0.352 |
+| MAP | 0.239 |
+| mcp-remote-artifact recall | 0.596 |
+| baseline-local-direct recall | 0.330 |
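
The handoff does not show how these metrics are computed. As a rough illustration only, the sketch below uses the standard textbook definitions of file recall, reciprocal rank (averaged over tasks to get MRR), and average precision (averaged to get MAP). The function names and the sample file paths are hypothetical and are not taken from the repository's IR pipeline.

```python
from typing import Sequence, Set


def file_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of ground-truth files that appear anywhere in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file, or 0.0 if none was retrieved."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a new relevant file appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    seen: Set[str] = set()
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant and path not in seen:
            seen.add(path)
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical single task: two of three ground-truth files were retrieved.
retrieved = ["a.py", "b.py", "c.py", "d.py"]
relevant = {"b.py", "d.py", "e.py"}
recall = file_recall(retrieved, relevant)
rr = reciprocal_rank(retrieved, relevant)
ap = average_precision(retrieved, relevant)
```

Under these definitions, MRR and MAP are the means of `rr` and `ap` across all computable tasks. Whether the actual pipeline handles duplicates or empty ground truth the same way is an assumption.
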

-## Task 2: Promote Agent Oracles
+### V2 Report Summary

-After all tasks complete, promote `_agent` variants to canonical:
-```bash
-python3 scripts/promote_agent_oracles.py --force
-```
-
-This replaces `ground_truth.json` / `oracle_answer.json` with the calibrated `_agent` versions.
-
-## Task 3: Re-Run IR Analysis
-
-The IR evaluation pipeline reads:
-- SDLC: `ground_truth.json` (so promotion must happen first)
-- Org: `oracle_answer.json` first, then `ground_truth.json` fallback
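
The lookup order described in the removed lines above can be sketched as follows. This is a minimal illustration, assuming each task has a `tests/` directory as shown in the benchmark paths. The function name `resolve_ground_truth` and the `suite` argument are hypothetical, not part of the actual IR pipeline.

```python
from pathlib import Path
from typing import Optional


def resolve_ground_truth(tests_dir: Path, suite: str) -> Optional[Path]:
    """Return the ground-truth file to read for a task, or None if absent.

    suite is "org" or "sdlc" (hypothetical labels for the two benchmark families).
    """
    if suite == "org":
        # Org tasks: prefer oracle_answer.json, fall back to ground_truth.json.
        candidates = ["oracle_answer.json", "ground_truth.json"]
    else:
        # SDLC tasks: only ground_truth.json is read, hence promotion comes first.
        candidates = ["ground_truth.json"]
    for name in candidates:
        path = tests_dir / name
        if path.is_file():
            return path
    return None
```

Because SDLC tasks have no fallback in this sketch, an unpromoted `ground_truth_agent.json` would never be picked up, which matches the note that promotion must happen before the IR analysis is regenerated.
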
-
-After promotion, regenerate the IR analysis:
-```bash
-# Normalize retrieval events from all official runs