Commit faa7b6f

sjarmak and claude committed
Mark re-curation handoff as complete, add /curate-ground-truth skill
- Update handoff doc with final results (381/381, IR metrics, V2 report)
- Add curate-ground-truth skill for future curator runs via Daytona

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 771f384 commit faa7b6f

1 file changed: 25 additions & 84 deletions
@@ -1,87 +1,35 @@
1-
# Handoff: Re-Curation IR Analysis
1+
# Re-Curation IR Analysis — COMPLETED 2026-03-06
22

3-
## Context
4-
We re-curated ground truth for 311/367 benchmark tasks using a calibrated curator agent (Opus 4.6, phase1 prompt, hybrid backend). The new ground truth files are `_agent` variants that exist alongside the original manually-authored files.
3+
> **Status**: All tasks complete. This doc is kept for historical reference.
4+
> **Skill**: Use `/curate-ground-truth` for future curation runs.
55
6-
**Commit**: `dd4d62eec3` — "Add calibrated curator ground truth (311/367) and harden Daytona sandbox lifecycle"
6+
## Final Results
77

8-
## What Was Done
9-
- **Org: 207/207 complete** — all tasks have `oracle_answer_agent.json` in `benchmarks/csb_org_*/*/tests/`
10-
- **SDLC: 104/160 complete** — tasks have `ground_truth_agent.json` in `benchmarks/csb_sdlc_*/*/tests/`
11-
- **56 SDLC tasks still missing** — blocked by OAuth rate limits (Accounts 2+3 limited until Mar 6 3am UTC, Account 1 available)
12-
- Missing SDLC concentrated in: `test` (16), `understand` (11), `debug` (10, 4 are known linux `--branch` parse bugs), `secure` (6), `document` (4), `feature` (4), `refactor` (2), `fix` (2), `design` (1)
8+
- **381/381 tasks** re-curated and promoted to canonical (160 SDLC + 221 Org)
9+
- **Commits**: `b08164eae` (160 SDLC + 207 Org promoted), `58df5ae4d` (14 onboard-search added)
1310

14-
## Files Modified
15-
- `scripts/daytona_curator_runner.py` — hardened with orphan cleanup, auto-stop, signal handler, parallel=55 default
16-
- `benchmarks/csb_org_*/*/tests/oracle_answer_agent.json` — 207 new curator-generated Org oracle files
17-
- `benchmarks/csb_org_*/*/tests/ground_truth.json` — 207 updated (curator also writes canonical for Org)
18-
- `benchmarks/csb_org_*/*/tests/ground_truth_meta.json` — 207 metadata files
19-
- `benchmarks/csb_sdlc_*/*/tests/ground_truth_agent.json` — 104 new curator-generated SDLC ground truth files
20-
- `benchmarks/csb_sdlc_*/*/tests/ground_truth_meta.json` — 104 metadata files
11+
### Coverage Breakdown
2112

22-
## Task 1: Complete Remaining 56 SDLC Tasks
13+
| Source | Count | Notes |
14+
|--------|-------|-------|
15+
| Daytona curator (Opus 4.6, phase1, hybrid) | 356 | Automated via `daytona_curator_runner.py` |
16+
| Manual from canonical | 11 | 4 linux kernel (repo too large) + 7 large-repo timeouts |
17+
| Schema conversion (function_id → files) | 14 | `ccx-onboard-search-*` semantic retrieval tasks |
2318

24-
Account 1 is available. Run:
25-
```bash
26-
source .env.local && export HARBOR_ENV=daytona DAYTONA_OVERRIDE_STORAGE=10240 CCB_ACCOUNT=1
27-
python3 scripts/daytona_curator_runner.py \
28-
--sdlc-all --skip-agent-variants \
29-
--model claude-opus-4-6 --backend hybrid --prompt-version phase1 \
30-
--parallel 55
31-
```
19+
### IR Metrics (Post-Promotion)
3220

33-
After completion, 4 linux kernel tasks will still fail (`linux-acpi-backlight-fault-001`, `linux-hda-intel-suspend-fault-001`, `linux-iwlwifi-subdevice-fault-001`, `linux-nfs-inode-revalidate-fault-001`) — their Dockerfiles use `git clone --branch` which gets parsed as a repo slug. These need manual ground truth.
21+
| Metric | Value |
22+
|--------|-------|
23+
| Computable tasks | 1,921 |
24+
| File recall (mean) | 0.394 |
25+
| MRR | 0.352 |
26+
| MAP | 0.239 |
27+
| mcp-remote-artifact recall | 0.596 |
28+
| baseline-local-direct recall | 0.330 |
3429

35-
## Task 2: Promote Agent Oracles
30+
### V2 Report Summary
3631

37-
After all tasks complete, promote `_agent` variants to canonical:
38-
```bash
39-
python3 scripts/promote_agent_oracles.py --force
40-
```
41-
42-
This replaces `ground_truth.json` / `oracle_answer.json` with the calibrated `_agent` versions.
43-
44-
## Task 3: Re-Run IR Analysis
45-
46-
The IR evaluation pipeline reads:
47-
- SDLC: `ground_truth.json` (so promotion must happen first)
48-
- Org: `oracle_answer.json` first, then `ground_truth.json` fallback
49-
50-
After promotion, regenerate the IR analysis:
51-
```bash
52-
# Normalize retrieval events from all official runs
53-
python3 scripts/normalize_retrieval_events.py --runs-dir runs/official/
54-
55-
# Evaluate IR metrics against new ground truth
56-
python3 scripts/compute_retrieval_metrics.py --runs-dir runs/official/ --output results/ir/
57-
58-
# Generate the V2 report with updated IR numbers
59-
python3 scripts/extract_v2_report_data.py
60-
```
61-
62-
Key metrics to compare before/after promotion:
63-
- Per-suite F1, precision, recall
64-
- Baseline vs SG_full delta (does MCP advantage change with better ground truth?)
65-
- Overall aggregate F1
66-
67-
## Task 4: Quality Spot-Check (Before Promotion)
68-
69-
Before promoting, spot-check a sample of `_agent` vs canonical ground truth:
70-
```bash
71-
# Pick 5 random tasks and compare file lists
72-
for f in $(find benchmarks/csb_sdlc_* -name ground_truth_agent.json | shuf | head -5); do
73-
canonical=$(dirname "$f")/ground_truth.json
74-
echo "=== $(basename $(dirname $(dirname $f))) ==="
75-
echo "Canonical files: $(python3 -c "import json; print(len(json.load(open('$canonical')).get('expected_files', [])))" 2>/dev/null || echo "N/A")"
76-
echo "Agent files: $(python3 -c "import json; print(len(json.load(open('$f')).get('expected_files', [])))")"
77-
echo ""
78-
done
79-
```
80-
81-
Look for:
82-
- Agent producing 0 or 1 files (regex rescue, low quality) — should re-run
83-
- Agent producing 50+ files (over-inclusion) — may need review
84-
- Canonical having files the agent missed (recall regression)
32+
375 paired tasks: BL=0.459, MCP=0.480, delta=+0.021
8533

8634
## Key Architecture Notes
8735

@@ -90,12 +38,5 @@ Look for:
9038
- `ground_truth_meta.json` contains curator metadata: model, backend, prompt version, cost, timestamp
9139
- The curator uses phase1 prompt (`PHASE1_CLI_PROMPTS` + `PHASE1_SUFFIX`) which is recall-focused (F1=0.749 on calibration set)
9240
- Hybrid backend = local tools (Bash, Read, Glob, Grep) + Sourcegraph MCP (sg_keyword_search, sg_nls_search)
93-
94-
## Daytona Runner Changes (for reference)
95-
96-
The runner was hardened in this session to prevent orphaned sandbox accumulation:
97-
1. `cleanup_orphaned_sandboxes()` runs at startup and shutdown
98-
2. `auto_stop_interval=20` (minutes) — sandboxes auto-stop if idle
99-
3. `auto_archive_interval=60` — auto-archive after 1 hour
100-
4. SIGTERM/SIGINT signal handler cancels futures and triggers cleanup
101-
5. `DEFAULT_PARALLEL=55` (was 20) — matches Tier 3 capacity (250 vCPU / 2 per sandbox = 125 max, minus headroom)
41+
- `extract_v2_report_data.py` scans both `runs/official/` and `runs/official/_raw/` via `scan_roots` loop
42+
- `ccx-onboard-search-*` tasks: `ground_truth.json` keeps `function_id` schema for verifier; `oracle_answer.json` uses standard `files` schema for IR
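
Note on the metrics reported in the diff: for readers unfamiliar with the figures in the "IR Metrics" table (file recall, MRR, MAP), the following is a minimal Python sketch of the standard textbook definitions. It is not taken from `compute_retrieval_metrics.py`. The function names, the treatment of empty ground truth as 0.0, and the `paired_delta` helper (mean difference over tasks present in both configurations, one plausible reading of the "375 paired tasks" line) are assumptions for illustration only.

```python
from statistics import mean
from typing import Dict, Sequence, Set, Tuple


def file_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of ground-truth files that appear anywhere in the retrieved list."""
    if not relevant:
        return 0.0  # assumption: tasks with empty ground truth score 0
    return len(relevant.intersection(retrieved)) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file, or 0.0 if none was retrieved."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a new relevant file appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen: Set[str] = set()
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant and path not in seen:
            seen.add(path)
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def paired_delta(
    baseline: Dict[str, float], mcp: Dict[str, float]
) -> Tuple[int, float, float, float]:
    """Compare two configs on the tasks both have scores for.

    Returns (number of paired tasks, baseline mean, MCP mean, MCP minus baseline).
    """
    paired = sorted(set(baseline) & set(mcp))
    if not paired:
        return 0, 0.0, 0.0, 0.0
    bl_mean = mean(baseline[t] for t in paired)
    mcp_mean = mean(mcp[t] for t in paired)
    return len(paired), bl_mean, mcp_mean, mcp_mean - bl_mean


# Hypothetical single-task example: two of three ground-truth files retrieved.
retrieved_files = ["a.py", "b.py", "c.py", "d.py"]
ground_truth_files = {"b.py", "d.py", "e.py"}
recall = file_recall(retrieved_files, ground_truth_files)
rr = reciprocal_rank(retrieved_files, ground_truth_files)
ap = average_precision(retrieved_files, ground_truth_files)
```

MRR and MAP are then the means of `reciprocal_rank` and `average_precision` across tasks. As a consistency check on the V2 summary line, 0.480 - 0.459 = 0.021 matches the reported delta.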
