Commit 86644d2

Document ContextBench implementation pilot evidence
1 parent bbd3a83, commit 86644d2

1 file changed

Lines changed: 95 additions & 9 deletions

docs/benchmark.md

@@ -1,9 +1,95 @@
1-
# Discovery Benchmark
1+
# Benchmarks
22

3-
This page documents the current public discovery proof from the checked-in result artifacts on `master`.
3+
This page tracks two separate benchmark surfaces:
4+
5+
- The ContextBench implementation pilot, which uses the official ContextBench evaluator on one frozen task and five scoreable lanes.
6+
- The older discovery benchmark, which measures local discovery usefulness and payload cost only.
7+
8+
Neither section currently supports a broad benchmark-win claim.
9+
10+
## ContextBench Implementation Pilot
11+
12+
This is the current implementation-quality pilot for ContextBench. It is real scoreable evidence, but it is still a pilot because it covers one frozen task rather than the full frozen 20-task slice.
13+
14+
### Scope
15+
16+
- Protocol: `tests/fixtures/contextbench-benchmark-protocol.json`
17+
- Task manifest: `tests/fixtures/contextbench-task-manifest.json`
18+
- Selection file: `scripts/contextbench-five-lane-selections.json`
19+
- Workflow: `.github/workflows/contextbench-five-lane-score.yml`
20+
- Required lanes: `raw-native`, `codebase-context`, `codebase-memory-mcp`, `grepai`, `ripgrep-lexical`
21+
- Model used for selection: `gpt-5.4-mini-high`
22+
- Target task: `SWE-Bench-Pro__go__maintenance__bugfix__4df06349`
23+
- Repository under test: `navidrome/navidrome`
24+
- Base commit: `537e2fc033b71a4a69190b74f755ebc352bb4196`
25+
26+
`CodeGraphContext` is not counted in this five-lane pilot because its supported CLI path indexed successfully but returned zero task-relevant candidates during readiness. That remains a readiness blocker, not a quality result.
27+
28+
### Current Audited Run
29+
30+
- Run: `25663469903`
31+
- Job: `75329796667`
32+
- Commit: `bbd3a8348aaec15809fd09dd8fc729e64df6d878`
33+
- Artifact: `6915576867`
34+
- Artifact digest: `sha256:718fd32049a2d98ed62fb0c15189d7dc9f1b027c202f286923de91d9f8985def`
35+
- Artifact size: `88.9 KB`
36+
- Uploaded files: `42`
37+
- Status: `success`
38+
39+
The artifact contains `summary.json`, `publishable-summary.json`, `publishable-validation.json`, `humanized-summary.md`, logs, lane selections, lane predictions, and official evaluator score files. It intentionally excludes full cloned repos and evaluator caches so the evidence package is small enough to inspect.
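
A reader who downloads the artifact may want to confirm it matches the digest listed above. The following is a minimal Python sketch, not part of the repository's tooling. It assumes the published digest is computed over the raw bytes of the downloaded archive, and the helper names and any local file path passed to them are hypothetical.

```python
import hashlib
from pathlib import Path

# Digest string copied from the audited run listed above.
EXPECTED_DIGEST = "sha256:718fd32049a2d98ed62fb0c15189d7dc9f1b027c202f286923de91d9f8985def"


def sha256_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the file's digest in the same 'sha256:<hex>' form used above."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return f"sha256:{hasher.hexdigest()}"


def matches_expected(path: Path, expected: str = EXPECTED_DIGEST) -> bool:
    """Compare a local file (hypothetical path) against the published digest."""
    return sha256_digest(path) == expected
```

Calling `matches_expected(Path("artifact.zip"))` on a locally downloaded copy would return `True` only if the bytes are identical to what the workflow uploaded, under the assumption stated above.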
40+
41+
### Quality Results
42+
43+
Only rows scored by the official ContextBench evaluator are included here. Setup failures, tool errors, empty predictions, and judge failures are reliability outcomes, not quality rows.
44+
45+
| Lane | File cov | File prec | Span cov | Span prec | Line cov | Line prec |
46+
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
47+
| `raw-native` | 0.222 | 0.667 | 0.370 | 0.391 | 0.365 | 0.365 |
48+
| `codebase-context` | 0.889 | 1.000 | 0.899 | 0.356 | 0.887 | 0.323 |
49+
| `codebase-memory-mcp` | 0.222 | 0.667 | 0.346 | 0.380 | 0.315 | 0.337 |
50+
| `grepai` | 0.333 | 0.500 | 0.048 | 0.042 | 0.059 | 0.061 |
51+
| `ripgrep-lexical` | 0.222 | 0.667 | 0.401 | 0.341 | 0.419 | 0.302 |
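
The inclusion rule stated above the table (only evaluator-scored rows count as quality, everything else is a reliability outcome) can be sketched as a simple partition. This is an illustrative Python sketch only: the `LaneRow` shape, the outcome labels, and the `example-failed-lane` row are hypothetical and are not taken from the harness.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical outcome label for a row scored by the official evaluator.
SCORED = "scored"


@dataclass(frozen=True)
class LaneRow:
    lane: str
    # "scored", or a failure label such as "setup_failure", "tool_error",
    # "empty_prediction", or "judge_failure" (labels are assumptions).
    outcome: str
    file_coverage: Optional[float] = None
    file_precision: Optional[float] = None


def split_rows(rows: List[LaneRow]) -> Tuple[List[LaneRow], List[LaneRow]]:
    """Separate quality rows from reliability outcomes without dropping any."""
    quality = [row for row in rows if row.outcome == SCORED]
    reliability = [row for row in rows if row.outcome != SCORED]
    return quality, reliability


rows = [
    LaneRow("raw-native", SCORED, 0.222, 0.667),        # values from the table
    LaneRow("codebase-context", SCORED, 0.889, 1.000),  # values from the table
    LaneRow("example-failed-lane", "setup_failure"),    # hypothetical row
]
quality_rows, reliability_rows = split_rows(rows)
```

Keeping the failed row in a separate list, rather than assigning it a zero score, is what prevents a setup failure from being read as a quality result.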
52+
53+
### Cost And Telemetry
54+
55+
The report separates setup, indexing, query, selector, evaluator, and row-wall timing from quality. It also reports candidate counts, candidate token estimates when available, prediction token estimates, and selector token telemetry fields.
56+
57+
`n/a` means the measurement was explicitly unavailable, not zero and not a hidden failure. Current gaps are:
58+
59+
- Selector wall-clock and provider token telemetry were not captured in this proof artifact.
60+
- `raw-native`, `codebase-context`, and `codebase-memory-mcp` emitted candidate counts but not candidate-pack bytes.
61+
- `codebase-context` readiness did not emit index/query duration in the source artifact.
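
One way to keep `n/a` distinct from zero is to make every telemetry field carry either a value or an explicit reason, never both and never neither. The Python sketch below illustrates that idea under assumed names; `Measurement` and its fields are hypothetical and do not reflect the report's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """A telemetry field that is either measured or explicitly unavailable."""

    value: Optional[float] = None
    unavailable_reason: Optional[str] = None

    def __post_init__(self) -> None:
        has_value = self.value is not None
        has_reason = bool(self.unavailable_reason)
        # Exactly one of the two must be present; a silent gap is rejected.
        if has_value == has_reason:
            raise ValueError("provide exactly one of value or unavailable_reason")

    def render(self) -> str:
        """Render for a report table: a real zero stays '0', a gap is 'n/a'."""
        return "n/a" if self.value is None else f"{self.value:g}"


measured_zero = Measurement(value=0.0)
not_captured = Measurement(unavailable_reason="selector telemetry not captured")
```

With this shape, a measured zero and a missing measurement render differently, and a field with no value and no reason cannot be constructed at all.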
62+
63+
### Bias Controls
64+
65+
The generated `publishable-validation.json` must pass these checks before the report is treated as evidence:
66+
67+
- Quality rows come only from the official ContextBench evaluator.
68+
- Failed or unscoreable rows stay out of the quality table.
69+
- All required lanes are scoreable.
70+
- Setup, index, and query costs are separate from quality.
71+
- Timing and token fields exist or carry explicit unavailable reasons.
72+
- The protocol is frozen and `claimAllowed` is `false`.
73+
- The task manifest attests that lane outputs were not observed during task selection.
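
Several of these checks are mechanical and can be sketched as a small gate function. The Python below is a hedged illustration, not the repository's validator: apart from `claimAllowed` and the lane names, every key (`scoredLanes`, `protocolFrozen`, `laneOutputsObservedDuringSelection`) is a hypothetical stand-in for whatever `publishable-validation.json` actually records.

```python
from typing import Dict, List

# Required lanes as listed in the pilot scope.
REQUIRED_LANES = (
    "raw-native",
    "codebase-context",
    "codebase-memory-mcp",
    "grepai",
    "ripgrep-lexical",
)


def failed_checks(validation: Dict[str, object]) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures: List[str] = []

    scored = set(validation.get("scoredLanes", []))  # hypothetical key
    missing = [lane for lane in REQUIRED_LANES if lane not in scored]
    if missing:
        failures.append("required lanes not scoreable: " + ", ".join(missing))

    if validation.get("protocolFrozen") is not True:  # hypothetical key
        failures.append("protocol is not frozen")

    # `claimAllowed` must be explicitly false; a missing field also fails.
    if validation.get("claimAllowed") is not False:
        failures.append("claimAllowed must be false")

    # Hypothetical key for the manifest's blind-selection attestation.
    if validation.get("laneOutputsObservedDuringSelection") is not False:
        failures.append("task manifest does not attest blind task selection")

    return failures
```

The identity comparisons (`is not True`, `is not False`) are deliberate: a missing or null field counts as a failure rather than being coerced into a pass.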
74+
75+
### What This Supports
76+
77+
- It supports saying that the benchmark harness can produce real official ContextBench scores across five lanes.
78+
- It supports saying that setup/index/query cost and context/token cost are now tracked separately from quality.
79+
- It supports saying that the one-task pilot found a strong `codebase-context` result on this specific task.
80+
81+
### What This Does Not Support
82+
83+
- It does not support claiming that `codebase-context` beats competitors overall.
84+
- It does not support claiming patch correctness or productivity improvements.
85+
- It does not replace the full frozen 20-task, repeated-run benchmark required for claim-bearing results.
86+
87+
## Discovery Benchmark
88+
89+
This section documents the current public discovery proof from the checked-in result artifacts on `master`.
490
It is a discovery benchmark, not an implementation-quality benchmark.
591

6-
## Scope
92+
## Discovery Scope
793

894
- Frozen fixtures:
995
- `tests/fixtures/discovery-angular-spotify.json`
@@ -17,7 +103,7 @@ It is a discovery benchmark, not an implementation-quality benchmark.
17103
- Comparator evidence:
18104
- `results/comparator-evidence.json`
19105

20-
## How To Reproduce
106+
## Discovery Reproduction
21107

22108
Run the repo-local proof artifacts from the current `master` checkout:
23109

@@ -28,7 +114,7 @@ node scripts/benchmark-comparators.mjs --repos repos/angular-spotify,repos/excal
28114
node scripts/run-eval.mjs repos/angular-spotify repos/excalidraw --mode=discovery --fixture-a=tests/fixtures/discovery-angular-spotify.json --fixture-b=tests/fixtures/discovery-excalidraw.json --competitor-results=results/comparator-evidence.json --skip-reindex --output=results/gate-evaluation.json
29115
```
30116

31-
## Current Result
117+
## Discovery Current Result
32118

33119
From `results/gate-evaluation.json`:
34120

@@ -47,7 +133,7 @@ Repo-level outputs from the same rerun:
47133
| `angular-spotify` | 12 | 0.8333 | 2138.4167 | 0.25 |
48134
| `excalidraw` | 12 | 0.6667 | 1506.0833 | 0 |
49135

50-
## Gate Truth
136+
## Discovery Gate Truth
51137

52138
The gate is intentionally still blocked.
53139

@@ -58,7 +144,7 @@ The gate is intentionally still blocked.
58144
- `codebase-memory-mcp` now has real current metrics, but the gate still marks it `failed` on the frozen tolerance rule.
59145
- Three comparator lanes still fail setup entirely: `GrepAI`, `jCodeMunch`, and `CodeGraphContext`.
60146

61-
## Comparator Reality
147+
## Discovery Comparator Reality
62148

63149
The current comparator artifact records incomplete comparator evidence, not benchmark wins.
64150

@@ -72,15 +158,15 @@ The current comparator artifact records incomplete comparator evidence, not benc
72158

73159
`CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
74160

75-
## Important Limitations
161+
## Discovery Important Limitations
76162

77163
- This benchmark measures discovery usefulness and payload cost only.
78164
- It does not measure implementation correctness, patch quality, or end-to-end task completion.
79165
- Comparator setup remains environment-sensitive, and the checked-in comparator outputs still do not satisfy the frozen claim gate.
80166
- The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness.
81167
- `averageFirstRelevantHit` remains `null` in the current gate output, which is enough to keep the raw-Claude baseline in `pending_evidence`.
82168

83-
## What This Proof Can Support
169+
## Discovery Claims Supported
84170

85171
- It can support claims about the shipped discovery surfaces and their current measured outputs on the frozen public tasks.
86172
- It can support claims that the proof gate is still blocked by comparator evidence.
