docs/benchmark.md
+95 -9 (95 additions & 9 deletions)
@@ -1,9 +1,95 @@
-# Discovery Benchmark
+# Benchmarks

-This page documents the current public discovery proof from the checked-in result artifacts on `master`.
+This page tracks two separate benchmark surfaces:
+
+- The ContextBench implementation pilot, which uses the official ContextBench evaluator on one frozen task and five scoreable lanes.
+- The older discovery benchmark, which measures local discovery usefulness and payload cost only.
+
+Neither section currently supports a broad benchmark-win claim.
+
+## ContextBench Implementation Pilot
+
+This is the current implementation-quality pilot for ContextBench. It is real scoreable evidence, but it is still a pilot because it covers one frozen task rather than the full frozen 20-task slice.
+- Base commit: `537e2fc033b71a4a69190b74f755ebc352bb4196`
+
+`CodeGraphContext` is not counted in this five-lane pilot because its supported CLI path indexed successfully but returned zero task-relevant candidates during readiness. That remains a readiness blocker, not a quality result.
+The artifact contains `summary.json`, `publishable-summary.json`, `publishable-validation.json`, `humanized-summary.md`, logs, lane selections, lane predictions, and official evaluator score files. It intentionally excludes full cloned repos and evaluator caches so the evidence package is small enough to inspect.
+
+### Quality Results
+
+Only rows scored by the official ContextBench evaluator are included here. Setup failures, tool errors, empty predictions, and judge failures are reliability outcomes, not quality rows.
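As a reading aid for the coverage and precision columns reported for quality rows, here is a minimal sketch that treats both metrics as set overlap between a lane's predictions and the gold context. This is an assumption for illustration only: the official ContextBench evaluator may define these metrics differently, and the function name and file paths below are hypothetical.

```python
def coverage_and_precision(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Return (coverage, precision) of a predicted set against a gold set.

    Assumed definitions (not taken from the official evaluator):
      coverage  = |predicted ∩ gold| / |gold|
      precision = |predicted ∩ gold| / |predicted|
    """
    overlap = len(predicted & gold)
    coverage = overlap / len(gold) if gold else 0.0
    # An empty prediction is a reliability outcome in this report, so a row
    # like that would be excluded before scoring; 0.0 here only avoids a
    # division by zero.
    precision = overlap / len(predicted) if predicted else 0.0
    return coverage, precision


# Hypothetical file-level example: one of two gold files was found,
# and one of four predicted files was relevant.
gold_files = {"src/a.ts", "src/b.ts"}
predicted_files = {"src/a.ts", "src/c.ts", "src/d.ts", "src/e.ts"}
file_cov, file_prec = coverage_and_precision(predicted_files, gold_files)
```

Under these assumed definitions, the same function could be applied to span or line identifiers instead of file paths.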
+
+| Lane | File cov | File prec | Span cov | Span prec | Line cov | Line prec |
+The report separates setup, indexing, query, selector, evaluator, and row-wall timing from quality. It also reports candidate counts, candidate token estimates when available, prediction token estimates, and selector token telemetry fields.
+
+`n/a` means the measurement was explicitly unavailable, not zero and not a hidden failure. Current gaps are:
+
+- Selector wall-clock and provider token telemetry were not captured in this proof artifact.
+- `raw-native`, `codebase-context`, and `codebase-memory-mcp` emitted candidate counts but not candidate-pack bytes.
+- `codebase-context` readiness did not emit index/query duration in the source artifact.
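The distinction between `n/a` and zero can be sketched as a small value type that holds either a number or an explicit reason, never both and never neither. The class and field names below are hypothetical and do not reflect the actual artifact schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """A metric that is either measured or explicitly unavailable (hypothetical type)."""

    value: Optional[float] = None
    unavailable_reason: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one of the two fields must be set, so a missing value can
        # never be silently mistaken for zero or for a hidden failure.
        if (self.value is None) == (self.unavailable_reason is None):
            raise ValueError("set exactly one of value or unavailable_reason")

    def render(self) -> str:
        """Render the way a report cell would: a number, or 'n/a'."""
        return "n/a" if self.value is None else f"{self.value:g}"


measured_zero = Measurement(value=0.0)
not_captured = Measurement(unavailable_reason="not captured in this proof artifact")
```

With this shape, `measured_zero.render()` and `not_captured.render()` produce different cells, and the reason string stays attached to the gap.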
+
+### Bias Controls
+
+The generated `publishable-validation.json` must pass these checks before the report is treated as evidence:
+
+- Quality rows come only from the official ContextBench evaluator.
+- Failed or unscoreable rows stay out of the quality table.
+- All required lanes are scoreable.
+- Setup, index, and query costs are separate from quality.
+- Timing and token fields exist or carry explicit unavailable reasons.
+- The protocol is frozen and `claimAllowed` is `false`.
+- The task manifest attests that lane outputs were not observed during task selection.
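A gate over the checks listed above could look roughly like the sketch below. Only the `claimAllowed` field name comes from this page; the `checks` mapping and its key names are assumptions made for illustration, since the real layout of `publishable-validation.json` is not shown here.

```python
def failed_checks(validation: dict) -> list[str]:
    """Return the names of checks that block treating the report as evidence.

    Assumes a hypothetical layout: {"checks": {name: bool, ...}, "claimAllowed": bool}.
    """
    checks = validation.get("checks", {})
    # Anything other than an explicit True counts as a failure.
    failures = [name for name, passed in checks.items() if passed is not True]
    # The pilot must not permit claims, so claimAllowed has to be exactly False.
    if validation.get("claimAllowed") is not False:
        failures.append("claimAllowed must be false")
    return failures


# Hypothetical validation payload with one failing check.
example = {
    "checks": {
        "qualityRowsFromOfficialEvaluator": True,
        "allRequiredLanesScoreable": False,
    },
    "claimAllowed": False,
}
blocking = failed_checks(example)
```

An empty list would mean every listed check passed; any entry names the reason the report should not yet be treated as evidence.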
+
+### What This Supports
+
+- It supports saying that the benchmark harness can produce real official ContextBench scores across five lanes.
+- It supports saying that setup/index/query cost and context/token cost are now tracked separately from quality.
+- It supports saying that the one-task pilot found a strong `codebase-context` result on this specific task.
+
+### What This Does Not Support
+
+- It does not support claiming that `codebase-context` beats competitors overall.
+- It does not support claiming patch correctness or productivity improvements.
+- It does not replace the full frozen 20-task, repeated-run benchmark required for claim-bearing results.
+
+## Discovery Benchmark
+
+This section documents the current public discovery proof from the checked-in result artifacts on `master`.
 It is a discovery benchmark, not an implementation-quality benchmark.

-## Scope
+## Discovery Scope

 - Frozen fixtures:
 - `tests/fixtures/discovery-angular-spotify.json`
@@ -17,7 +103,7 @@ It is a discovery benchmark, not an implementation-quality benchmark.
 - Comparator evidence:
 - `results/comparator-evidence.json`

-## How To Reproduce
+## Discovery Reproduction

 Run the repo-local proof artifacts from the current `master` checkout:
@@ -58,7 +144,7 @@ The gate is intentionally still blocked.
 - `codebase-memory-mcp` now has real current metrics, but the gate still marks it `failed` on the frozen tolerance rule
 - Three comparator lanes still fail setup entirely: `GrepAI`, `jCodeMunch`, and `CodeGraphContext`.

-## Comparator Reality
+## Discovery Comparator Reality

 The current comparator artifact records incomplete comparator evidence, not benchmark wins.

@@ -72,15 +158,15 @@ The current comparator artifact records incomplete comparator evidence, not benc

 `CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.

-## Important Limitations
+## Discovery Important Limitations

 - This benchmark measures discovery usefulness and payload cost only.
 - It does not measure implementation correctness, patch quality, or end-to-end task completion.
 - Comparator setup remains environment-sensitive, and the checked-in comparator outputs still do not satisfy the frozen claim gate.
 - The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness.
 - `averageFirstRelevantHit` remains `null` in the current gate output, which is enough to keep the raw-Claude baseline in `pending_evidence`.

-## What This Proof Can Support
+## Discovery Claims Supported

 - It can support claims about the shipped discovery surfaces and their current measured outputs on the frozen public tasks.
 - It can support claims that the proof gate is still blocked by comparator evidence.