Commit 8caa3ef

docs: correct org retrieval breakdown path normalization
1 parent a9b7e8b commit 8caa3ef

2 files changed: +28 -25 lines changed

docs/codescalebench_blog_v1.md

Lines changed: 14 additions & 13 deletions
@@ -87,36 +87,38 @@ Scored tasks in this slice:
8787

8888
| Group | n | P@5 (BL/MCP) | R@5 (BL/MCP) | F1@5 (BL/MCP) | P@10 (BL/MCP) | R@10 (BL/MCP) | F1@10 (BL/MCP) | Total File Recall (BL/MCP) |
8989
|---|---:|---|---|---|---|---|---|---|
90-
| Org | 206 | 0.000 / 0.365 | 0.000 / 0.262 | 0.000 / 0.275 | 0.001 / 0.245 | 0.001 / 0.314 | 0.001 / 0.246 | 0.001 / 0.322 |
91-
| SDLC | 123 | 0.361 / 0.455 | 0.272 / 0.373 | 0.268 / 0.350 | 0.242 / 0.293 | 0.327 / 0.431 | 0.239 / 0.297 | 0.345 / 0.438 |
92-
| Combined | 329 | 0.135 / 0.399 | 0.102 / 0.304 | 0.100 / 0.303 | 0.091 / 0.263 | 0.123 / 0.358 | 0.090 / 0.265 | 0.129 / 0.365 |
90+
| Org | 206 | 0.007 / 0.471 | 0.002 / 0.149 | 0.003 / 0.206 | 0.007 / 0.311 | 0.004 / 0.177 | 0.005 / 0.200 | 0.005 / 0.182 |
91+
| SDLC | 123 | 0.364 / 0.489 | 0.261 / 0.371 | 0.260 / 0.356 | 0.243 / 0.318 | 0.314 / 0.430 | 0.234 / 0.308 | 0.331 / 0.437 |
92+
| Combined | 329 | 0.140 / 0.478 | 0.099 / 0.232 | 0.099 / 0.262 | 0.095 / 0.313 | 0.120 / 0.272 | 0.091 / 0.240 | 0.127 / 0.277 |
9393

9494
### Pre-existing GT (`ground_truth.json` / `oracle_answer.json`)
9595

9696
| Group | n | P@5 (BL/MCP) | R@5 (BL/MCP) | F1@5 (BL/MCP) | P@10 (BL/MCP) | R@10 (BL/MCP) | F1@10 (BL/MCP) | Total File Recall (BL/MCP) |
9797
|---|---:|---|---|---|---|---|---|---|
98-
| Org | 206 | 0.000 / 0.122 | 0.000 / 0.121 | 0.000 / 0.113 | 0.000 / 0.074 | 0.000 / 0.137 | 0.000 / 0.090 | 0.000 / 0.139 |
99-
| SDLC | 123 | 0.296 / 0.379 | 0.288 / 0.405 | 0.262 / 0.347 | 0.192 / 0.231 | 0.335 / 0.458 | 0.216 / 0.274 | 0.347 / 0.471 |
100-
| Combined | 329 | 0.111 / 0.218 | 0.108 / 0.227 | 0.098 / 0.200 | 0.072 / 0.133 | 0.125 / 0.257 | 0.081 / 0.159 | 0.130 / 0.263 |
98+
| Org | 206 | 0.011 / 0.199 | 0.005 / 0.089 | 0.006 / 0.114 | 0.008 / 0.124 | 0.008 / 0.104 | 0.007 / 0.105 | 0.008 / 0.106 |
99+
| SDLC | 123 | 0.301 / 0.397 | 0.296 / 0.410 | 0.266 / 0.354 | 0.194 / 0.241 | 0.343 / 0.463 | 0.218 / 0.280 | 0.356 / 0.476 |
100+
| Combined | 329 | 0.119 / 0.273 | 0.114 / 0.209 | 0.103 / 0.204 | 0.078 / 0.167 | 0.133 / 0.238 | 0.086 / 0.170 | 0.138 / 0.244 |
101101

102102
## MCP Value Highlights from the New Retrieval Slices
103103

104104
### 1) Multi-repo tasks benefit more than single-repo tasks
105105

106106
Curated GT deltas (`MCP - baseline`, combined):
107-
- `single_repo` (n=159): **F1@10 +0.1075**, **Total Recall +0.1658**
108-
- `multi_repo` (n=170): **F1@10 +0.2387**, **Total Recall +0.3017**
107+
- `single_repo` (n=158): **F1@10 +0.0853**, **Total Recall +0.1119**
108+
- `multi_repo` (n=171): **F1@10 +0.2089**, **Total Recall +0.1862**
109109

110110
### 2) Gains persist across size bins, with strongest lift in 1M-5M proxy bucket
111111

112112
Curated GT deltas (`MCP - baseline`):
113-
- `<1M`: F1@10 +0.1047, Total +0.1736
114-
- `1M-5M`: F1@10 +0.3417, Total +0.4148
115-
- `5M-20M`: F1@10 +0.0696, Total +0.0960
116-
- `>20M`: F1@10 +0.1653, Total +0.2104
113+
- `<1M`: F1@10 +0.1007, Total +0.1318
114+
- `1M-5M`: F1@10 +0.2680, Total +0.2392
115+
- `5M-20M`: F1@10 +0.0648, Total +0.0565
116+
- `>20M`: F1@10 +0.1247, Total +0.1075
117117

118118
Interpretation: retrieval lift is not uniform, but MCP shows clear upside where task context is more distributed and retrieval-heavy.
119119

120+
Method note: I corrected an Org path-normalization bug from an earlier draft, in which some baseline paths failed to match because of path-shape differences (for example `repo/repo/path` vs `repo/path`).
121+
120122
## Cost and Speed
121123

122124
Current paired means:
@@ -170,4 +172,3 @@ Planned next steps:
170172
3. Compare alternate MCP providers on the same task set.
171173
4. Run tool-policy experiments (especially semantic/deep-search nudges).
172174
5. Continue tightening verifier and QA infrastructure before final white paper publication.
173-

docs/technical_reports/TECHNICAL_REPORT_V2.md

Lines changed: 14 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -957,6 +957,8 @@ To isolate retrieval quality effects on the currently curated task set, we recom
957957
958958
Output artifact: `results/ir/baseline_vs_mcp_breakdown_org_sdlc_runs_analysis_20260304.json`.
959959
960+
Correction note: an earlier draft of this subsection undercounted Org baseline matches due to path-shape normalization differences (for example `repo/repo/path` vs `repo/path`). Numbers below use corrected canonical exact matching.
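
To make the path-shape issue concrete, here is a minimal sketch of canonical exact matching. It is not the repository's actual normalizer: the function names `canonicalize` and `exact_matches`, and the rule of dropping a leading segment when it is immediately repeated, are assumptions chosen only to illustrate the `repo/repo/path` vs `repo/path` case.

```python
def canonicalize(path: str) -> str:
    """Return a canonical, slash-separated form of a retrieved or ground-truth path."""
    # Normalize separators and drop empty or "." segments.
    parts = [p for p in path.replace("\\", "/").split("/") if p not in ("", ".")]
    # Hypothetical rule (an assumption): collapse a duplicated leading repo
    # segment, so "repo/repo/path" and "repo/path" compare equal.
    if len(parts) >= 2 and parts[0] == parts[1]:
        parts = parts[1:]
    return "/".join(parts)


def exact_matches(retrieved: list[str], ground_truth: list[str]) -> set[str]:
    """Exact matching after both sides go through the same canonicalization."""
    gt = {canonicalize(p) for p in ground_truth}
    return {canonicalize(p) for p in retrieved} & gt


if __name__ == "__main__":
    hits = exact_matches(["repo/repo/src/a.py", "repo/src/b.py"], ["repo/src/a.py"])
    print(sorted(hits))
```

Both sides pass through the same function, so a match never depends on which shape a given run emitted. One caveat of this simple rule: a repository whose top-level directory shares the repo name (`a/a/x`) is collapsed onto `a/x`, so a real implementation would likely need the known repo names rather than a purely structural check.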
961+
960962
Coverage in this slice:
961963
- Scored task pairs: **329** (`org=206`, `sdlc=123`)
962964
- Metrics shown: Precision@5, Recall@5, F1@5, Precision@10, Recall@10, F1@10, and full-set `total_file_recall`
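
For readers unfamiliar with the metrics listed above, the following sketch shows one common way to compute them for a single task. It assumes a ranked list of retrieved file paths and a set of ground-truth files; the function names are hypothetical, and the choice to divide precision by `k` (rather than by the number of items actually returned) is an assumption, not something the report specifies.

```python
def metrics_at_k(ranked: list[str], relevant: set[str], k: int) -> tuple[float, float, float]:
    """Return (precision@k, recall@k, F1@k) for one task."""
    # Deduplicate while preserving rank order, then keep the top k.
    seen: set[str] = set()
    unique = [p for p in ranked if not (p in seen or seen.add(p))]
    top_k = unique[:k]

    hits = len(set(top_k) & relevant)
    precision = hits / k  # assumption: denominator is k, not len(top_k)
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def total_file_recall(ranked: list[str], relevant: set[str]) -> float:
    """Fraction of ground-truth files found anywhere in the full retrieved set."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]
    relevant = {"a", "c", "f", "z"}
    print(metrics_at_k(ranked, relevant, k=5))
    print(total_file_recall(ranked, relevant))
```

Group-level figures such as those in the tables would then be means of these per-task values, again under the assumption that the report averages per task.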
@@ -965,9 +967,9 @@ Coverage in this slice:
965967
966968
| Group | n | P@5 (BL / MCP) | R@5 (BL / MCP) | F1@5 (BL / MCP) | P@10 (BL / MCP) | R@10 (BL / MCP) | F1@10 (BL / MCP) | Total File Recall (BL / MCP) |
967969
|-------|---:|----------------|----------------|-----------------|-----------------|-----------------|------------------|-------------------------------|
968-
| Org | 206 | 0.000 / 0.365 | 0.000 / 0.262 | 0.000 / 0.275 | 0.001 / 0.245 | 0.001 / 0.314 | 0.001 / 0.246 | 0.001 / 0.322 |
969-
| SDLC | 123 | 0.361 / 0.455 | 0.272 / 0.373 | 0.268 / 0.350 | 0.242 / 0.293 | 0.327 / 0.431 | 0.239 / 0.297 | 0.345 / 0.438 |
970-
| Combined | 329 | 0.135 / 0.399 | 0.102 / 0.304 | 0.100 / 0.303 | 0.091 / 0.263 | 0.123 / 0.358 | 0.090 / 0.265 | 0.129 / 0.365 |
970+
| Org | 206 | 0.007 / 0.471 | 0.002 / 0.149 | 0.003 / 0.206 | 0.007 / 0.311 | 0.004 / 0.177 | 0.005 / 0.200 | 0.005 / 0.182 |
971+
| SDLC | 123 | 0.364 / 0.489 | 0.261 / 0.371 | 0.260 / 0.356 | 0.243 / 0.318 | 0.314 / 0.430 | 0.234 / 0.308 | 0.331 / 0.437 |
972+
| Combined | 329 | 0.140 / 0.478 | 0.099 / 0.232 | 0.099 / 0.262 | 0.095 / 0.313 | 0.120 / 0.272 | 0.091 / 0.240 | 0.127 / 0.277 |
971973
972974
Key interpretation:
973975
- MCP improves retrieval substantially on both benchmark families in the curated set.
@@ -977,27 +979,27 @@ Key interpretation:
977979
978980
| Group | n | P@5 (BL / MCP) | R@5 (BL / MCP) | F1@5 (BL / MCP) | P@10 (BL / MCP) | R@10 (BL / MCP) | F1@10 (BL / MCP) | Total File Recall (BL / MCP) |
979981
|-------|---:|----------------|----------------|-----------------|-----------------|-----------------|------------------|-------------------------------|
980-
| Org | 206 | 0.000 / 0.122 | 0.000 / 0.121 | 0.000 / 0.113 | 0.000 / 0.074 | 0.000 / 0.137 | 0.000 / 0.090 | 0.000 / 0.139 |
981-
| SDLC | 123 | 0.296 / 0.379 | 0.288 / 0.405 | 0.262 / 0.347 | 0.192 / 0.231 | 0.335 / 0.458 | 0.216 / 0.274 | 0.347 / 0.471 |
982-
| Combined | 329 | 0.111 / 0.218 | 0.108 / 0.227 | 0.098 / 0.200 | 0.072 / 0.133 | 0.125 / 0.257 | 0.081 / 0.159 | 0.130 / 0.263 |
982+
| Org | 206 | 0.011 / 0.199 | 0.005 / 0.089 | 0.006 / 0.114 | 0.008 / 0.124 | 0.008 / 0.104 | 0.007 / 0.105 | 0.008 / 0.106 |
983+
| SDLC | 123 | 0.301 / 0.397 | 0.296 / 0.410 | 0.266 / 0.354 | 0.194 / 0.241 | 0.343 / 0.463 | 0.218 / 0.280 | 0.356 / 0.476 |
984+
| Combined | 329 | 0.119 / 0.273 | 0.114 / 0.209 | 0.103 / 0.204 | 0.078 / 0.167 | 0.133 / 0.238 | 0.086 / 0.170 | 0.138 / 0.244 |
983985
984986
#### Correlation Slices: Multi-Repo and Size Effects
985987
986988
On curated ground truth (`MCP - Baseline`, combined):
987989
988990
| Slice | n | Δ F1@10 | Δ Total File Recall |
989991
|-------|---:|--------:|--------------------:|
990-
| single_repo | 159 | +0.1075 | +0.1658 |
991-
| multi_repo | 170 | +0.2387 | +0.3017 |
992+
| single_repo | 158 | +0.0853 | +0.1119 |
993+
| multi_repo | 171 | +0.2089 | +0.1862 |
992994
993995
Curated size-bin deltas (`MCP - Baseline`):
994996
995997
| Size Bin (proxy) | n | Δ F1@10 | Δ Total File Recall |
996998
|------------------|---:|--------:|--------------------:|
997-
| <1M | 144 | +0.1047 | +0.1736 |
998-
| 1M-5M | 99 | +0.3417 | +0.4148 |
999-
| 5M-20M | 57 | +0.0696 | +0.0960 |
1000-
| >20M | 29 | +0.1653 | +0.2104 |
999+
| <1M | 139 | +0.1007 | +0.1318 |
1000+
| 1M-5M | 104 | +0.2680 | +0.2392 |
1001+
| 5M-20M | 57 | +0.0648 | +0.0565 |
1002+
| >20M | 29 | +0.1247 | +0.1075 |
10011003
10021004
These slices indicate MCP retrieval gains are larger on multi-repo tasks than single-repo tasks in this snapshot.
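
As a sanity check on the corrected slice tables, the sketch below recombines the per-slice deltas by `n` and compares the result with the Combined row of the curated table (F1@10 0.091 / 0.240, Total File Recall 0.127 / 0.277). It assumes each slice delta is a per-task mean, so the `n`-weighted mean of the slices should reproduce the combined delta up to rounding; the report does not state this explicitly, and the helper name `weighted_mean` is illustrative.

```python
def weighted_mean(pairs: list[tuple[int, float]]) -> float:
    """n-weighted mean of (n, value) pairs."""
    total_n = sum(n for n, _ in pairs)
    return sum(n * v for n, v in pairs) / total_n


# (n, delta F1@10, delta Total File Recall), copied from the corrected tables.
repo_slices = [(158, 0.0853, 0.1119), (171, 0.2089, 0.1862)]
size_bins = [(139, 0.1007, 0.1318), (104, 0.2680, 0.2392),
             (57, 0.0648, 0.0565), (29, 0.1247, 0.1075)]

# Combined-row deltas (MCP - baseline) from the curated table.
combined_f1 = 0.240 - 0.091
combined_recall = 0.277 - 0.127

if __name__ == "__main__":
    for name, rows in (("repo slices", repo_slices), ("size bins", size_bins)):
        f1 = weighted_mean([(n, d) for n, d, _ in rows])
        rec = weighted_mean([(n, d) for n, _, d in rows])
        print(f"{name}: n={sum(n for n, _, _ in rows)} "
              f"dF1@10={f1:.4f} (combined {combined_f1:.3f}) "
              f"dRecall={rec:.4f} (combined {combined_recall:.3f})")
```

Both slicings sum to 329 tasks and land within rounding distance of the combined deltas, which is the agreement one would expect if the slices partition the same task set.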
10031005
