
Commit d600dda

Refresh size-proxy analysis and variance tables in report/blog

1 parent 84a503d commit d600dda

4 files changed: +77 −13 lines

docs/BLOG_POST.md

Lines changed: 16 additions & 12 deletions
@@ -118,23 +118,27 @@ Context retrieval isn't the bottleneck for every software development situation.
 
 ## MCP Value Scales With Codebase Size
 
-Note: the detailed repository-size table below is from an earlier V2 slice and is pending full re-extraction for the refreshed March 3 analysis snapshot.
+For the fully refreshed pass, I used task-level size proxies that are present for this dataset (`context_length` and `files_count`), with multi-run averages per task/config:
 
-One of the clearest patterns in the data: MCP's benefit increases monotonically with codebase size. We pulled repo sizes from the GitHub API for 365 of the 370 tasks and grouped them into bins:
+| Context Size Proxy | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
+|--------------------|---|---------|----------|----------|---------------|
+| <100K tokens | 222 | 0.400 | 0.433 | +0.034 | 0.026862 |
+| 100K–1M tokens | 98 | 0.639 | 0.670 | +0.031 | 0.093518 |
+| unknown | 50 | 0.523 | 0.571 | +0.048 | 0.059717 |
 
-| Repo Size | Approx LoC | n | Δ Reward | Δ Wall Clock | Δ Agent Exec | Δ $/task |
-|-----------|-----------|---|----------|-------------|-------------|---------|
-| <10 MB | <400K | 60 | −0.007 | −23s | −52s | +$1.39 |
-| 10–50 MB | 0.4–2M | 61 | **+0.043** | **−153s** | **−137s** | **−$2.74** |
-| 50–200 MB | 2–8M | 113 | +0.027 | +64s | −51s | +$0.03 |
-| 200MB–1GB | 8–40M | 104 | +0.033 | −13s | −97s | +$0.02 |
-| >1 GB | >40M | 27 | **+0.085** | −135s | −123s | +$0.06 |
+And by file-count bins:
 
-For the smallest repos, MCP slightly hurts reward and adds cost. At 10–50 MB you hit the sweet spot: better outcomes, much faster, and $2.74/task cheaper. Above 1 GB the reward lift is largest (+0.085), which lines up with what you'd expect — massive monorepo-scale codebases like Kubernetes and Chromium are exactly where retrieval tools shine because the agent can't feasibly grep through tens of millions of lines. Agent execution time is shorter with MCP across *every* size category.
+| Files Count Bin | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
+|----------------|---|---------|----------|----------|---------------|
+| <10 | 168 | 0.327 | 0.375 | +0.048 | 0.032454 |
+| 10–100 | 91 | 0.676 | 0.699 | +0.023 | 0.097068 |
+| unknown | 111 | 0.550 | 0.575 | +0.025 | 0.034117 |
 
-Breaking it down by difficulty: hard tasks (91% of the benchmark) show the best MCP profile — better reward (+0.023), faster (−58s wall clock, −95s agent), and cheaper (−$0.42/task). Expert tasks show a slight negative reward delta (−0.019) and much higher cost (+$3.01), suggesting that at the highest complexity tier the agent burns tokens on MCP searches that don't pay off.
+So in this refreshed slice, the MCP reward delta is positive across all available size-proxy bins.
 
-By language, Go repos see the biggest cost savings (−$1.18/task, n=134), Rust sees the biggest wall-clock savings (−358s, n=12), and Python gets the best reward lift (+0.040, n=55). TypeScript is the only language where MCP hurts across all dimensions, though n=7 is too small to draw strong conclusions.
+Breaking it down by difficulty (with variance): hard tasks remain positive (+0.038, var 0.046768), medium tasks are the most positive (+0.115, var 0.053039), and expert tasks remain negative (−0.057, var 0.070557).
+
+By language, the largest positive reward deltas are JavaScript (+0.135, n=8), Python (+0.070, n=55), and Go (+0.052, n=134). TypeScript is still the strongest negative outlier (−0.140, n=7).
 
 ## Retrieval Differences
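The paired-delta computation behind these tables can be sketched in a few lines. This is a hedged illustration, not the benchmark's actual analysis code: `size_bin`, `summarize`, and the task tuples below are hypothetical stand-ins, assuming per-task multi-run reward averages for both configs are already in hand.

```python
# Sketch of the binning described above: per-task paired baseline/MCP
# rewards, grouped by a size proxy, with the mean delta and the sample
# variance of the delta per bin. Task records and bin edges here are
# illustrative, not the real dataset.
from statistics import mean, variance

def size_bin(context_length):
    """Map a task's context_length (tokens, or None if missing) to a bin."""
    if context_length is None:
        return "unknown"
    return "<100K tokens" if context_length < 100_000 else "100K-1M tokens"

def summarize(pairs):
    """pairs: list of (baseline_reward, mcp_reward) per task."""
    deltas = [mcp - bl for bl, mcp in pairs]
    return {
        "n": len(pairs),
        "bl_mean": round(mean(bl for bl, _ in pairs), 3),
        "mcp_mean": round(mean(mcp for _, mcp in pairs), 3),
        "mean_delta": round(mean(deltas), 3),
        # sample variance is undefined for a single task
        "var_delta": round(variance(deltas), 6) if len(deltas) > 1 else None,
    }

# Illustrative tasks: (context_length, baseline_reward, mcp_reward)
tasks = [
    (40_000, 0.2, 0.4), (80_000, 0.5, 0.5),
    (300_000, 0.7, 0.8), (900_000, 0.6, 0.6),
    (None, 0.4, 0.5), (None, 0.6, 0.6),
]

bins = {}
for ctx, bl, mcp in tasks:
    bins.setdefault(size_bin(ctx), []).append((bl, mcp))

for name in sorted(bins):
    print(name, summarize(bins[name]))
```

Note the use of `statistics.variance` (the n−1 sample variance), which matches reporting a Var(Δ Reward) estimated from a sample of tasks rather than a full population.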

docs/analysis/analysis_refresh_tables_20260303.json

Lines changed: 23 additions & 0 deletions
@@ -410,5 +410,28 @@
       32
     ]
   ]
+  },
+  "paired_by_context_length_bin": {
+    "<100K tokens": {
+      "n": 222,
+      "baseline_reward_mean": 0.3996,
+      "mcp_reward_mean": 0.4332,
+      "mean_reward_delta": 0.0336,
+      "reward_delta_variance": 0.026862
+    },
+    "100K-1M tokens": {
+      "n": 98,
+      "baseline_reward_mean": 0.6394,
+      "mcp_reward_mean": 0.6704,
+      "mean_reward_delta": 0.0311,
+      "reward_delta_variance": 0.093518
+    },
+    "unknown": {
+      "n": 50,
+      "baseline_reward_mean": 0.5232,
+      "mcp_reward_mean": 0.5712,
+      "mean_reward_delta": 0.048,
+      "reward_delta_variance": 0.059717
+    }
 }
 }
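For reference, a minimal sketch of consuming an artifact with this shape to regenerate the markdown tables in the blog and report. The key layout follows the JSON fragment in the diff above; `render_table` is a hypothetical helper, and the inline fragment stands in for loading the committed file so the sketch is self-contained.

```python
import json

def render_table(bins: dict) -> str:
    """Render a paired_by_context_length_bin mapping as a markdown table."""
    header = (
        "| Context Length Bin | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |\n"
        "|--------------------|---|---------|----------|----------|---------------|"
    )
    rows = [
        "| {b} | {n} | {bl:.3f} | {mcp:.3f} | {d:+.3f} | {v:.6f} |".format(
            b=name,
            n=s["n"],
            bl=s["baseline_reward_mean"],
            mcp=s["mcp_reward_mean"],
            d=s["mean_reward_delta"],
            v=s["reward_delta_variance"],
        )
        for name, s in bins.items()
    ]
    return "\n".join([header, *rows])

# In the repo this would come from the committed artifact, e.g.:
#   data = json.load(open("docs/analysis/analysis_refresh_tables_20260303.json"))
#   bins = data["paired_by_context_length_bin"]
# Inline fragment copied from the diff above:
bins = json.loads("""
{
  "<100K tokens": {"n": 222, "baseline_reward_mean": 0.3996,
    "mcp_reward_mean": 0.4332, "mean_reward_delta": 0.0336,
    "reward_delta_variance": 0.026862}
}
""")
print(render_table(bins))
```

Rounding to three decimals for the means and deltas, and six for the variance, reproduces the figures shown in the tables (e.g. 0.3996 → 0.400, 0.0336 → +0.034).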

docs/analysis/analysis_set_metrics_20260303.json

Lines changed: 23 additions & 0 deletions
@@ -410,5 +410,28 @@
       32
     ]
   ]
+  },
+  "paired_by_context_length_bin": {
+    "<100K tokens": {
+      "n": 222,
+      "baseline_reward_mean": 0.3996,
+      "mcp_reward_mean": 0.4332,
+      "mean_reward_delta": 0.0336,
+      "reward_delta_variance": 0.026862
+    },
+    "100K-1M tokens": {
+      "n": 98,
+      "baseline_reward_mean": 0.6394,
+      "mcp_reward_mean": 0.6704,
+      "mean_reward_delta": 0.0311,
+      "reward_delta_variance": 0.093518
+    },
+    "unknown": {
+      "n": 50,
+      "baseline_reward_mean": 0.5232,
+      "mcp_reward_mean": 0.5712,
+      "mean_reward_delta": 0.048,
+      "reward_delta_variance": 0.059717
+    }
 }
 }

docs/technical_reports/TECHNICAL_REPORT_V2.md

Lines changed: 15 additions & 1 deletion
@@ -988,7 +988,21 @@ The benchmark remains dominated by hard tasks. In this refreshed aggregation, ha
 
 ### 11.9 Impact by Codebase Size
 
-Repository-size bins are not available in the refreshed analysis artifact, but file-count bins are:
+Codebase-size analysis in the refreshed pass uses two available proxies:
+1) `context_length` bins from task metadata, and
+2) `files_count` bins.
+
+**By context-length bin:**
+
+| Context Length Bin | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
+|--------------------|---|---------|----------|----------|---------------|
+| <100K tokens | 222 | 0.400 | 0.433 | +0.034 | 0.026862 |
+| 100K-1M tokens | 98 | 0.639 | 0.670 | +0.031 | 0.093518 |
+| unknown | 50 | 0.523 | 0.571 | +0.048 | 0.059717 |
+
+The MCP reward delta is positive across all context-size bins in this refreshed slice.
+
+**By files-count bin:**
 
 | Files Count Bin | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
 |----------------|---|---------|----------|----------|---------------|
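One way to read the Var(Δ Reward) column: combined with n, it gives a rough standard error for each bin's mean delta, SE = sqrt(Var/n). A sketch under the (hedged) assumption that per-task deltas are independent, using the context-length table's own numbers:

```python
from math import sqrt

def stderr(var, n):
    """Standard error of a bin's mean delta, assuming independent per-task deltas."""
    return sqrt(var / n)

# (bin, n, mean_delta, var_delta) taken from the context-length table above
rows = [
    ("<100K tokens", 222, 0.034, 0.026862),
    ("100K-1M tokens", 98, 0.031, 0.093518),
    ("unknown", 50, 0.048, 0.059717),
]

for name, n, d, var in rows:
    print(f"{name}: Δ={d:+.3f}, SE={stderr(var, n):.3f}")
```

With these numbers, the <100K bin's +0.034 sits roughly three standard errors above zero, while the 100K-1M bin's +0.031 is only about one standard error, so the latter estimate is much noisier despite a similar point delta.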
