**docs/BLOG_POST.md** (16 additions, 12 deletions)
```diff
@@ -118,23 +118,27 @@ Context retrieval isn't the bottleneck for every software development situation.
 
 ## MCP Value Scales With Codebase Size
 
-Note: the detailed repository-size table below is from an earlier V2 slice and is pending full re-extraction for the refreshed March 3 analysis snapshot.
+For the fully refreshed pass, I used task-level size proxies that are present for this dataset (`context_length` and `files_count`) with multi-run averages per task/config:
 
-One of the clearest patterns in the data: MCP's benefit increases monotonically with codebase size. We pulled repo sizes from the GitHub API for 365 of the 370 tasks and grouped them into bins:
+| Context Size Proxy | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
-For the smallest repos, MCP slightly hurts reward and adds cost. At 10–50 MB you hit the sweet spot: better outcomes, much faster, and $2.74/task cheaper. Above 1 GB the reward lift is largest (+0.085), which lines up with what you'd expect — massive monorepo-scale codebases like Kubernetes and Chromium are exactly where retrieval tools shine because the agent can't feasibly grep through tens of millions of lines. Agent execution time is shorter with MCP across *every* size category.
+| Files Count Bin | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
-Breaking it down by difficulty: hard tasks (91% of the benchmark) show the best MCP profile — better reward (+0.023), faster (−58s wall clock, −95s agent), and cheaper (−$0.42/task). Expert tasks show a slight negative reward delta (−0.019) and much higher cost (+$3.01), suggesting that at the highest complexity tier the agent burns tokens on MCP searches that don't pay off.
+So in this refreshed slice, MCP reward delta is positive across all available size-proxy bins.
 
-By language, Go repos see the biggest cost savings (−$1.18/task, n=134), Rust sees the biggest wall-clock savings (−358s, n=12), and Python gets the best reward lift (+0.040, n=55). TypeScript is the only language where MCP hurts across all dimensions, though n=7 is too small to draw strong conclusions.
+Breaking it down by difficulty (with variance): hard tasks remain positive (+0.038, var 0.046768), medium tasks are most positive (+0.115, var 0.053039), and expert tasks remain negative (−0.057, var 0.070557).
+
+By language, the largest positive reward deltas are JavaScript (+0.135, n=8), Python (+0.070, n=55), and Go (+0.052, n=134). TypeScript is still the strongest negative outlier (−0.140, n=7).
```
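The added text describes the aggregation pipeline: average reward over multiple runs per task/config, take the MCP-minus-baseline delta per task, then report mean and variance of that delta per size-proxy bin. A minimal pandas sketch of that pipeline is below; every column name, bin edge, and value here is an invented illustration, not the post's actual schema or data.

```python
import pandas as pd

# Hypothetical per-run results (illustrative values only).
runs = pd.DataFrame({
    "task_id":        [1, 1, 1, 1, 2, 2, 2, 2],
    "config":         ["BL", "BL", "MCP", "MCP"] * 2,
    "reward":         [0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 0.9, 0.7],
    "context_length": [8_000] * 4 + [120_000] * 4,
})

# Step 1: multi-run average reward per task/config.
per_task = runs.groupby(["task_id", "config"], as_index=False)["reward"].mean()

# Step 2: one row per task with BL and MCP columns, then the per-task delta.
wide = per_task.pivot(index="task_id", columns="config", values="reward")
wide["delta"] = wide["MCP"] - wide["BL"]

# Step 3: attach the size proxy and bin it (bin edges are made up here).
sizes = runs.groupby("task_id")["context_length"].first()
wide["bin"] = pd.cut(sizes, bins=[0, 50_000, 10**9], labels=["small", "large"])

# Step 4: per-bin n, mean Δ reward, and Var(Δ reward).
summary = wide.groupby("bin", observed=True)["delta"].agg(["count", "mean", "var"])
print(summary)
```

The same `groupby`/`agg` step works unchanged for the difficulty and language breakdowns: swap the `bin` column for a difficulty or language column.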