Commit 287c09c

fix HELM numbers, chart axes, and unbroken gridlines
Cross-checked HELM cost claims against Section 6 model table (p. 43): replaced loose "$10K or 4,000+ GPU-hours per model" with actual range, corrected aggregate from "high six figures" to ~$100K, and updated the cost-summary table entry. Fixed Pythia "16 model sizes" → "16 models spanning 8 sizes". Relabeled ResearchGym row to "full pass (3 seeds)" so the dollars match the GPU-hours.

Chart fixes: axis labels now align with bar positions (flex space-between instead of grid with centered labels). Figure 2 axis converted to uniform decades ($100/$1k/$10k/$100k); all bars recomputed and small ~1% errors corrected. Figure 3 caption clarifies that bars show maximum compression, not ranges. Vertical gridlines are now continuous across all rows (chart-body wrapper with absolute-positioned ::before instead of per-track backgrounds). Each figure sets its own --grid-interval. Mobile keeps the per-track gradient.

Removed three stray image-markdown references accidentally pasted into "consequences" in the closing section.
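A minimal sketch of the arithmetic behind the recomputed bars and per-figure grid intervals, for anyone checking the numbers in this diff. It assumes each bar endpoint is placed linearly in log10 space between the first and last axis tick, and that gridlines fall on every tick. The function names `log_position` and `grid_interval` are hypothetical and do not appear in the post. Under those assumptions the sketch reproduces several values in the diff to two decimals (for example the ResearchGym and ScienceAgentBench rows); other rows may differ in the last digit, presumably because the underlying figures are not exactly the rounded labels.

```python
import math


def log_position(value, axis_min, axis_max):
    """Percent offset of `value` along a log-scaled axis.

    Assumption: the axis runs from axis_min (0%) to axis_max (100%)
    and positions are linear in log10(value).
    """
    span = math.log10(axis_max) - math.log10(axis_min)
    return 100.0 * (math.log10(value) - math.log10(axis_min)) / span


def grid_interval(tick_count):
    """Percent spacing between gridlines when ticks are evenly spaced.

    With N tick labels laid out edge to edge there are N - 1 gaps.
    """
    return 100.0 / (tick_count - 1)


if __name__ == "__main__":
    # Figure 2 axis: $100 to $100k (4 ticks, 3 decades).
    print(f"ResearchGym min: {log_position(540, 100, 100_000):.2f}%")
    print(f"ResearchGym max: {log_position(1260, 100, 100_000):.2f}%")
    # Figure 1 axis: $0.10 to $10k (6 ticks, 5 decades).
    print(f"ScienceAgentBench min: {log_position(0.19, 0.10, 10_000):.2f}%")
    print(f"ScienceAgentBench max: {log_position(77, 0.10, 10_000):.2f}%")
    # Grid spacing for 6-, 4-, and 5-tick axes.
    for ticks in (6, 4, 5):
        print(f"{ticks} ticks -> {grid_interval(ticks):.3f}%")
```

The tick counts 6, 4, and 5 correspond to the axis labels of Figures 1, 2, and 3 in the diff, which set `--grid-interval` to 20%, 33.333%, and 25% respectively.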
1 parent c81e2b6 commit 287c09c

1 file changed

Lines changed: 51 additions & 27 deletions

_posts/2026-04-25-eval-costs-bottleneck.md

@@ -231,13 +231,12 @@ description: "A field guide to evaluation costs: where the money goes, why old c
231231
color: var(--fg-subtle);
232232
}
233233
.eval-cost-article .axis-scale {
234-
display: grid;
235-
grid-template-columns: repeat(6, 1fr);
234+
display: flex;
235+
justify-content: space-between;
236236
border-bottom: 1px solid var(--border-strong);
237237
padding-bottom: 4px;
238238
}
239239
.eval-cost-article .axis-scale span {
240-
text-align: center;
241240
font-variant-numeric: tabular-nums;
242241
white-space: nowrap;
243242
}
@@ -269,15 +268,33 @@ description: "A field guide to evaluation costs: where the money goes, why old c
269268
font-family: 'IBM Plex Mono', monospace;
270269
font-size: 11.5px;
271270
}
271+
.eval-cost-article .chart-body {
272+
position: relative;
273+
--grid-interval: 20%;
274+
}
275+
.eval-cost-article .chart-body::before {
276+
content: "";
277+
position: absolute;
278+
left: 228px;
279+
right: 106px;
280+
top: 0;
281+
bottom: 0;
282+
background-image: repeating-linear-gradient(to right,
283+
transparent 0,
284+
transparent calc(var(--grid-interval) - 1px),
285+
var(--border) calc(var(--grid-interval) - 1px),
286+
var(--border) var(--grid-interval));
287+
pointer-events: none;
288+
z-index: 0;
289+
}
290+
.eval-cost-article .chart-body .chart-row {
291+
position: relative;
292+
z-index: 1;
293+
}
272294
.eval-cost-article .range-track,
273295
.eval-cost-article .bar-track {
274296
position: relative;
275297
height: 22px;
276-
background: repeating-linear-gradient(to right,
277-
transparent 0,
278-
transparent calc(20% - 1px),
279-
var(--border) calc(20% - 1px),
280-
var(--border) 20%);
281298
}
282299
.eval-cost-article .range-bar {
283300
position: absolute;
@@ -418,6 +435,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
418435
}
419436

420437
@media (max-width: 760px) {
438+
.eval-cost-article .chart-body::before { display: none; }
421439
.eval-cost-article { font-size: 16.5px; line-height: 1.68; }
422440
.eval-cost-article h2 { margin-top: 58px; letter-spacing: -0.025em; }
423441
.eval-cost-article .figure { margin: 36px auto; }
@@ -482,9 +500,9 @@ description: "A field guide to evaluation costs: where the money goes, why old c
482500

483501
<h2 id="making-static-llm-benchmarks-cheaper">Making static LLM benchmarks cheaper</h2>
484502

485-
<p>The cost problem started before agents. When Stanford's CRFM released <a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer" target="_blank">HELM</a> in 2022, full-coverage evaluation already required roughly $10,000 or 4,000+ GPU-hours per model. <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2023)</a> restate that figure, and <a href="https://research.ibm.com/blog/efficient-llm-benchmarking" rel="noopener noreferrer" target="_blank">IBM Research</a> notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Multiplied across HELM's 30 models and 42 scenarios, the aggregate ran into the high six figures. </p>
503+
<p>The cost problem started before agents. When Stanford's CRFM released <a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer" target="_blank">HELM</a> in 2022, the paper's own per-model accounting (Section 6, p. 43) showed API costs ranging from $169 for OpenAI's ada (350M) to $10,926 for AI21's J1-Jumbo (178B), and 540 to 4,200 GPU-hours for the open models, with BLOOM (176B) and OPT (175B) at the top end. <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2023)</a> restate those figures, and <a href="https://research.ibm.com/blog/efficient-llm-benchmarking" rel="noopener noreferrer" target="_blank">IBM Research</a> notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Across HELM's 30 models and 42 scenarios, the aggregate of reported costs and GPU compute came to roughly $100,000.</p>
486504

487-
<p>The more striking observation came from <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al.'s analysis</a> of <a href="https://arxiv.org/abs/2304.01373" rel="noopener noreferrer" target="_blank">EleutherAI's Pythia</a> checkpoints, developers pay for evaluation even more. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2024)</a> noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.</p>
505+
<p>The more striking observation came from <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al.'s analysis</a> of <a href="https://arxiv.org/abs/2304.01373" rel="noopener noreferrer" target="_blank">EleutherAI's Pythia</a> checkpoints, developers pay for evaluation even more. Pythia released 154 checkpoints across 16 models spanning 8 sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2024)</a> noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.</p>
488506

489507
<p>Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information; it was confirming rankings that the field could have inferred much more cheaply.</p>
490508

@@ -507,13 +525,15 @@ description: "A field guide to evaluation costs: where the money goes, why old c
507525
</div>
508526
<div aria-label="Per-run cost ranges on agent benchmarks from 19 cents to 2829 dollars." class="responsive-chart" role="img">
509527
<div class="axis"><span></span><div class="axis-scale"><span>$0.10</span><span>$1</span><span>$10</span><span>$100</span><span>$1k</span><span>$10k</span></div><span></span><div class="axis-label">Per-run cost (USD, log scale)</div></div>
510-
<div class="chart-row"><div class="chart-label">ScienceAgentBench</div><div class="range-track"><span class="range-bar" style="--min:5.56%;--max:57.33%;"></span></div><div class="chart-value">$0.19–$77</div></div>
511-
<div class="chart-row"><div class="chart-label">TAU-bench Airline</div><div class="range-track"><span class="range-bar" style="--min:9.83%;--max:65.11%;"></span></div><div class="chart-value">$0.31–$180</div></div>
528+
<div class="chart-body" style="--grid-interval: 20%;">
529+
<div class="chart-row"><div class="chart-label">ScienceAgentBench</div><div class="range-track"><span class="range-bar" style="--min:5.58%;--max:57.73%;"></span></div><div class="chart-value">$0.19–$77</div></div>
530+
<div class="chart-row"><div class="chart-label">TAU-bench Airline</div><div class="range-track"><span class="range-bar" style="--min:9.82%;--max:65.10%;"></span></div><div class="chart-value">$0.31–$180</div></div>
512531
<div class="chart-row"><div class="chart-label">CORE-Bench Hard</div><div class="range-track"><span class="range-bar" style="--min:26.02%;--max:74.15%;"></span></div><div class="chart-value">$2–$510</div></div>
513532
<div class="chart-row"><div class="chart-label">SciCode</div><div class="range-track"><span class="range-bar" style="--min:1.58%;--max:75.92%;"></span></div><div class="chart-value">$0.12–$625</div></div>
514533
<div class="chart-row"><div class="chart-label">SWE-bench Verified Mini</div><div class="range-track"><span class="range-bar" style="--min:32.04%;--max:84.08%;--series:var(--eval-warn);"></span></div><div class="chart-value">$4–$1,600</div></div>
515-
<div class="chart-row"><div class="chart-label">Online Mind2Web</div><div class="range-track"><span class="range-bar" style="--min:33.98%;--max:84.14%;--series:var(--eval-warn);"></span></div><div class="chart-value">$5–$1,610</div></div>
516-
<div class="chart-row"><div class="chart-label">GAIA</div><div class="range-track"><span class="range-bar" style="--min:37.84%;--max:89.03%;--series:var(--eval-warn);"></span></div><div class="chart-value">$7.80–$2,829</div></div>
534+
<div class="chart-row"><div class="chart-label">Online Mind2Web</div><div class="range-track"><span class="range-bar" style="--min:33.98%;--max:84.13%;--series:var(--eval-warn);"></span></div><div class="chart-value">$5–$1,610</div></div>
535+
<div class="chart-row"><div class="chart-label">GAIA</div><div class="range-track"><span class="range-bar" style="--min:37.84%;--max:89.04%;--series:var(--eval-warn);"></span></div><div class="chart-value">$7.80–$2,829</div></div>
536+
</div>
517537
</div>
518538
<figcaption class="figure-caption"><strong>Figure 1.</strong> Each bar shows the minimum-to-maximum cost across HAL configurations on a single benchmark. Highlighted bars cross the round $1,000-per-run threshold. A "run" is one full agent evaluation across all tasks. Within-benchmark spread reflects the model × scaffold × token-budget product. Source: live HAL leaderboard, April 2026.</figcaption>
519539
</figure>
@@ -558,14 +578,16 @@ description: "A field guide to evaluation costs: where the money goes, why old c
558578
<span class="legend-item"><span class="legend-swatch red"></span>$5,000 or more</span>
559579
</div>
560580
<div aria-label="Training-in-the-loop benchmark costs range from about 540 dollars to 11500 dollars." class="responsive-chart" role="img">
561-
<div class="axis"><span></span><div class="axis-scale"><span>$100</span><span>$500</span><span>$1k</span><span>$5k</span><span>$10k</span><span>$20k</span></div><span></span><div class="axis-label">USD per single evaluation (log scale)</div></div>
562-
<div class="chart-row"><div class="chart-label">ResearchGym (1 seed)</div><div class="range-track"><span class="range-bar" style="--min:31.83%;--max:47.83%;"></span></div><div class="chart-value">$540–$1,260</div></div>
563-
<div class="chart-row"><div class="chart-label">RE-Bench (full agent)</div><div class="range-track"><span class="range-bar" style="--min:46.91%;--max:53.56%;"></span></div><div class="chart-value">$1,200–$1,800</div></div>
564-
<div class="chart-row"><div class="chart-label">The Well (per architecture)</div><div class="range-track"><span class="range-bar" style="--min:54.77%;--max:62.37%;"></span></div><div class="chart-value">$1,920–$2,880</div></div>
565-
<div class="chart-row"><div class="chart-label">MLE-Bench (1 seed)</div><div class="range-track"><span class="range-bar" style="--min:61.15%;--max:63.16%;"></span></div><div class="chart-value">~$2,800</div></div>
566-
<div class="chart-row"><div class="chart-label">PaperBench Code-Dev</div><div class="range-track"><span class="range-bar" style="--min:70.68%;--max:70.68%;"></span></div><div class="chart-value">~$4,200</div></div>
567-
<div class="chart-row"><div class="chart-label">The Well (full sweep)</div><div class="range-track"><span class="range-bar" style="--min:82.01%;--max:89.60%;--series:var(--eval-warn);"></span></div><div class="chart-value">$7,700–$11,500</div></div>
568-
<div class="chart-row"><div class="chart-label">PaperBench (full)</div><div class="range-track"><span class="range-bar" style="--min:85.97%;--max:85.97%;--series:var(--eval-warn);"></span></div><div class="chart-value">~$9,500</div></div>
581+
<div class="axis"><span></span><div class="axis-scale"><span>$100</span><span>$1k</span><span>$10k</span><span>$100k</span></div><span></span><div class="axis-label">USD per single evaluation (log scale)</div></div>
582+
<div class="chart-body" style="--grid-interval: 33.333%;">
583+
<div class="chart-row"><div class="chart-label">ResearchGym (full pass, 3 seeds)</div><div class="range-track"><span class="range-bar" style="--min:24.41%;--max:36.68%;"></span></div><div class="chart-value">$540–$1,260</div></div>
584+
<div class="chart-row"><div class="chart-label">RE-Bench (full agent)</div><div class="range-track"><span class="range-bar" style="--min:35.97%;--max:41.84%;"></span></div><div class="chart-value">$1,200–$1,800</div></div>
585+
<div class="chart-row"><div class="chart-label">The Well (per architecture)</div><div class="range-track"><span class="range-bar" style="--min:42.78%;--max:48.65%;"></span></div><div class="chart-value">$1,920–$2,880</div></div>
586+
<div class="chart-row"><div class="chart-label">MLE-Bench (1 seed)</div><div class="range-track"><span class="range-bar" style="--min:47.71%;--max:49.24%;"></span></div><div class="chart-value">$2,700–$3,000</div></div>
587+
<div class="chart-row"><div class="chart-label">PaperBench Code-Dev</div><div class="range-track"><span class="range-bar" style="--min:54.11%;--max:54.11%;"></span></div><div class="chart-value">~$4,200</div></div>
588+
<div class="chart-row"><div class="chart-label">The Well (full sweep)</div><div class="range-track"><span class="range-bar" style="--min:62.88%;--max:68.69%;--series:var(--eval-warn);"></span></div><div class="chart-value">$7,700–$11,500</div></div>
589+
<div class="chart-row"><div class="chart-label">PaperBench (full)</div><div class="range-track"><span class="range-bar" style="--min:65.92%;--max:65.92%;--series:var(--eval-warn);"></span></div><div class="chart-value">~$9,500</div></div>
590+
</div>
569591
</div>
570592
<figcaption class="figure-caption"><strong>Figure 2.</strong> All values in USD per single evaluation of one model or agent through the full benchmark protocol. GPU costs converted at $2.50/H100-hr, $1.50/A10-hr; API and grading costs included where applicable. Highlighted bars denote benchmarks costing at least the round $5,000-per-evaluation threshold. The most expensive of these match the most expensive agent benchmarks (Figure 1) but require GPU compute that has no API substitute.</figcaption>
571593
</figure>
@@ -578,16 +600,18 @@ description: "A field guide to evaluation costs: where the money goes, why old c
578600
<div class="chart-title">Compression factors achievable by benchmark type</div>
579601
<div class="chart-subtitle">Maximum reduction in evaluation compute that preserves model-rank fidelity, log scale</div>
580602
<div aria-label="Color legend" class="chart-legend">
581-
<span class="legend-item"><span class="legend-swatch block"></span>Measured compression</span>
603+
<span class="legend-item"><span class="legend-swatch block"></span>Maximum measured compression</span>
582604
<span class="legend-item"><span class="legend-swatch block red"></span>No general compression method</span>
583605
</div>
584606
<div aria-label="Static benchmarks compress by about 100 to 200 times, agent benchmarks by 2 to 3.5 times, and training-in-the-loop benchmarks by about 1 time." class="responsive-chart" role="img">
585-
<div class="axis"><span></span><div class="axis-scale"><span>1×</span><span>10×</span><span>100×</span><span>1k×</span><span>10k×</span><span></span></div><span></span><div class="axis-label">Compression factor (log scale)</div></div>
607+
<div class="axis"><span></span><div class="axis-scale"><span>1×</span><span>10×</span><span>100×</span><span>1k×</span><span>10k×</span></div><span></span><div class="axis-label">Compression factor (log scale)</div></div>
608+
<div class="chart-body" style="--grid-interval: 25%;">
586609
<div class="chart-row"><div class="chart-label">Static benchmarks</div><div class="bar-track"><span class="single-bar" style="--max:57.53%;"></span></div><div class="chart-value">100–200×</div></div>
587610
<div class="chart-row"><div class="chart-label">Agentic benchmarks</div><div class="bar-track"><span class="single-bar" style="--max:13.60%;"></span></div><div class="chart-value">2–3.5×</div></div>
588611
<div class="chart-row"><div class="chart-label">Training-in-the-loop</div><div class="bar-track"><span class="single-bar thin" style="--max:.8%;"></span></div><div class="chart-value">~1×</div></div>
589612
</div>
590-
<figcaption class="figure-caption"><strong>Figure 3.</strong> The toolkit for compressing evaluation does not transfer as benchmarks become more complex. Solid bars show measured compression ranges. The highlighted bar is not a cost threshold; it flags the ~1× baseline where no general compression method exists. Static benchmarks routinely compress 100–200× without losing rankings. Agent benchmarks compress 2–3.5× at best. Training-in-the-loop benchmarks resist subsampling because the unit being evaluated <em>is</em> the trained model.</figcaption>
613+
</div>
614+
<figcaption class="figure-caption"><strong>Figure 3.</strong> The toolkit for compressing evaluation does not transfer as benchmarks become more complex. Bars show the maximum measured compression that preserves model-rank fidelity; labels give the published range. The highlighted bar flags the ~1× baseline where no general compression method exists. Static benchmarks routinely compress 100–200× without losing rankings. Agent benchmarks compress 2–3.5× at best. Training-in-the-loop benchmarks resist subsampling because the unit being evaluated <em>is</em> the trained model.</figcaption>
591615
</figure>
592616

593617
<h2 id="reliability-is-the-expensive-part">Reliability is the expensive part</h2>
@@ -633,15 +657,15 @@ description: "A field guide to evaluation costs: where the money goes, why old c
633657
</tr>
634658
</thead>
635659
<tbody>
636-
<tr><td>HELM (per LLM, 2022)</td><td>Static LLM</td><td>~$8,000 – $10,000</td><td>One LLM through full HELM (~4,000 GPU-hrs)</td></tr>
660+
<tr><td>HELM (per LLM, 2022)</td><td>Static LLM</td><td>$85 – $10,926 API; 540 – 4,200 GPU-hrs open</td><td>One LLM through 42 scenarios; per-model table in HELM §6 p. 43</td></tr>
637661
<tr><td>ScienceAgentBench</td><td>Agentic, science</td><td>$0.19 – $77</td><td>One agent config across 102 tasks</td></tr>
638662
<tr><td>TAU-bench Airline</td><td>Agentic</td><td>$0.31 – $180</td><td>One agent across all airline tasks</td></tr>
639663
<tr><td>SciCode</td><td>Agentic, science</td><td>$0.12 – $625</td><td>One agent across 338 sub-problems</td></tr>
640664
<tr><td>CORE-Bench Hard</td><td>Agentic, replication</td><td>$2 – $510</td><td>One agent across 45 papers</td></tr>
641665
<tr><td>SWE-bench Verified Mini</td><td>Agentic, coding</td><td>$4 – $1,600</td><td>One agent across 50 issues</td></tr>
642666
<tr><td>Online Mind2Web</td><td>Agentic, web</td><td>$5 – $1,610</td><td>One agent across 300 web tasks</td></tr>
643667
<tr><td>GAIA</td><td>Agentic, multimodal</td><td>$7.80 – $2,829</td><td>One agent across GAIA tasks</td></tr>
644-
<tr><td>ResearchGym (per seed)</td><td>ML research, training</td><td>$540 – $1,260</td><td>5 tasks × 24h GPU + API</td></tr>
668+
<tr><td>ResearchGym (full pass)</td><td>ML research, training</td><td>$540 – $1,260</td><td>5 tasks × 24h × 3 seeds (~360 GPU-hrs) + API</td></tr>
645669
<tr><td>RE-Bench (full agent)</td><td>ML R&amp;D, training</td><td>$1,200 – $1,800</td><td>7 environments × 8h on H100</td></tr>
646670
<tr><td>The Well (per architecture)</td><td>SciML, training</td><td>$1,920 – $2,880</td><td>5 LRs × 16 datasets × 12h H100</td></tr>
647671
<tr><td>MLE-Bench (1 seed)</td><td>ML R&amp;D, training</td><td>~$2,700 – $3,000</td><td>75 Kaggle competitions × 24h on A10</td></tr>
