fix HELM numbers, chart axes, and unbroken gridlines
Cross-checked HELM cost claims against Section 6 model table (p. 43):
replaced loose "$10K or 4,000+ GPU-hours per model" with actual
range, corrected aggregate from "high six figures" to ~$100K, and
updated the cost-summary table entry. Fixed Pythia "16 model sizes"
→ "16 models spanning 8 sizes". Relabeled ResearchGym row to "full
pass (3 seeds)" so the dollars match the GPU-hours.
Chart fixes: axis labels now align with bar positions (flex
space-between instead of grid with centered labels). Figure 2 axis
converted to uniform decades ($100/$1k/$10k/$100k); all bars
recomputed and small ~1% errors corrected. Figure 3 caption
clarifies that bars show maximum compression, not ranges.
Vertical gridlines are now continuous across all rows (chart-body
wrapper with absolute-positioned ::before instead of per-track
backgrounds). Each figure sets its own --grid-interval. Mobile
keeps the per-track gradient.
Removed three stray image-markdown references accidentally pasted
into "consequences" in the closing section.
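For reviewers who want to re-check the recomputed Figure 2 bars, the following is a minimal sketch of the log-scale mapping, assuming the track runs from $100 at 0% to $100k at 100% (three uniform decades). The helper name `log_position` is hypothetical and is not part of the site's code; only the dollar values and the percentages come from this diff.

```python
import math


def log_position(value, axis_min=100.0, axis_max=100_000.0):
    """Return the percent offset of `value` along a log10-scaled axis.

    Assumes the axis starts at `axis_min` (0%) and ends at `axis_max` (100%).
    """
    if value <= 0 or axis_min <= 0 or axis_max <= axis_min:
        raise ValueError("values must be positive and axis_max must exceed axis_min")
    span = math.log10(axis_max) - math.log10(axis_min)
    return 100.0 * (math.log10(value) - math.log10(axis_min)) / span


# Dollar ranges taken from the two chart rows shown in the Figure 2 hunk.
rows = [
    ("The Well (per architecture)", 1_920, 2_880),
    ("The Well (full sweep)", 7_700, 11_500),
]

for label, low, high in rows:
    print(f"{label}: --min:{log_position(low):.2f}%; --max:{log_position(high):.2f}%;")
```

Under that assumption the two rows reproduce the --min/--max values on the added lines (42.78%/48.65% and 62.88%/68.69%).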
-<p>The cost problem started before agents. When Stanford's CRFM released <a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer" target="_blank">HELM</a> in 2022, full-coverage evaluation already required roughly $10,000 or 4,000+ GPU-hours per model. <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2023)</a> restate that figure, and <a href="https://research.ibm.com/blog/efficient-llm-benchmarking" rel="noopener noreferrer" target="_blank">IBM Research</a> notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Multiplied across HELM's 30 models and 42 scenarios, the aggregate ran into the high six figures. </p>
+<p>The cost problem started before agents. When Stanford's CRFM released <a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer" target="_blank">HELM</a> in 2022, the paper's own per-model accounting (Section 6, p. 43) showed API costs ranging from $169 for OpenAI's ada (350M) to $10,926 for AI21's J1-Jumbo (178B), and 540 to 4,200 GPU-hours for the open models, with BLOOM (176B) and OPT (175B) at the top end. <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2023)</a> restate those figures, and <a href="https://research.ibm.com/blog/efficient-llm-benchmarking" rel="noopener noreferrer" target="_blank">IBM Research</a> notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Across HELM's 30 models and 42 scenarios, the aggregate of reported costs and GPU compute came to roughly $100,000.</p>
-<p>The more striking observation came from <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al.'s analysis</a> of <a href="https://arxiv.org/abs/2304.01373" rel="noopener noreferrer" target="_blank">EleutherAI's Pythia</a> checkpoints, where developers pay even more for evaluation. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2024)</a> noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.</p>
+<p>The more striking observation came from <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al.'s analysis</a> of <a href="https://arxiv.org/abs/2304.01373" rel="noopener noreferrer" target="_blank">EleutherAI's Pythia</a> checkpoints, where developers pay even more for evaluation. Pythia released 154 checkpoints across 16 models spanning 8 sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: <a href="https://arxiv.org/abs/2308.11696v5" rel="noopener noreferrer" target="_blank">Perlitz et al. (2024)</a> noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.</p>
<p>Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information; it was confirming rankings that the field could have inferred much more cheaply.</p>
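The coarse-to-fine idea described above can be illustrated with a small sketch. This is not Flash-HELM's actual procedure or code: the function name `coarse_to_fine`, the `evaluate(model, n_examples)` callback, the budgets, and the toy scores are all assumptions made up for illustration. It shows only the general shape: score every model cheaply, then spend the expensive budget on the top few.

```python
import math
from typing import Callable, List, Tuple


def coarse_to_fine(
    models: List[str],
    evaluate: Callable[[str, int], float],
    coarse_budget: int,
    fine_budget: int,
    top_k: int,
) -> Tuple[List[str], int]:
    """Rank all models cheaply, then re-rank only the top_k at full resolution.

    Returns the final ordering and the total number of examples evaluated.
    """
    # Pass 1: cheap, noisy scores for every model.
    coarse = {m: evaluate(m, coarse_budget) for m in models}
    ranked = sorted(models, key=lambda m: coarse[m], reverse=True)

    # Pass 2: expensive scores for the finalists only.
    finalists = ranked[:top_k]
    fine = {m: evaluate(m, fine_budget) for m in finalists}
    reranked = sorted(finalists, key=lambda m: fine[m], reverse=True)

    total_cost = len(models) * coarse_budget + len(finalists) * fine_budget
    return reranked + ranked[top_k:], total_cost


# Toy evaluator (hypothetical): a fixed "true" quality plus an error term
# that shrinks as 1/sqrt(n), standing in for sampling noise.
TRUE_QUALITY = {"a": 0.80, "b": 0.78, "c": 0.60, "d": 0.40}
ERROR_SCALE = {"a": -0.2, "b": 0.2, "c": 0.1, "d": -0.1}


def toy_evaluate(model: str, n_examples: int) -> float:
    return TRUE_QUALITY[model] + ERROR_SCALE[model] / math.sqrt(n_examples)


order, cost = coarse_to_fine(list(TRUE_QUALITY), toy_evaluate, 100, 10_000, top_k=2)
full_cost = len(TRUE_QUALITY) * 10_000
print(order, cost, full_cost)
```

In this toy setup the cheap pass misorders the two closest models, and the expensive pass on the two finalists corrects it while evaluating about half as many examples as a full-resolution run on all four.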
@@ -507,13 +525,15 @@ description: "A field guide to evaluation costs: where the money goes, why old c
</div>
<div aria-label="Per-run cost ranges on agent benchmarks from 19 cents to 2829 dollars." class="responsive-chart" role="img">
<figcaption class="figure-caption"><strong>Figure 1.</strong> Each bar shows the minimum-to-maximum cost across HAL configurations on a single benchmark. Highlighted bars cross the round $1,000-per-run threshold. A "run" is one full agent evaluation across all tasks. Within-benchmark spread reflects the model × scaffold × token-budget product. Source: live HAL leaderboard, April 2026.</figcaption>
</figure>
@@ -558,14 +578,16 @@ description: "A field guide to evaluation costs: where the money goes, why old c
<span class="legend-item"><span class="legend-swatch red"></span>$5,000 or more</span>
</div>
<div aria-label="Training-in-the-loop benchmark costs range from about 540 dollars to 11500 dollars." class="responsive-chart" role="img">
-<div class="axis"><span></span><div class="axis-scale"><span>$100</span><span>$500</span><span>$1k</span><span>$5k</span><span>$10k</span><span>$20k</span></div><span></span><div class="axis-label">USD per single evaluation (log scale)</div></div>
-<div class="chart-row"><div class="chart-label">The Well (per architecture)</div><div class="range-track"><span class="range-bar" style="--min:54.77%;--max:62.37%;"></span></div><div class="chart-value">$1,920–$2,880</div></div>
-<div class="chart-row"><div class="chart-label">The Well (full sweep)</div><div class="range-track"><span class="range-bar" style="--min:82.01%;--max:89.60%;--series:var(--eval-warn);"></span></div><div class="chart-value">$7,700–$11,500</div></div>
+<div class="axis"><span></span><div class="axis-scale"><span>$100</span><span>$1k</span><span>$10k</span><span>$100k</span></div><span></span><div class="axis-label">USD per single evaluation (log scale)</div></div>
+<div class="chart-row"><div class="chart-label">The Well (per architecture)</div><div class="range-track"><span class="range-bar" style="--min:42.78%;--max:48.65%;"></span></div><div class="chart-value">$1,920–$2,880</div></div>
+<div class="chart-row"><div class="chart-label">The Well (full sweep)</div><div class="range-track"><span class="range-bar" style="--min:62.88%;--max:68.69%;--series:var(--eval-warn);"></span></div><div class="chart-value">$7,700–$11,500</div></div>
<figcaption class="figure-caption"><strong>Figure 2.</strong> All values in USD per single evaluation of one model or agent through the full benchmark protocol. GPU costs converted at $2.50/H100-hr, $1.50/A10-hr; API and grading costs included where applicable. Highlighted bars denote benchmarks costing at least the round $5,000-per-evaluation threshold. The most expensive of these match the most expensive agent benchmarks (Figure 1) but require GPU compute that has no API substitute.</figcaption>
</figure>
@@ -578,16 +600,18 @@ description: "A field guide to evaluation costs: where the money goes, why old c
<div class="chart-title">Compression factors achievable by benchmark type</div>
<div class="chart-subtitle">Maximum reduction in evaluation compute that preserves model-rank fidelity, log scale</div>
<span class="legend-item"><span class="legend-swatch block red"></span>No general compression method</span>
</div>
<div aria-label="Static benchmarks compress by about 100 to 200 times, agent benchmarks by 2 to 3.5 times, and training-in-the-loop benchmarks by about 1 time." class="responsive-chart" role="img">
-<figcaption class="figure-caption"><strong>Figure 3.</strong> The toolkit for compressing evaluation does not transfer as benchmarks become more complex. Solid bars show measured compression ranges. The highlighted bar is not a cost threshold; it flags the ~1× baseline where no general compression method exists. Static benchmarks routinely compress 100–200× without losing rankings. Agent benchmarks compress 2–3.5× at best. Training-in-the-loop benchmarks resist subsampling because the unit being evaluated <em>is</em> the trained model.</figcaption>
+</div>
+<figcaption class="figure-caption"><strong>Figure 3.</strong> The toolkit for compressing evaluation does not transfer as benchmarks become more complex. Bars show the maximum measured compression that preserves model-rank fidelity; labels give the published range. The highlighted bar flags the ~1× baseline where no general compression method exists. Static benchmarks routinely compress 100–200× without losing rankings. Agent benchmarks compress 2–3.5× at best. Training-in-the-loop benchmarks resist subsampling because the unit being evaluated <em>is</em> the trained model.</figcaption>
</figure>
<h2 id="reliability-is-the-expensive-part">Reliability is the expensive part</h2>
@@ -633,15 +657,15 @@ description: "A field guide to evaluation costs: where the money goes, why old c
</tr>
</thead>
<tbody>
-<tr><td>HELM (per LLM, 2022)</td><td>Static LLM</td><td>~$8,000 – $10,000</td><td>One LLM through full HELM (~4,000 GPU-hrs)</td></tr>
+<tr><td>HELM (per LLM, 2022)</td><td>Static LLM</td><td>$85 – $10,926 API; 540 – 4,200 GPU-hrs open</td><td>One LLM through 42 scenarios; per-model table in HELM §6 p. 43</td></tr>
<tr><td>ScienceAgentBench</td><td>Agentic, science</td><td>$0.19 – $77</td><td>One agent config across 102 tasks</td></tr>
<tr><td>TAU-bench Airline</td><td>Agentic</td><td>$0.31 – $180</td><td>One agent across all airline tasks</td></tr>
<tr><td>SciCode</td><td>Agentic, science</td><td>$0.12 – $625</td><td>One agent across 338 sub-problems</td></tr>
<tr><td>CORE-Bench Hard</td><td>Agentic, replication</td><td>$2 – $510</td><td>One agent across 45 papers</td></tr>
<tr><td>SWE-bench Verified Mini</td><td>Agentic, coding</td><td>$4 – $1,600</td><td>One agent across 50 issues</td></tr>
<tr><td>Online Mind2Web</td><td>Agentic, web</td><td>$5 – $1,610</td><td>One agent across 300 web tasks</td></tr>
<tr><td>GAIA</td><td>Agentic, multimodal</td><td>$7.80 – $2,829</td><td>One agent across GAIA tasks</td></tr>