Site tier-1: open-set banner, sensitivity selector, bootstrap intervals #9
Open
Adds the credibility-tightening tier-1 leaderboard changes from
docs/site_improvements_scope.md, plus a shared sticky header that the
paper page now reuses.
Header
- Extract SiteHeader from Hero. The new component owns the sticky
brand + nav + view-selector + action-link layout and supports an
alwaysExpanded mode for pages without an in-page hero.
- Hero refactored to wrap SiteHeader and pass the country-aware
subtitle, stat strip, and snapshot pill as expandedContent. Drop the
"Top score" stat and the "Leading: <model>" sidebar; the leaderboard
itself is the canonical source for both.
- /paper uses SiteHeader with alwaysExpanded, no view selector, and a
Benchmark action link. The page body keeps its eyebrow/buttons/iframe.
Open-set banner + snapshot pill
- Above the leaderboard, a warning-tinted note states that scenarios
and reference outputs are public, so the public preview is open-set.
- Snapshot date pill (Snapshot 2026-05-01) appears in the hero stat row
on the home page and next to the Manuscript eyebrow on /paper.
Sensitivity-view selector
- New segmented control with five views: Main, Amount only, Binary
only, Positive cases, Zero cases. Selecting a view rescores models
client-side from scenarioPredictions and reorders the leaderboard;
the description for the active view appears next to the selector.
- New utilities under app/src/lib/:
- scoring.ts ports score_single_prediction (mean of exact, within-1%,
within-5%, within-10% for amount; classification accuracy for
binary; output-group resolution for person-expanded variables).
Verified against the canonical analysis.py on the snapshot for
both US and UK headline scopes.
- sensitivity.ts builds the per-row score table from a DashboardBundle
and aggregates output-group means -> country -> global, preserving
the country-equal weighting. Sensitivity views filter rows before
aggregation.
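The scoring and aggregation rules described above can be sketched roughly as follows. This is a minimal illustration, not the code in scoring.ts or sensitivity.ts: the ScoreRow shape, the function names, the handling of a zero reference value, and the use of strict equality for "exact" are all assumptions made for the example.

```typescript
// Hypothetical sketch; names, row shape, and edge-case handling are assumptions.
interface ScoreRow {
  country: string;
  outputGroup: string;
  score: number; // per-row score in [0, 1]
}

// Amount outputs: mean of four indicators (exact, within 1%, 5%, 10%).
function scoreAmount(predicted: number, truth: number): number {
  const absErr = Math.abs(predicted - truth);
  let relErr: number;
  if (absErr === 0) relErr = 0;
  else if (truth === 0) relErr = Infinity; // assumption: only an exact hit scores when truth is 0
  else relErr = absErr / Math.abs(truth);
  const hits = [absErr === 0, relErr <= 0.01, relErr <= 0.05, relErr <= 0.1];
  return hits.filter(Boolean).length / hits.length;
}

// Binary outputs: classification accuracy for a single prediction is 0 or 1.
function scoreBinary(predicted: boolean, truth: boolean): number {
  return predicted === truth ? 1 : 0;
}

function mean(xs: number[]): number {
  return xs.length === 0 ? NaN : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Rows -> output-group means -> country mean -> global mean.
// Averaging country means (not pooled rows) keeps countries equally weighted.
function aggregate(rows: ScoreRow[]): { byCountry: Map<string, number>; global: number } {
  const groups = new Map<string, Map<string, number[]>>();
  for (const row of rows) {
    const byGroup = groups.get(row.country) ?? new Map<string, number[]>();
    const bucket = byGroup.get(row.outputGroup) ?? [];
    bucket.push(row.score);
    byGroup.set(row.outputGroup, bucket);
    groups.set(row.country, byGroup);
  }
  const byCountry = new Map<string, number>();
  groups.forEach((byGroup, country) => {
    const groupMeans = Array.from(byGroup.values()).map((xs) => mean(xs));
    byCountry.set(country, mean(groupMeans));
  });
  return { byCountry, global: mean(Array.from(byCountry.values())) };
}
```

Under this sketch, a sensitivity view would amount to filtering the rows array before calling aggregate, so the weighting logic stays the same across views.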
Bootstrap rank intervals
- bootstrap.ts implements the household-resampling bootstrap with a
deterministic mulberry32 RNG (seed 42, 400 draws) and reports the
95% score interval and the rank range for each model under the
active sensitivity view.
- ModelLeaderboard renders Rank N(-M) - 95% L-U next to each model's
point estimate, with a tooltip naming the bootstrap parameters.
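As a rough illustration of the bootstrap described above, the sketch below resamples households with replacement using a seeded mulberry32 generator and reports percentile bounds. It is not the code in bootstrap.ts: the HouseholdScores input shape, the function names, the plain per-household mean (the real pipeline aggregates through output groups and countries), the nearest-index quantile rule, the tie handling, and the use of percentile bounds for the rank range are assumptions.

```typescript
// Hypothetical sketch; input shape, quantile rule, and tie handling are assumptions.
type HouseholdScores = Record<string, number[]>; // model -> score per household, same order

interface Interval {
  scoreLo: number;
  scoreHi: number;
  rankLo: number;
  rankHi: number;
}

// mulberry32: small deterministic 32-bit PRNG returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Nearest-index quantile of an unsorted sample.
function quantile(values: number[], q: number): number {
  const sorted = values.slice().sort((x, y) => x - y);
  return sorted[Math.round(q * (sorted.length - 1))];
}

function bootstrapIntervals(
  scores: HouseholdScores,
  draws = 400,
  seed = 42,
): Record<string, Interval> {
  const models = Object.keys(scores);
  if (models.length === 0) return {};
  const n = scores[models[0]].length;
  const rng = mulberry32(seed);
  const drawScores: Record<string, number[]> = {};
  const drawRanks: Record<string, number[]> = {};
  for (const m of models) {
    drawScores[m] = [];
    drawRanks[m] = [];
  }
  for (let d = 0; d < draws; d++) {
    // One resample of household indices, shared by all models in this draw.
    const idx = Array.from({ length: n }, () => Math.floor(rng() * n));
    const means = models.map((m) => idx.reduce((s, i) => s + scores[m][i], 0) / n);
    models.forEach((m, k) => {
      // Rank 1 is best; tied models share the better rank.
      const rank = 1 + means.filter((x) => x > means[k]).length;
      drawScores[m].push(means[k]);
      drawRanks[m].push(rank);
    });
  }
  const out: Record<string, Interval> = {};
  for (const m of models) {
    out[m] = {
      scoreLo: quantile(drawScores[m], 0.025),
      scoreHi: quantile(drawScores[m], 0.975),
      rankLo: quantile(drawRanks[m], 0.025),
      rankHi: quantile(drawRanks[m], 0.975),
    };
  }
  return out;
}
```

Because the generator is seeded, repeated calls with the same input return identical intervals, which is what lets server and client renders agree.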
Repo
- Move the Python wheel-artifact lib/ rule in .gitignore to /lib/ and
/lib64/ (top-level only) so app/src/lib/ is tracked.
Verification
- bun run lint - clean
- bun run build - clean (Next.js 16 production build)
- bun run start - SSR render of / contains the open-set banner, the
snapshot pill, the five sensitivity selector chips, and per-model
Rank/95% interval rows for all 12 models. /paper renders SiteHeader
with the snapshot pill and Benchmark action link, no view selector.
Summary
Implements the tier-1 leaderboard improvements from docs/site_improvements_scope.md, plus a shared sticky header that the /paper page now reuses (matching the home page chrome, without the Global/US/UK view selector).
Header
- Extract SiteHeader from Hero. The new component owns the sticky brand + nav + view-selector + action-link layout and supports an alwaysExpanded mode for pages that don't drive their own collapse.
- Hero refactored to wrap SiteHeader and pass the country-aware subtitle, stat strip, and snapshot pill as expandedContent. Drops the Top score stat and the Leading: <model> sidebar; the leaderboard itself is the canonical source for both.
- /paper uses SiteHeader with alwaysExpanded, no view selector, and a Benchmark action link. The page body keeps its eyebrow/buttons/iframe; the inline H1 is gone since the header carries the brand.
Open-set banner and snapshot pill
- Above the leaderboard, a warning-tinted note states that scenarios and reference outputs are public, so the public preview is open-set.
- Snapshot date pill (Snapshot 2026-05-01) in the hero stat row on the home page and next to the Manuscript eyebrow on /paper.
Sensitivity-view selector
- New segmented control with five views: Main / Amount only / Binary only / Positive cases / Zero cases. Selecting a view rescores models client-side from scenarioPredictions and reorders the table; the description for the active view appears inline.
- New utilities under app/src/lib/:
- scoring.ts ports score_single_prediction (mean of exact / within-1% / within-5% / within-10% for amount outputs; classification accuracy for binary; output-group resolution for person-expanded variables). Verified against canonical analysis.py on the snapshot for both US and UK headline scopes.
- sensitivity.ts builds the per-row score table from a DashboardBundle and aggregates output-group means → country → global, preserving country-equal weighting. Sensitivity views filter rows before aggregation.
Bootstrap rank intervals
- bootstrap.ts implements the household-resampling bootstrap with a deterministic mulberry32 RNG (seed 42, 400 draws). For each model in the active sensitivity view, it reports the 95% score interval and the rank range.
- ModelLeaderboard renders Rank N(-M) · 95% L-U next to each model's point estimate. Sample output for the main global view: Rank 1 has 95% CI 79.8-83.5; Rank 2-3 cluster at 77.4-81.8; the tail spreads Rank 8-11 / Rank 9-11. Tooltip names the bootstrap parameters.
Repo
- Move the Python wheel-artifact lib/ rule in .gitignore to /lib/ and /lib64/ (top-level only) so app/src/lib/ is tracked.
Verification
- bun run lint - clean
- bun run build - clean (Next.js 16 production build)
- bun run start - SSR render of / contains the open-set banner, the snapshot pill, the five sensitivity selector chips, and per-model Rank/95% interval rows for all 12 models. /paper renders SiteHeader with the snapshot pill and Benchmark action link, no view selector.
- analysis.py for both countries' top-5 models (gpt-5.5: US 89.20 / UK 81.64; grok-4.20: US 88.71; gemini-3.1-pro-preview: UK 79.61, etc.).
Test plan
- /: confirm header drops "Top score" stat and "Leading: GPT-5.5" pill, and the snapshot pill is visible
- Positive cases: leaderboard reorders and ranks/intervals refresh
- Zero cases: same
- /paper: confirm the same sticky header style without a view selector
Follow-ups (not in this PR)
- /model/[id])
- docs/site_improvements_scope.md
🤖 Generated with Claude Code