
Site tier-1: open-set banner, sensitivity selector, bootstrap intervals #9

Open
MaxGhenis wants to merge 1 commit into main from site-tier1

Conversation

@MaxGhenis (Contributor)

Summary

Implements the tier-1 leaderboard improvements from docs/site_improvements_scope.md, plus a shared sticky header that the /paper page now reuses (matching the home page chrome, without the Global/US/UK view selector).

Header

  • Extract SiteHeader from Hero. The new component owns the sticky brand + nav + view-selector + action-link layout and supports an alwaysExpanded mode for pages that don't drive their own collapse.
  • Hero refactored to wrap SiteHeader and pass the country-aware subtitle, stat strip, and snapshot pill as expandedContent. Drops the Top score stat and the Leading: <model> sidebar — the leaderboard itself is the canonical source for both.
  • /paper uses SiteHeader with alwaysExpanded, no view selector, and a Benchmark action link. The page body keeps its eyebrow/buttons/iframe; the inline H1 is gone since the header carries the brand.
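The header split above can be sketched as a prop interface. All names here are illustrative assumptions, not the component's actual API:

```typescript
// Hypothetical prop shape for the extracted SiteHeader; the real
// component's props may differ.
interface SiteHeaderProps {
  // Pages like /paper that don't drive their own collapse pass true.
  alwaysExpanded?: boolean;
  // The home page shows the Global/US/UK selector; /paper omits it.
  showViewSelector?: boolean;
  // Single action link, e.g. "Benchmark" on /paper.
  actionLink?: { label: string; href: string };
  // Hero-provided content: subtitle, stat strip, snapshot pill.
  // (ReactNode in the real app; typed loosely here to stay framework-free.)
  expandedContent?: unknown;
}

// Example: the configuration /paper would pass.
const paperHeader: SiteHeaderProps = {
  alwaysExpanded: true,
  showViewSelector: false,
  actionLink: { label: "Benchmark", href: "/" },
};
```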

Open-set banner and snapshot pill

  • Warning-tinted note above the leaderboard: "Open-set leaderboard. The public scenario explorer exposes prompts and PolicyEngine reference outputs, so future model releases or fine-tunes could learn from the released cases…".
  • Snapshot date pill (Snapshot 2026-05-01) in the hero stat row on the home page and next to the Manuscript eyebrow on /paper.

Sensitivity-view selector

  • Segmented control above the leaderboard table: Main / Amount only / Binary only / Positive cases / Zero cases. Selecting a view rescores models client-side from scenarioPredictions and reorders the table; the description for the active view appears inline.
  • New utilities under app/src/lib/:
    • scoring.ts ports score_single_prediction (mean of exact / within-1% / within-5% / within-10% for amount outputs; classification accuracy for binary; output-group resolution for person-expanded variables). Verified against canonical analysis.py on the snapshot for both US and UK headline scopes.
    • sensitivity.ts builds the per-row score table from a DashboardBundle and aggregates output-group means → country → global, preserving country-equal weighting. Sensitivity views filter rows before aggregation.
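The scoring and aggregation ports can be illustrated with a minimal sketch. The shapes below are assumptions; the real scoring.ts and sensitivity.ts also handle output-group resolution and person-expanded variables:

```typescript
// Amount outputs: mean of four pass/fail bands (exact, within 1%, 5%, 10%
// of the reference). A zero reference can only be matched exactly.
function scoreAmount(predicted: number, reference: number): number {
  const relErr =
    reference === 0
      ? predicted === 0 ? 0 : Infinity
      : Math.abs(predicted - reference) / Math.abs(reference);
  const bands = [predicted === reference, relErr <= 0.01, relErr <= 0.05, relErr <= 0.1];
  return bands.filter(Boolean).length / bands.length;
}

// Binary outputs: plain classification accuracy.
function scoreBinary(predicted: boolean, reference: boolean): number {
  return predicted === reference ? 1 : 0;
}

// Country-equal weighting: mean within each country, then mean of the
// country means, so each country contributes equally regardless of row count.
function countryEqualMean(rows: { country: string; score: number }[]): number {
  const byCountry = new Map<string, number[]>();
  for (const r of rows) {
    const bucket = byCountry.get(r.country) ?? [];
    bucket.push(r.score);
    byCountry.set(r.country, bucket);
  }
  const means = [...byCountry.values()].map(
    (s) => s.reduce((a, b) => a + b, 0) / s.length,
  );
  return means.reduce((a, b) => a + b, 0) / means.length;
}
```

For example, a prediction 3% off the reference earns 0.5 (it passes the 5% and 10% bands but not exact or 1%), and a US bucket with twice as many rows as the UK bucket still counts for half the global mean.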

Bootstrap rank intervals

  • bootstrap.ts implements the household-resampling bootstrap with a deterministic mulberry32 RNG (seed 42, 400 draws). For each model in the active sensitivity view, it reports the 95% score interval and the rank range.
  • ModelLeaderboard renders Rank N(-M) · 95% L-U next to each model's point estimate. Sample output for the main global view: Rank 1 has 95% CI 79.8-83.5; Rank 2-3 cluster at 77.4-81.8; the tail spreads Rank 8-11 / Rank 9-11. Tooltip names the bootstrap parameters.
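The resampling loop can be sketched as follows, assuming a flat array of per-household scores for one model; the real bootstrap.ts also tracks cross-model rank ranges per draw:

```typescript
// mulberry32: tiny deterministic PRNG, so intervals are reproducible
// across builds for a fixed seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Resample households with replacement, recompute the mean score on each
// draw, and take the 2.5th/97.5th percentiles of the draw means.
function bootstrapInterval(
  householdScores: number[],
  draws = 400,
  seed = 42,
): [number, number] {
  const rand = mulberry32(seed);
  const n = householdScores.length;
  const means: number[] = [];
  for (let d = 0; d < draws; d++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += householdScores[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  const at = (q: number) => means[Math.min(draws - 1, Math.floor(q * draws))];
  return [at(0.025), at(0.975)];
}
```

Because the RNG is seeded, the same snapshot always yields the same intervals, which keeps the rendered `Rank N(-M) · 95% L-U` rows stable across deployments.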

Repo

  • Move the Python wheel-artifact lib/ rule in .gitignore to /lib/ and /lib64/ (top-level only) so app/src/lib/ is tracked.
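The anchored rules would look like this in .gitignore, where the leading slash scopes each pattern to the repository root:

```
# Python wheel/venv artifacts at the repo root only;
# app/src/lib/ stays tracked.
/lib/
/lib64/
```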

Verification

  • bun run lint — clean
  • bun run build — clean (Next.js 16 production build)
  • bun run start — SSR render of / contains the open-set banner, the snapshot pill, the five sensitivity selector chips, and per-model Rank/95% interval rows for all 12 models. /paper renders SiteHeader with the snapshot pill and Benchmark action link, no view selector.
  • Scoring math reconciled against analysis.py for both countries' top-5 models (gpt-5.5: US 89.20 / UK 81.64; grok-4.20: US 88.71; gemini-3.1-pro-preview: UK 79.61, etc.).

Test plan

  • CI passes
  • Visit / — confirm header drops "Top score" stat and "Leading: GPT-5.5" pill, and the snapshot pill is visible
  • Switch sensitivity view to Positive cases — leaderboard reorders and ranks/intervals refresh
  • Switch to Zero cases — same
  • Visit /paper — confirm the same sticky header style without a view selector
  • Mobile width — confirm bootstrap interval line wraps under each model row

Follow-ups (not in this PR)

  • Tier-1 per-model deep-dive page (/model/[id])
  • Tier-2 cross-country compare and cost surfacing — see docs/site_improvements_scope.md

🤖 Generated with Claude Code

@vercel
vercel bot commented May 2, 2026

The latest updates on your projects:

Project | Deployment | Actions | Updated (UTC)
policybench | Ready | Preview, Comment | May 6, 2026 10:38am

