
Site tier-1: open-set banner, sensitivity selector, bootstrap intervals #9

Open
MaxGhenis wants to merge 1 commit into main from site-tier1

Conversation

@MaxGhenis (Contributor)

Summary

Implements the tier-1 leaderboard improvements from docs/site_improvements_scope.md, plus a shared sticky header that the /paper page now reuses (matching the home page chrome, without the Global/US/UK view selector).

Header

  • Extract SiteHeader from Hero. The new component owns the sticky brand + nav + view-selector + action-link layout and supports an alwaysExpanded mode for pages that don't drive their own collapse.
  • Hero refactored to wrap SiteHeader and pass the country-aware subtitle, stat strip, and snapshot pill as expandedContent. Drops the Top score stat and the Leading: <model> sidebar — the leaderboard itself is the canonical source for both.
  • /paper uses SiteHeader with alwaysExpanded, no view selector, and a Benchmark action link. The page body keeps its eyebrow/buttons/iframe; the inline H1 is gone since the header carries the brand.
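The header split above can be sketched as a prop interface. All names here are illustrative assumptions, not the component's actual API:

```typescript
// Hypothetical prop shape for the extracted SiteHeader; the real
// component's props may differ.
interface SiteHeaderProps {
  // Pages like /paper that don't drive their own collapse pass true.
  alwaysExpanded?: boolean;
  // The home page shows the Global/US/UK selector; /paper omits it.
  showViewSelector?: boolean;
  // Single action link, e.g. "Benchmark" on /paper.
  actionLink?: { label: string; href: string };
  // Hero-provided content: subtitle, stat strip, snapshot pill.
  // (ReactNode in the real app; typed loosely here to stay framework-free.)
  expandedContent?: unknown;
}

// Example: the configuration /paper would pass.
const paperHeader: SiteHeaderProps = {
  alwaysExpanded: true,
  showViewSelector: false,
  actionLink: { label: "Benchmark", href: "/" },
};
```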

Open-set banner and snapshot pill

  • Warning-tinted note above the leaderboard: "Open-set leaderboard. The public scenario explorer exposes prompts and PolicyEngine reference outputs, so future model releases or fine-tunes could learn from the released cases…".
  • Snapshot date pill (Snapshot 2026-05-01) in the hero stat row on the home page and next to the Manuscript eyebrow on /paper.

Sensitivity-view selector

  • Segmented control above the leaderboard table: Main / Amount only / Binary only / Positive cases / Zero cases. Selecting a view rescores models client-side from scenarioPredictions and reorders the table; the description for the active view appears inline.
  • New utilities under app/src/lib/:
    • scoring.ts ports score_single_prediction (mean of exact / within-1% / within-5% / within-10% for amount outputs; classification accuracy for binary; output-group resolution for person-expanded variables). Verified against canonical analysis.py on the snapshot for both US and UK headline scopes.
    • sensitivity.ts builds the per-row score table from a DashboardBundle and aggregates output-group means → country → global, preserving country-equal weighting. Sensitivity views filter rows before aggregation.
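The scoring and aggregation ports can be illustrated with a minimal sketch. The shapes below are assumptions; the real scoring.ts and sensitivity.ts also handle output-group resolution and person-expanded variables:

```typescript
// Amount outputs: mean of four pass/fail bands (exact, within 1%, 5%, 10%
// of the reference). A zero reference can only be matched exactly.
function scoreAmount(predicted: number, reference: number): number {
  const relErr =
    reference === 0
      ? predicted === 0 ? 0 : Infinity
      : Math.abs(predicted - reference) / Math.abs(reference);
  const bands = [predicted === reference, relErr <= 0.01, relErr <= 0.05, relErr <= 0.1];
  return bands.filter(Boolean).length / bands.length;
}

// Binary outputs: plain classification accuracy.
function scoreBinary(predicted: boolean, reference: boolean): number {
  return predicted === reference ? 1 : 0;
}

// Country-equal weighting: mean within each country, then mean of the
// country means, so each country contributes equally regardless of row count.
function countryEqualMean(rows: { country: string; score: number }[]): number {
  const byCountry = new Map<string, number[]>();
  for (const r of rows) {
    const bucket = byCountry.get(r.country) ?? [];
    bucket.push(r.score);
    byCountry.set(r.country, bucket);
  }
  const means = [...byCountry.values()].map(
    (s) => s.reduce((a, b) => a + b, 0) / s.length,
  );
  return means.reduce((a, b) => a + b, 0) / means.length;
}
```

For example, a prediction 3% off the reference earns 0.5 (it passes the 5% and 10% bands but not exact or 1%), and a US bucket with twice as many rows as the UK bucket still counts for half the global mean.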

Bootstrap rank intervals

  • bootstrap.ts implements the household-resampling bootstrap with a deterministic mulberry32 RNG (seed 42, 400 draws). For each model in the active sensitivity view, it reports the 95% score interval and the rank range.
  • ModelLeaderboard renders Rank N(-M) · 95% L-U next to each model's point estimate. Sample output for the main global view: Rank 1 has 95% CI 79.8-83.5; Rank 2-3 cluster at 77.4-81.8; the tail spreads Rank 8-11 / Rank 9-11. Tooltip names the bootstrap parameters.
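The resampling loop can be sketched as follows, assuming a flat array of per-household scores for one model; the real bootstrap.ts also tracks cross-model rank ranges per draw:

```typescript
// mulberry32: tiny deterministic PRNG, so intervals are reproducible
// across builds for a fixed seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Resample households with replacement, recompute the mean score on each
// draw, and take the 2.5th/97.5th percentiles of the draw means.
function bootstrapInterval(
  householdScores: number[],
  draws = 400,
  seed = 42,
): [number, number] {
  const rand = mulberry32(seed);
  const n = householdScores.length;
  const means: number[] = [];
  for (let d = 0; d < draws; d++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += householdScores[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  const at = (q: number) => means[Math.min(draws - 1, Math.floor(q * draws))];
  return [at(0.025), at(0.975)];
}
```

Because the RNG is seeded, the same snapshot always yields the same intervals, which keeps the rendered `Rank N(-M) · 95% L-U` rows stable across deployments.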

Repo

  • Move the Python wheel-artifact lib/ rule in .gitignore to /lib/ and /lib64/ (top-level only) so app/src/lib/ is tracked.
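The anchored rules would look like this in .gitignore, where the leading slash scopes each pattern to the repository root:

```
# Python wheel/venv artifacts at the repo root only;
# app/src/lib/ stays tracked.
/lib/
/lib64/
```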

Verification

  • bun run lint — clean
  • bun run build — clean (Next.js 16 production build)
  • bun run start — SSR render of / contains the open-set banner, the snapshot pill, the five sensitivity selector chips, and per-model Rank/95% interval rows for all 12 models. /paper renders SiteHeader with the snapshot pill and Benchmark action link, no view selector.
  • Scoring math reconciled against analysis.py for both countries' top-5 models (gpt-5.5: US 89.20 / UK 81.64; grok-4.20: US 88.71; gemini-3.1-pro-preview: UK 79.61, etc.).

Test plan

  • CI passes
  • Visit / — confirm header drops "Top score" stat and "Leading: GPT-5.5" pill, and the snapshot pill is visible
  • Switch sensitivity view to Positive cases — leaderboard reorders and ranks/intervals refresh
  • Switch to Zero cases — same
  • Visit /paper — confirm the same sticky header style without a view selector
  • Mobile width — confirm bootstrap interval line wraps under each model row

Follow-ups (not in this PR)

  • Tier-1 per-model deep-dive page (/model/[id])
  • Tier-2 cross-country compare and cost surfacing — see docs/site_improvements_scope.md

🤖 Generated with Claude Code

@vercel
vercel bot commented May 2, 2026

The latest updates on your projects:

Project | Deployment | Actions | Updated (UTC)
policybench | Ready | Preview, Comment | May 6, 2026 10:38am

