Fix residual paper issues and scope site improvements by MaxGhenis · Pull Request #8 · PolicyEngine/policybench

MaxGhenis · 2026-05-02T12:49:03Z

Summary

Resolves the residual paper issues flagged in the post-merge review of the 2026-05-01 snapshot and adds a scoping document for next-round site improvements.

Paper / data residuals

UK transfer dataset. Replace the "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61…. The manifest, paper text, and runtime metadata all reference the same pinned commit URL. policybench/scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
Validation framing. Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.
Model-alias instability. Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions. The manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.
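
The verify-then-fallback behavior described for the UK transfer dataset can be sketched roughly as follows. This is a minimal illustration, not the code in policybench/scenarios.py: `resolve_dataset`, `sha256_of`, the injectable `fetch` argument, and the choice to raise on a local hash mismatch are all assumptions. Only the POLICYBENCH_UK_DATASET_DOWNLOAD=0 switch comes from the description above.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large artifacts are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _http_fetch(url: str) -> bytes:
    """Default downloader; callers may inject a different one (e.g. in tests)."""
    with urlopen(url) as response:
        return response.read()


def resolve_dataset(
    local_path: Path,
    pinned_url: str,
    expected_sha256: str,
    fetch: Callable[[str], bytes] = _http_fetch,
) -> Path:
    """Return a verified local copy, downloading from the pinned URL if absent.

    Hypothetical helper: the name, signature, and error types are illustrative.
    """
    if local_path.exists():
        # Assumption: a local copy with the wrong hash is an error, not a
        # trigger for re-downloading.
        if sha256_of(local_path) != expected_sha256:
            raise ValueError(f"sha256 mismatch for {local_path}")
        return local_path

    # The environment switch named in the PR description disables the fallback.
    if os.environ.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") == "0":
        raise FileNotFoundError(
            f"{local_path} is missing and the download fallback is disabled"
        )

    payload = fetch(pinned_url)
    # Verify the downloaded bytes before anything is written to disk.
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("sha256 mismatch for downloaded dataset")
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(payload)
    return local_path
```

Verifying the downloaded bytes before writing them means a corrupted or tampered download never lands at the path that later runs would treat as a trusted local copy.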

Manuscript and docs cleanup

docs/ consolidation. Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook (results.md) and benchmark card and points to the rendered manuscript via a thin docs/paper.md.
Methodology refinements. Reframe the bounded score as step-credit by error band: because the four thresholds are nested, the mean of the four threshold indicators is mathematically equivalent to step partial credit. Expand the bootstrap caveat to enumerate the uncertainty sources that the household-resampling intervals do not capture (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty).
Exclusion-fraction reporting. Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit / single-family / single-SPM-unit filter) and the UK exclusion fraction (0.1%). State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.
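
The step-credit equivalence noted under Methodology refinements can be checked with a small sketch. The threshold values below are placeholders chosen for illustration and are not taken from the paper; `bounded_score_mean` and `bounded_score_step` are hypothetical names.

```python
from typing import Sequence

# Assumption: illustrative relative-error thresholds, NOT the paper's values.
THRESHOLDS: Sequence[float] = (0.01, 0.05, 0.10, 0.25)


def bounded_score_mean(rel_error: float, thresholds: Sequence[float] = THRESHOLDS) -> float:
    """Mean of the indicator 'error is within threshold' over all thresholds."""
    return sum(rel_error <= t for t in thresholds) / len(thresholds)


def bounded_score_step(rel_error: float, thresholds: Sequence[float] = THRESHOLDS) -> float:
    """Step partial credit: credit depends only on the tightest band reached."""
    ordered = sorted(thresholds)
    for rank, threshold in enumerate(ordered):
        if rel_error <= threshold:
            # Passing the rank-th tightest threshold implies passing every
            # looser one, because the thresholds are nested.
            return (len(ordered) - rank) / len(ordered)
    return 0.0


if __name__ == "__main__":
    for err in (0.0, 0.03, 0.07, 0.2, 0.5):
        print(err, bounded_score_mean(err), bounded_score_step(err))
```

With four thresholds, both functions can only return 0, 0.25, 0.5, 0.75, or 1.0, which is why the score reads naturally as credit by error band.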

New paper sections

@tbl-fed-state. US within-10% accuracy on federal refundable credits, state refundable credits, and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors. Joint accuracy is consistently 5–15 points below either marginal hit rate.
@tbl-impact-floor. Top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0 — so readers can see whether the 0.3 default is load-bearing.
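
To illustrate why the joint rate in @tbl-fed-state can sit below both marginal rates, here is a toy sketch. The four households are made-up numbers for illustration only, not benchmark results. The helper names, the record layout, and the treatment of a zero reference value (counted as a hit only when the prediction is also zero) are assumptions.

```python
from typing import Dict, List


def within_tolerance(pred: float, truth: float, tol: float = 0.10) -> bool:
    """True when pred is within tol (relative) of truth.

    Assumption: a zero reference counts as a hit only for an exact zero prediction.
    """
    if truth == 0:
        return pred == 0
    return abs(pred - truth) <= tol * abs(truth)


def hit_rates(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Marginal federal and state hit rates plus the household-level joint rate."""
    fed = [within_tolerance(r["fed_pred"], r["fed_true"]) for r in rows]
    state = [within_tolerance(r["state_pred"], r["state_true"]) for r in rows]
    # A household counts toward the joint rate only if both credits are hits.
    joint = [f and s for f, s in zip(fed, state)]
    n = len(rows)
    return {"federal": sum(fed) / n, "state": sum(state) / n, "joint": sum(joint) / n}


# Made-up toy households (illustrative only).
toy = [
    {"fed_pred": 1000, "fed_true": 1000, "state_pred": 200, "state_true": 200},
    {"fed_pred": 950, "fed_true": 1000, "state_pred": 100, "state_true": 200},
    {"fed_pred": 500, "fed_true": 1000, "state_pred": 210, "state_true": 200},
    {"fed_pred": 0, "fed_true": 0, "state_pred": 300, "state_true": 300},
]

if __name__ == "__main__":
    print(hit_rates(toy))
```

In this toy data each marginal rate is 0.75 while the joint rate is 0.5, because the federal miss and the state miss fall on different households. The joint rate can never exceed the smaller marginal rate.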

Site scoping

New docs/site_improvements_scope.md ranks improvements to policybench.org from tier 1 (open-set leakage banner, sensitivity selector, bootstrap rank intervals, per-model deep-dive page) through tier 2–3 (cross-country compare, cost surfacing, scenario filtering) up to tier 4 (held-out protected leaderboard, live evaluation, country expansion). Recommended first PR combines the three tier-1 leaderboard items.

Verification

uv run pytest -q — 186 passed
uv run ruff check . — clean
uv run ruff format --check . — clean
bun run lint — clean
bun run build — clean
uv run python paper/render_paper.py — regenerated PDF + web
Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

Test plan

CI passes on the branch
Spot-check the rendered PDF for the new federal+state and impact-floor tables
Verify manifest.json hashes still match committed files

🤖 Generated with Claude Code

vercel · 2026-05-02T12:49:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policybench	Ready	Preview, Comment	May 6, 2026 10:38am

UK transfer dataset
- Replace "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61. Manifest, paper, and runtime metadata all reference the same pinned commit URL.
- scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.

Validation framing
- Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.

Model-alias instability
- Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions explicitly.
- Manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.

docs/ consolidation
- Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook and benchmark card and points to the rendered manuscript.
- Add docs/paper.md as a thin reading guide and update myst.yml.

Methodology and scope refinements
- Reframe the bounded score as step-credit by error band (paper methodology section). The mean-of-four-thresholds is mathematically equivalent to step partial credit because the thresholds are nested.
- Expand the bootstrap caveat to enumerate which uncertainty sources (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty) the household-resampling intervals do not cover.
- Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit/single-family/single-SPM-unit filter) and the UK exclusion fraction (0.1%).
- State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.

New paper sections
- @tbl-fed-state: US within-10% accuracy on federal vs state refundable credits and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors.
- @tbl-impact-floor: top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.

Site scoping
- New docs/site_improvements_scope.md ranks improvements to policybench.org from open-set leakage banner / sensitivity selector / bootstrap rank intervals (tier 1) through cross-country compare, per-model deep-dive pages, cost surfacing, scenario filtering, and protected leaderboard (tiers 2-4).

Verification
- uv run pytest -q (186 passed)
- uv run ruff check . (clean)
- uv run ruff format --check . (clean)
- bun run lint (clean)
- bun run build (clean)
- uv run python paper/render_paper.py (regenerated PDF + web)
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

MaxGhenis force-pushed the fix-paper-residuals branch from ec531d5 to 6dc2107 · May 6, 2026 10:35

vercel Bot deployed to Preview · May 6, 2026 10:38

MaxGhenis wants to merge 1 commit into main from fix-paper-residuals
