Fix residual paper issues and scope site improvements #8

Open

MaxGhenis wants to merge 1 commit into main from fix-paper-residuals

Conversation

@MaxGhenis (Contributor)

Summary

Resolves the residual paper issues flagged in the post-merge review of the 2026-05-01 snapshot and adds a scoping document for next-round site improvements.

Paper / data residuals

  • UK transfer dataset. Replace the "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61…. The manifest, paper text, and runtime metadata all reference the same pinned commit URL. policybench/scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
  • Validation framing. Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.
  • Model-alias instability. Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions. The manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.
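The verify-then-download-fallback flow for the UK dataset can be sketched roughly as follows. This is an illustrative assumption of how such a check might look, not the actual `policybench/scenarios.py` code; the URL and hash here are placeholders (the real pinned commit and full sha256 live in the manifest):

```python
import hashlib
import os
import urllib.request
from pathlib import Path

# Placeholder values -- the real ones are pinned in the manifest and
# reference PolicyEngine/policyengine-uk-data at a fixed commit.
PINNED_URL = "https://raw.githubusercontent.com/PolicyEngine/policyengine-uk-data/<pinned-commit>/<artifact>"
EXPECTED_SHA256 = "<full-sha256-from-manifest>"


def sha256_of(path: Path) -> str:
    """Stream the file through sha256 so large artifacts stay memory-cheap."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_uk_dataset(local_path: Path) -> Path:
    """Return a verified local copy, downloading from the pinned URL if allowed."""
    if local_path.exists() and sha256_of(local_path) == EXPECTED_SHA256:
        return local_path
    # POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
    if os.environ.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") == "0":
        raise FileNotFoundError(
            "UK dataset missing or hash-mismatched and download fallback is disabled"
        )
    urllib.request.urlretrieve(PINNED_URL, local_path)
    if sha256_of(local_path) != EXPECTED_SHA256:
        raise ValueError("downloaded artifact does not match the pinned sha256")
    return local_path
```

Verifying the hash after download, not just on the cached copy, means a moved or force-pushed upstream file fails loudly instead of silently changing the benchmark inputs.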

Manuscript and docs cleanup

  • docs/ consolidation. Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook (results.md) and benchmark card and points to the rendered manuscript via a thin docs/paper.md.
  • Methodology refinements. Reframe the bounded score as step-credit by error band (the mean of four nested thresholds is mathematically equivalent to step partial credit). Expand the bootstrap caveat to enumerate the uncertainty sources the household-resampling intervals do not capture (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty).
  • Exclusion-fraction reporting. Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit / single-family / single-SPM-unit filter) and the UK exclusion fraction (0.1%). State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.
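The step-credit equivalence claimed above can be checked numerically: with nested thresholds, averaging the four pass/fail indicators yields exactly a step partial-credit schedule. The bands used here (1%, 5%, 10%, 20%) and the credit levels are illustrative assumptions, not the benchmark's published values:

```python
# Nested error bands: passing a tight band implies passing every looser one.
THRESHOLDS = (0.01, 0.05, 0.10, 0.20)


def mean_of_thresholds(rel_error: float) -> float:
    # Bounded score as defined: mean of four pass/fail indicators.
    return sum(rel_error <= t for t in THRESHOLDS) / len(THRESHOLDS)


def step_credit(rel_error: float) -> float:
    # Equivalent step schedule: full credit in the tightest band,
    # then 0.75 / 0.5 / 0.25, and zero outside every band.
    for credit, t in zip((1.0, 0.75, 0.5, 0.25), THRESHOLDS):
        if rel_error <= t:
            return credit
    return 0.0


# The two formulations agree on every error magnitude.
for e in (0.004, 0.03, 0.08, 0.15, 0.5):
    assert mean_of_thresholds(e) == step_credit(e)
```

The equivalence holds only because the thresholds are nested; with overlapping or disjoint bands the indicator mean would no longer be a monotone step function of the error.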

New paper sections

  • @tbl-fed-state. US within-10% accuracy on federal refundable credits, on state refundable credits, and on the household-level joint (both within 10%), showing how marginal accuracy hides joint federal/state credit errors. Joint accuracy is consistently 5–15 points below either marginal hit rate.
  • @tbl-impact-floor. Top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0 — so readers can see whether the 0.3 default is load-bearing.
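A toy illustration of why the joint rate sits below either marginal rate: households that miss the federal credit need not be the same households that miss the state credit, so the misses accumulate. The hit flags below are made up for exposition, not benchmark data (the gap here is larger than the 5–15 points reported):

```python
# 1 = reference-matching within 10%, 0 = miss, per household (toy data).
fed_hits = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
state_hits = [1, 0, 1, 1, 1, 0, 1, 1, 1, 1]

n = len(fed_hits)
fed_acc = sum(fed_hits) / n  # marginal federal accuracy
state_acc = sum(state_hits) / n  # marginal state accuracy
# Joint: the household must be within 10% on BOTH credits.
joint_acc = sum(f and s for f, s in zip(fed_hits, state_hits)) / n

# Both marginals are 0.8, but the joint is only 0.6 because the
# federal misses and state misses fall on different households.
assert joint_acc <= min(fed_acc, state_acc)
```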

Site scoping

  • New docs/site_improvements_scope.md ranks improvements to policybench.org from tier 1 (open-set leakage banner, sensitivity selector, bootstrap rank intervals, per-model deep-dive page) through tier 2–3 (cross-country compare, cost surfacing, scenario filtering) up to tier 4 (held-out protected leaderboard, live evaluation, country expansion). Recommended first PR combines the three tier-1 leaderboard items.

Verification

  • uv run pytest -q — 186 passed
  • uv run ruff check . — clean
  • uv run ruff format --check . — clean
  • bun run lint — clean
  • bun run build — clean
  • uv run python paper/render_paper.py — regenerated PDF + web
  • Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

Test plan

  • CI passes on the branch
  • Spot-check the rendered PDF for the new federal+state and impact-floor tables
  • Verify manifest.json hashes still match committed files

🤖 Generated with Claude Code


vercel Bot commented May 2, 2026

The latest updates on your projects.

Project: policybench · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): May 6, 2026 10:38am
