Fix residual paper issues and scope site improvements#8
Open
UK transfer dataset
- Replace "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61. Manifest, paper, and runtime metadata all reference the same pinned commit URL.
- scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.

Validation framing
- Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.

Model-alias instability
- Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions explicitly.
- Manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.

docs/ consolidation
- Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook and benchmark card and points to the rendered manuscript.
- Add docs/paper.md as a thin reading guide and update myst.yml.

Methodology and scope refinements
- Reframe the bounded score as step-credit by error band (paper methodology section). The mean-of-four-thresholds is mathematically equivalent to step partial credit because the thresholds are nested.
- Expand the bootstrap caveat to enumerate which uncertainty sources (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty) the household-resampling intervals do not cover.
- Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit/single-family/single-SPM-unit filter) and the UK exclusion fraction (0.1%).
- State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.

New paper sections
- @tbl-fed-state: US within-10% accuracy on federal vs state refundable credits and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors.
- @tbl-impact-floor: top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.

Site scoping
- New docs/site_improvements_scope.md ranks improvements to policybench.org from open-set leakage banner / sensitivity selector / bootstrap rank intervals (tier 1) through cross-country compare, per-model deep-dive pages, cost surfacing, scenario filtering, and protected leaderboard (tiers 2-4).

Verification
- uv run pytest -q (186 passed)
- uv run ruff check . (clean)
- uv run ruff format --check . (clean)
- bun run lint (clean)
- bun run build (clean)
- uv run python paper/render_paper.py (regenerated PDF + web)
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.
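The equivalence claimed under "Methodology and scope refinements" can be checked numerically. The following is a minimal sketch, not the repository's scoring code: the four threshold values and both function names are illustrative assumptions, since the commit message says only that there are four nested thresholds.

```python
# Hypothetical relative-error thresholds. The commit message states only that
# four nested thresholds exist; these specific values are assumptions.
THRESHOLDS = (0.01, 0.05, 0.10, 0.25)


def mean_of_thresholds(rel_error: float) -> float:
    """Average of the four within-threshold hit indicators."""
    return sum(rel_error <= t for t in THRESHOLDS) / len(THRESHOLDS)


def step_credit(rel_error: float) -> float:
    """Partial credit set by the tightest error band the prediction lands in."""
    for i, t in enumerate(sorted(THRESHOLDS)):
        if rel_error <= t:
            # Landing in band i means every looser threshold is also met,
            # so (n - i) of the n indicators fire.
            return 1 - i / len(THRESHOLDS)
    return 0.0


# The two definitions agree on every band and on the band edges.
for err in (0.0, 0.01, 0.03, 0.05, 0.08, 0.10, 0.20, 0.25, 0.50):
    assert mean_of_thresholds(err) == step_credit(err)
```

Because the thresholds are nested, meeting a tight threshold implies meeting every looser one, which is why averaging the indicators collapses to a single step function of the error band.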
ec531d5 to 6dc2107
Summary
Resolves the residual paper issues flagged in the post-merge review of the 2026-05-01 snapshot and adds a scoping document for next-round site improvements.
Paper / data residuals
- UK transfer dataset. The artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61…. The manifest, paper text, and runtime metadata all reference the same pinned commit URL. policybench/scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
- Model-alias instability. Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions. The manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.

Manuscript and docs cleanup
- docs/ consolidation. Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook (results.md) and benchmark card and points to the rendered manuscript via a thin docs/paper.md.

New paper sections
- @tbl-fed-state. US within-10% accuracy on federal refundable credits, state refundable credits, and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors. Joint accuracy is consistently 5–15 points below either marginal hit rate.
- @tbl-impact-floor. Top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.

Site scoping
- docs/site_improvements_scope.md ranks improvements to policybench.org from tier 1 (open-set leakage banner, sensitivity selector, bootstrap rank intervals, per-model deep-dive page) through tier 2–3 (cross-country compare, cost surfacing, scenario filtering) up to tier 4 (held-out protected leaderboard, live evaluation, country expansion). Recommended first PR combines the three tier-1 leaderboard items.

Verification
- uv run pytest -q: 186 passed
- uv run ruff check .: clean
- uv run ruff format --check .: clean
- bun run lint: clean
- bun run build: clean
- uv run python paper/render_paper.py: regenerated PDF + web
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

Test plan
- manifest.json hashes still match committed files

🤖 Generated with Claude Code
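The verify-then-download behaviour described for policybench/scenarios.py under "Paper / data residuals" could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in the PR: the function names, the `.part` temporary file, and the choice to raise on a local hash mismatch (rather than re-download) are all hypothetical. Only the POLICYBENCH_UK_DATASET_DOWNLOAD=0 switch comes from the description, and the expected digest is passed in as a parameter because the PR text truncates it.

```python
import hashlib
import os
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_dataset(local_path: Path, url: str, expected_sha256: str) -> Path:
    """Return a verified dataset path, downloading only if no local copy exists."""
    if local_path.exists():
        # Assumption: a corrupted local copy is an error, not a download trigger.
        if sha256_of(local_path) != expected_sha256:
            raise ValueError(f"sha256 mismatch for {local_path}")
        return local_path

    # Documented opt-out: POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the fallback.
    if os.environ.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") == "0":
        raise FileNotFoundError(f"{local_path} missing and download disabled")

    local_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = local_path.with_suffix(local_path.suffix + ".part")
    urllib.request.urlretrieve(url, tmp)
    if sha256_of(tmp) != expected_sha256:
        tmp.unlink()
        raise ValueError(f"sha256 mismatch for download from {url}")
    tmp.replace(local_path)  # publish the file only after it verifies
    return local_path
```

Writing to a temporary file and renaming it after verification means a failed or tampered download never occupies the path that later runs would treat as the local copy.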