Postmortem: we dropped Social Security retirement from microsimulation data #968

MaxGhenis · 2026-05-13T12:50:19Z

MaxGhenis
May 13, 2026
Maintainer

A chronological account of how policyengine-us-data published two Social Security regressions over a roughly three-month window (2026-02-18 to 2026-05-12), why the second slipped through the safety net the team installed after the March 2026 incident (and why that net would also have missed the first), and what would have caught them. All times UTC. Numbers throughout are weighted by the Enhanced CPS household weights.

Reproduction appendix (notebook, plain Python, per-release H5 sweep, results CSV, plots): https://gist.github.com/MaxGhenis/ddf033f52eb51e6082bade87564f7b83 — metadata-only, no large H5 downloads required.

TL;DR

policyengine-us-data shipped Social Security regressions during two distinct windows. The first (~14 days, 2026-02-18 → 2026-03-04) understated total social_security at sim-year 2024 by ~44% via a base-year-vs-projected-year discontinuity. The second (~9 days live, 2026-05-04 → 2026-05-12) understated total social_security at every sim year by ~65% because the retirement subcomponent stopped being emitted to the H5. Both were caused by a data-model contract break around what the data pipeline emits vs what policyengine-us computes. The fix that closed the first regression (#554) established the right contract — emit leaves, compute aggregates — but the model side later violated the same contract from the other direction by converting an emitted input to a formula. Detection of the second regression came from an external signal: ~48% baseline senior SPM poverty in a new dashboard, surfaced 2026-05-12 14:22 UTC and fixed 2h 40m later.

Chronology

2024-08-24 — Latent pattern established

Commit c28821b (@nikhilwoodruff, "Improve calibration for ECPS") removes a line that was shadowing the long IMPUTED_VARIABLES list in extended_cps.py:

-IMPUTED_VARIABLES = ["employment_income"]

After this commit, PUF imputation is active for the full 14-variable list, including social_security. The four sub-components (_retirement, _disability, _survivors, _dependents) are set independently from CPS source codes and are not reconciled with the PUF-imputed total. The H5 now stores two unreconciled paths for the same logical quantity.

For most of that time, however, the two paths agreed: the sweep (below) shows the stored aggregate and the subcomponent sum within ~3% of each other from the earliest HF-tagged release (2025-05-27) through 2026-02-17. The mechanical pattern was latent; the magnitude was not yet damaging.

2026-02-12 → 2026-02-17 — puf_impute.py extraction

Issue #530 ("CPS top-coding caps AGI at $6.26M") motivates a pipeline restructure that extracts PUF imputation into its own module. The extraction lands in commits 6ecfc19 (2026-02-16) and 85f7e41 (2026-02-17 16:08 UTC) on a branch eventually merged via PR #530/#516.

The IMPUTED_VARIABLES list is preserved, but the new module's PUF-imputation behavior is materially narrower: it imputes positive social_security on far fewer PUF-cloned records than the previous implementation.

2026-02-18 22:11 UTC — Severe discontinuity begins

Release 1.69.0 is the first published H5 built with the new puf_impute.py. The stored social_security/2024 weighted total drops from $1,434B (release 1.68.0, 2026-02-17 16:13 UTC) to $824B between two releases about 30 hours apart. Subcomponents are unaffected at ~$1,430B. The two stored paths now disagree by ~42% in weighted dollars.

When policyengine-us runs a simulation:

At sim-year 2024 (the base year) it uses the stored aggregate → $824B
At sim-year 2025+ it falls through to the formula, summing the four subcomponents → ~$1,430B uprated

This produces a year-over-year step jump of ~$600B in total Social Security at the base-year boundary. The same artifact drove a ~9-point Gini drop between 2024 and 2025 (net income Gini 0.6502 → 0.5625), which is what was first noticed.
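The base-year boundary behaviour can be illustrated with a toy model. This is a minimal sketch, not policyengine-us code: the function name total_social_security and the flat 2.5% annual uprating (borrowed from the COLA factor used for the three-year plot later in this post) are assumptions for illustration, and the dollar figures are the rounded totals quoted above.

```python
# Toy illustration of the dual-path discontinuity (not policyengine-us code).
BASE_YEAR = 2024
STORED_AGGREGATE_B = 824.0    # stored social_security/2024 in release 1.69.0, $B
SUBCOMPONENT_SUM_B = 1430.0   # sum of the four stored subcomponents at 2024, $B
ASSUMED_UPRATING = 1.025      # assumption: flat 2.5% per year


def total_social_security(sim_year: int) -> float:
    """Return total Social Security ($B) as the simulator would see it."""
    if sim_year == BASE_YEAR:
        # A stored base-year value takes precedence over the formula.
        return STORED_AGGREGATE_B
    # No stored value for later years: fall through to the adds formula
    # over the uprated subcomponents.
    years_ahead = sim_year - BASE_YEAR
    return SUBCOMPONENT_SUM_B * ASSUMED_UPRATING ** years_ahead


step = total_social_security(2025) - total_social_security(2024)
print(f"2024: {total_social_security(2024):.0f}B")
print(f"2025: {total_social_security(2025):.0f}B")
print(f"step: {step:.0f}B")
```

Under the assumed uprating the step comes out somewhat above $600B; the point is only that the jump is an artifact of switching paths, not of any change in the underlying population.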

2026-02-23 — Bug filed

Issue #551 (@MaxGhenis): "PUF imputation overwrites social_security but not sub-components, causing base-year discontinuity." The issue reports the magnitudes at the time of filing (5 days into the severe window):

Path	Nonzero records	Weighted recipients	Weighted total
Base year (stored aggregate)	2,281	33.6M	~$809B
Projected years (formula)	7,197	67.8M	~$1,436B
Ratio	3.15×	2.02×	1.78×

The issue summary describes "~3x more Social Security recipients in projected years vs the base year" — that ratio is correct for raw nonzero records (3.15×) but overstates the weighted-recipient or weighted-dollar impact (2.02× and 1.78× respectively, i.e. a ~44% understatement of total SS at 2024).
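The three ratios and the ~44% figure follow directly from the table. The snippet below is just that arithmetic, using the numbers reported in #551; the dictionary keys are labels chosen here, not identifiers from the codebase.

```python
# Figures from the #551 table: (base-year path, projected-year path).
paths = {
    "nonzero_records": (2_281, 7_197),
    "weighted_recipients_m": (33.6, 67.8),
    "weighted_total_b": (809.0, 1_436.0),
}

# Projected-over-base ratio for each measure.
ratios = {name: projected / base for name, (base, projected) in paths.items()}

# Share of the formula-path total that the stored base-year path is missing.
base_total, projected_total = paths["weighted_total_b"]
understatement = 1 - base_total / projected_total

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x")
print(f"understatement of total SS at 2024: {understatement:.1%}")
```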

PR #552 opens the same day with a tactical fix: rescale subcomponents proportionally after PUF overwrites the aggregate.

2026-02-26 — Tactical fix abandoned in favor of systemic fix

PR #552 is closed unmerged. The team takes the systemic path instead: stop storing variables that policyengine-us computes via formula / adds / subtracts, and impute the leaves directly. This eliminates the dual-path disagreement by removing one of the paths from storage. The work continues as PR #554.

2026-03-04 15:23 UTC — #554 merged; severe discontinuity ends

#554 ("Calibrate retirement contributions: targets, SS reconciliation, and QRF imputation", @PavelMakarchuk) merges. The relevant piece for what follows is _drop_formula_variables: the exporter strips variables that policyengine-us computes via formula / adds / subtracts. The first H5 published after this (1.71.2 at 2026-03-04 16:56 UTC) no longer stores social_security. Total severe pre-#554 window: ~13 days 17 hours.

The fix has a known cost: the data pipeline now depends on complete and consistent imputation of the four Social Security sub-components on both halves of the dataset. On the PUF clone half, subcomponents aren't yet imputed by the new pipeline, so the first post-#554 H5s undercount total SS by ~4% (subcomponent sum $1,383B vs the pre-#554 subcomponent path of $1,436B). This is a much milder effect than the discontinuity it replaced.
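The export-pruning rule can be sketched as follows. This is not the actual _drop_formula_variables implementation: VariableSpec, drop_computed_variables, and the registry layout are hypothetical stand-ins, and the three non-retirement variable names are assumed from the suffixes listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    """Hypothetical stand-in for the metadata a model variable exposes."""
    name: str
    has_formula: bool = False
    adds: tuple = ()
    subtracts: tuple = ()

    @property
    def is_computed(self) -> bool:
        # "Computed" means the model derives the value itself.
        return self.has_formula or bool(self.adds) or bool(self.subtracts)


def drop_computed_variables(data: dict, registry: dict) -> dict:
    """Return only the entries the model does not compute on its own."""
    kept = {}
    for name, values in data.items():
        spec = registry.get(name)
        if spec is not None and spec.is_computed:
            continue  # storing it would create a second, unreconciled path
        kept[name] = values
    return kept


SUBCOMPONENTS = (
    "social_security_retirement",
    "social_security_disability",   # name assumed from the "_disability" suffix
    "social_security_survivors",    # name assumed from the "_survivors" suffix
    "social_security_dependents",   # name assumed from the "_dependents" suffix
)
registry = {name: VariableSpec(name) for name in SUBCOMPONENTS}
registry["social_security"] = VariableSpec("social_security", adds=SUBCOMPONENTS)

raw = {name: [0.0] for name in SUBCOMPONENTS}
raw["social_security"] = [0.0]
exported = drop_computed_variables(raw, registry)
print(sorted(exported))
```

The rule keys off "is this computed by the model", which is why it is sensitive to any later change in which variables carry a formula.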

2026-03-04 ~20:00 UTC — Unrelated employment-income corruption

The same _drop_formula_variables introduced in #554 has an unrelated rename bug that drops employment_income from the H5, sending aggregate SPM poverty to ~42% (vs ~11% actual). This is documented in detail in postmortem #578.

2026-03-05 — Prevention measures from #578 land

PR #570 adds three lines of defense in response to the employment-income incident:

Aggregate poverty-rate sanity test, tolerance 5–30%.
KEY_MONETARY_VARS NaN/Inf check, including social_security.
Upload-time validator: file size, structure, employment-income aggregate, weight totals.

None of these will fire on the May Social Security regression. Details in "Why prevention missed it" below.

2026-03-14 13:41 UTC — #589 closes the post-#554 source-level window

#589 (@MaxGhenis, "QRF-impute CPS-only variables for PUF clone half") adds second-stage QRF imputation for CPS-only variables on the PUF clone half, including the four Social Security subcomponents. The subcomponent sum recovers from ~$1,383B to ~$1,449B, closing the source-level window after ~10 days.

2026-04-17 18:36 UTC — Contract break from the model side

policyengine-us#8040 (@MaxGhenis, "Add Social Security retirement benefit calculation chain") merges. It makes social_security_retirement formula-backed; the observed survey value moves to social_security_retirement_reported behind the parameter gov.simulation.reported_social_security_retirement. A variable that policyengine-us-data had been emitting as a canonical input is now the same kind of variable the exporter strips by rule.

The first policyengine-us release with the formula (verified by PyPI sdist inspection) is 1.644.0. No policyengine-us-data build picks this up immediately — releases on 2026-04-18 still lock policyengine-us==1.637.0 and so are unaffected.

2026-04-29 21:04 UTC — First bad model lock

policyengine-us-data==1.88.3, commit 491ac09, locks policyengine-us==1.674.1 (which has the retirement formula). The pipeline run fails for an unrelated reason; no artifact is produced.

2026-04-30 20:47 UTC — First completed bad generated artifact

policyengine-us-data==1.89.1, commit a2f3bb3, policyengine-us==1.678.0. Pipeline succeeds. The H5 silently drops social_security_retirement (because it now has a formula, the exporter strips it), and the H5 does not emit social_security_retirement_reported as a replacement. At sim time, the model fills social_security_retirement with the reported path, which defaults to zero. Total social_security, recomputed from adds, drops from ~$1,470B to ~$540B (a ~65% understatement). Not yet promoted to live consumers.
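The interaction between the exporter rule and the new formula can be reproduced in a few lines. This is a simplified sketch under stated assumptions: export_h5 and simulated_total are hypothetical helpers, the retirement figure is inferred as the difference between the two totals quoted above, and the other three subcomponents are lumped into a single illustrative entry.

```python
# Illustrative weighted totals in $B. Retirement is inferred as the difference
# between the ~1,470 and ~540 totals quoted above; the other three
# subcomponents are lumped into one hypothetical entry.
full_data = {
    "social_security_retirement": 930.0,
    "other_three_subcomponents": 540.0,
}


def export_h5(data: dict, formula_backed: set) -> dict:
    """Strip anything the model claims to compute by formula."""
    return {k: v for k, v in data.items() if k not in formula_backed}


def simulated_total(h5: dict) -> float:
    """Total social_security as the adds formula would recompute it."""
    if "social_security_retirement" in h5:
        retirement = h5["social_security_retirement"]
    else:
        # Formula path: read the reported input, which defaults to zero
        # when the H5 does not carry it.
        retirement = h5.get("social_security_retirement_reported", 0.0)
    return retirement + h5["other_three_subcomponents"]


before = simulated_total(export_h5(full_data, formula_backed=set()))
after = simulated_total(
    export_h5(full_data, formula_backed={"social_security_retirement"})
)
print(f"before: {before:.0f}B, after: {after:.0f}B, drop: {1 - after / before:.0%}")
```

Nothing in this chain raises an error: the exporter follows its rule, the model follows its default, and the total silently loses the retirement component.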

2026-05-04 03:05 UTC — First promoted bad artifact

policyengine-us-data==1.90.1, commit f14931e, policyengine-us==1.678.0. Bad data now serving downstream consumers. None of the existing sanity checks fire — aggregate poverty stays within the wide tolerance because the loss is concentrated in seniors; the KEY_MONETARY_VARS NaN/Inf check silently skips because the variable is absent from the H5; the upload validator has no Social Security check.

2026-05-12 06:14 UTC — Dashboard scaffold

poverty-dashboard#1 (@PavelMakarchuk): initial scaffold of an internal poverty dashboard pulling from the latest policyengine-us-data artifact via a Modal backend.

2026-05-12 14:22 UTC — Detection

poverty-dashboard#2 merges, populating the 2026 baseline and adding senior poverty columns. Baseline senior SPM poverty reads ~48% — vs published SPM senior rates in the low double digits. The number is an aggregate impossibility, not a calibration miss. Investigation starts immediately.

Investigation path (captured in the gist): impossible senior SPM rate → decompose senior SPM resources → Social Security aggregate → compare stored H5 variables against the policyengine-us variable registry → identify variables dropped by _drop_formula_variables → trace policyengine-us release history to find when social_security_retirement became formula-backed → match against policyengine-us-data lockfile and pipeline history.

2026-05-12 16:24 UTC — Model-side fix

policyengine-us#8263 (@MaxGhenis) merges:

Restores social_security_retirement as the canonical input.
Removes social_security_retirement_reported.
Removes gov.simulation.reported_social_security_retirement.
Keeps lower-level SSA calculation internals but does not wire them into the canonical retirement input.
Removes several other package-side reported-data switches and public reported variables found during the same audit.

The model-side fix was chosen over a data-side workaround (emitting social_security_retirement_reported) because the long-term contract is the right place to fix it: any variable that policyengine-us-data emits should be a canonical input in policyengine-us, full stop.

2026-05-12 16:59 UTC — Data-side fix

policyengine-us-data#960 (@MaxGhenis) merges:

Requires policyengine-us>=1.691.3.
Locks policyengine-us==1.691.3.
Removes reported SPM/data helper outputs that should not be public runtime controls.
Keeps remaining data-backed leaves explicitly named as data leaves, e.g. *_data.

2026-05-12 17:02 UTC — Live regression ends

First fixed Run Pipeline completes: run 25749650363, commit 61a43e9, policyengine-us-data==1.112.2, policyengine-us==1.691.3. Detection-to-fix: 2h 40m.

2026-05-13 03:56 UTC — First post-fix promoted release

policyengine-us-data==1.113.1 published. Total Social Security back to ~$1,550B, with social_security_retirement ≈ $1,179B.

Visualizing the regressions in published H5s

The sweep below reads each Hugging Face tag's enhanced_cps_2024.h5, weights social_security/2024 and the four subcomponents/2024 by household weight, and plots the totals. One H5 at a time, discarded after read — no large local cache.

Red is stored social_security/2024 (present only in pre-#554 H5s — the PUF-imputed path). Blue is the sum of the four subcomponents at 2024. Key features:

May 2025 → Feb 17 2026: latent mechanical pattern, mild magnitude. Stored and subcomponent paths track within ~3%. The discontinuity that #551 describes does not yet exist at scale.
Feb 18 2026 step change. Stored social_security/2024 drops from $1,434B (1.68.0) to $824B (1.69.0) between releases about 30 hours apart, when the new puf_impute.py ships. The simulator now disagrees with itself by ~44% at the base-year boundary.
Mar 4 2026: red line ends. #554 removes the stored aggregate path. Blue dips ~4% temporarily (the post-#554 source-level window) and recovers after #589 on Mar 14.
Late April → mid-May 2026: blue cliff. Subcomponent sum falls from ~$1,470B to ~$540B as social_security_retirement stops being emitted to the H5. Total SS understated by ~65% at every sim year.
May 13 2026: blue recovers. 1.113.1 publishes with retirement restored.

A three-year view, with a 2.5% COLA factor applied to project 2025 and 2026 from the subcomponent sum, makes the simulator's year-over-year discontinuity in late Feb explicit — the red 2024 line crashes to $809B while the blue 2025 and green 2026 lines continue at ~$1,500B:

Underlying data and scripts: ss_sweep_results.csv, ss_sweep.py, ss_plot.py.

Incident windows

Window	Start (UTC)	End (UTC)	Duration	Aggregate SS impact at sim-year 2024
Mechanical pattern latent, ≤ ~3% gap	2024-08-24 ~14:28	2026-02-17 16:13	~543 days	≤ ~3%
Severe pre-#554 discontinuity	2026-02-18 22:11	2026-03-04 15:23	~13d 17h	~44% understatement
Post-#554 source-level mild undercount	2026-03-04 16:56	2026-03-14 13:41	~10 days	~4% understatement
Retirement loss, generated artifacts	2026-04-30 20:47	2026-05-12 17:02	~11d 20h	~65% understatement
Retirement loss, promoted artifacts	2026-05-04 03:05	2026-05-12 17:02	~8d 14h	~65% understatement
Detection → live fix	2026-05-12 14:22	2026-05-12 17:02	~2h 40m	—
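The durations can be recomputed from the start and end timestamps with the standard library. The helper names utc and describe are chosen here for illustration; the timestamps are the ones in the table.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def describe(start: str, end: str) -> str:
    """Format the elapsed time between two UTC timestamps."""
    delta = utc(end) - utc(start)
    hours, remainder = divmod(delta.seconds, 3600)
    return f"{delta.days}d {hours}h {remainder // 60}m"


windows = {
    "Severe pre-#554 discontinuity": ("2026-02-18 22:11", "2026-03-04 15:23"),
    "Post-#554 mild undercount": ("2026-03-04 16:56", "2026-03-14 13:41"),
    "Retirement loss, generated": ("2026-04-30 20:47", "2026-05-12 17:02"),
    "Retirement loss, promoted": ("2026-05-04 03:05", "2026-05-12 17:02"),
    "Detection to live fix": ("2026-05-12 14:22", "2026-05-12 17:02"),
}

for label, (start, end) in windows.items():
    print(f"{label}: {describe(start, end)}")
```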

Why prevention missed it

The three defenses installed after the March 2026 employment-income incident all failed to fire on the May Social Security regression:

Aggregate poverty-rate sanity test (5–30% range). Overall poverty stayed within the wide tolerance because the retirement-income loss is concentrated in seniors. A senior-stratified slice would have failed; the aggregate did not.
KEY_MONETARY_VARS NaN/Inf check on social_security. This check reads the variable from the H5 and verifies no NaN/Inf, then continues silently if the variable is missing. After #554, social_security is no longer stored in the H5; the check is a no-op. social_security_retirement is not in KEY_MONETARY_VARS at all.
Upload-time validation in upload_completed_datasets.py. Validates file size, structure, employment-income aggregate, weight totals. No Social Security check.
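The difference between a check that skips absent variables and one that treats absence as a failure is small in code. The sketch below is an assumption-laden illustration, not the real test: lenient_check mimics the behaviour described above, strict_check is one possible alternative, and the KEY_MONETARY_VARS list here is a hypothetical subset that includes social_security_retirement (which, as noted, the real list does not).

```python
import math

# Hypothetical subset for illustration; not the real list.
KEY_MONETARY_VARS = ["employment_income", "social_security_retirement"]


def lenient_check(h5: dict) -> list:
    """Mimic the described behaviour: absent variables are skipped."""
    problems = []
    for name in KEY_MONETARY_VARS:
        if name not in h5:
            continue  # silent skip: an absent variable can never fail
        if any(not math.isfinite(v) for v in h5[name]):
            problems.append(f"{name}: NaN/Inf values")
    return problems


def strict_check(h5: dict) -> list:
    """Treat a missing or all-zero key variable as a failure."""
    problems = []
    for name in KEY_MONETARY_VARS:
        if name not in h5:
            problems.append(f"{name}: missing from H5")
            continue
        values = h5[name]
        if any(not math.isfinite(v) for v in values):
            problems.append(f"{name}: NaN/Inf values")
        elif all(v == 0 for v in values):
            problems.append(f"{name}: all zeros")
    return problems


# An H5-like dict shaped like the May artifacts: retirement is simply absent.
h5_may = {"employment_income": [52_000.0, 0.0, 18_500.0]}
print("lenient:", lenient_check(h5_may))
print("strict: ", strict_check(h5_may))
```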

The deeper pattern: the March 2026 sanity tests were shaped to the specific failure of the March incident (employment_income == 0 at every record). The May regression presented as a different shape of the same underlying contract bug — a variable that should be present is silently absent from the H5, model fills with zero — and walked past every gate.

Two of the three checks would also have missed the late-Feb pre-#554 severe discontinuity: the aggregate poverty check tolerance is too wide to trip, and the KEY_MONETARY_VARS check only verifies the absence of NaN/Inf in social_security, not its magnitude. The upload validator was installed after the incident.

Follow-ups

Prioritized by how directly each would have caught the regressions documented here.

P0 — would have caught both regressions directly

Aggregate target check against SSA published totals. At sim-year 2024 (base year), total weighted social_security must be within ±10% of the SSA Annual Statistical Supplement aggregate (~$1.5T). Would have failed:
- On 2026-02-18 1.69.0 (stored $824B is 45% below target) → caught the severe pre-#554 window within hours of its start.
- On 2026-04-30 1.89.1 (recomputed $540B is 65% below target) → caught the May regression before any artifact was promoted.
Identity check at simulation level. Verify social_security == retirement + disability + survivors + dependents and analogous identities for other adds-defined aggregates. Drift > tolerance fails the build.
- Would have caught the pre-#554 severe discontinuity ($824B stored vs $1,430B from subcomponents), though the dollar discrepancy is large enough that the aggregate-target check above would have caught it first.
- Would not have caught the May regression on its own: the model's adds formula sums whatever subs are present, including retirement=0, so the identity holds by construction. Needs to be paired with the aggregate target check.
Demographic-slice smoke tests. Release-blocking checks for baseline senior SPM poverty, senior market income, weighted SS recipient count by age cohort. The 48% senior SPM rate that surfaced the May regression is the same number that would have failed a 5–25% senior poverty tolerance.
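A rough sketch of the first two P0 checks shows why they need to be paired. The function names, the 1% identity tolerance, and the lumping of the three non-retirement subcomponents are assumptions made here; the ±10% band and the ~$1.5T target are the ones proposed above.

```python
SSA_TARGET_B = 1500.0   # ~$1.5T, the approximate SSA aggregate cited above
TARGET_TOLERANCE = 0.10  # the proposed +/-10% band


def within_target(total_b: float) -> bool:
    """Aggregate target check: total must sit inside the tolerance band."""
    return abs(total_b - SSA_TARGET_B) <= TARGET_TOLERANCE * SSA_TARGET_B


def identity_holds(total_b: float, components_b: list, tolerance: float = 0.01) -> bool:
    """Identity check: total must match the sum of its components."""
    return abs(total_b - sum(components_b)) <= tolerance * max(abs(total_b), 1.0)


# February: stored aggregate of 824 against a subcomponent sum of 1,430.
feb_total, feb_components = 824.0, [1430.0]
# May: retirement is zero, the other three (lumped) sum to 540,
# and the total is recomputed from them, so the identity holds trivially.
may_components = [0.0, 540.0]
may_total = sum(may_components)

print("Feb target ok:", within_target(feb_total))
print("Feb identity ok:", identity_holds(feb_total, feb_components))
print("May target ok:", within_target(may_total))
print("May identity ok:", identity_holds(may_total, may_components))
```

Both checks fail in February, but in May only the target check fails, which is the reason the identity check alone is not sufficient.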

P1 — would catch the contract break upstream

Canonical-inputs manifest with cross-repo check. policyengine-us-data maintains an explicit list of variables it emits. CI fails on a policyengine-us lock bump if any listed variable has become formula-backed in the new model version, unless the variable is allowlisted or a replacement input is emitted.
is_input annotation in policyengine-us. Variables that are canonical inputs declare it explicitly. The exporter pivots off declared intent, not "has a formula." A variable that has both a formula and is marked as a canonical input is the legitimate case; the exporter still emits it and the model uses the formula only when the input is absent.
Reverse-direction CI. policyengine-us PRs that touch variables on the canonical-inputs manifest run the smoke tests (P0.1, P0.2, P0.3) against the latest policyengine-us-data artifact.
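The manifest check could look roughly like the following. This is one possible shape, not an existing tool: check_manifest and its parameters are hypothetical, and how the set of formula-backed variables is obtained from a given policyengine-us version is left out.

```python
def check_manifest(emitted, formula_backed, allowlist=frozenset(), replacements=None):
    """Return emitted variables that became formula-backed without cover.

    emitted: variable names the data package writes to the H5.
    formula_backed: names that carry a formula in the candidate model version.
    allowlist: names explicitly permitted to be both emitted and computed.
    replacements: optional map from a variable to the input that replaces it.
    """
    replacements = replacements or {}
    violations = []
    for name in sorted(emitted):
        if name not in formula_backed or name in allowlist:
            continue
        replacement = replacements.get(name)
        if replacement is not None and replacement in emitted:
            continue  # a replacement input is emitted, so nothing is lost
        violations.append(name)
    return violations


emitted = {"employment_income", "social_security_retirement"}
formula_backed_after_8040 = {"social_security", "social_security_retirement"}

violations = check_manifest(emitted, formula_backed_after_8040)
print(violations)
```

Run on a lock bump, a non-empty result would block the bump until the variable is allowlisted or its replacement input is emitted.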

P2 — durable hygiene

Release ledger. For each policyengine-us-data package release: source SHA, GitHub Actions run, Modal run ID, locked policyengine-us version, and whether artifacts were merely generated, staged, or promoted. Currently scattered across git, the Actions UI, Modal, and HF.
Public-API discipline. Public APIs and formulas use canonical model variables. Reported/data-only variables stay internal. Mostly done in #8263 and #960; codify as a checklist.
Hugging Face dataset card auto-update on revert. When a published artifact is found defective, programmatically update the HF dataset card so downstream consumers see the warning without polling our discussions.
Sweep tooling as a periodic check. The H5 sweep used to produce the plots above (one tag at a time, no local cache) is small and can run weekly. Anomalies in stored-vs-subcomponent agreement, or large jumps in the aggregate of any KEY_MONETARY_VARS variable between adjacent tags, become a release-blocking alarm.
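The adjacent-tag alarm is simple to express once the sweep has produced one weighted total per tag. In this sketch flag_jumps and the 10% threshold are assumptions; the two data points are the stored social_security/2024 totals for releases 1.68.0 and 1.69.0 quoted earlier.

```python
def flag_jumps(series, max_relative_change=0.10):
    """Flag adjacent releases whose weighted total moves more than the threshold.

    series: list of (tag, weighted_total_in_billions), ordered by release.
    Returns a list of (previous_tag, tag, relative_change) tuples.
    """
    alarms = []
    for (prev_tag, prev_total), (tag, total) in zip(series, series[1:]):
        change = (total - prev_total) / prev_total
        if abs(change) > max_relative_change:
            alarms.append((prev_tag, tag, change))
    return alarms


stored_ss_2024 = [("1.68.0", 1434.0), ("1.69.0", 824.0)]
alarms = flag_jumps(stored_ss_2024)
for prev_tag, tag, change in alarms:
    print(f"{prev_tag} -> {tag}: {change:+.1%}")
```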

References

#530 — top-coding bug that motivated the puf_impute.py extraction.
#551 — the base-year discontinuity bug report; magnitudes captured 5 days into the severe window. The "~3x" ratio in the summary applies to nonzero record counts; weighted-recipient and weighted-dollar ratios are ~2.0× and ~1.8× respectively.
#552 — first (closed-unmerged) attempt at fixing #551 via proportional rescale.
#554 — chosen fix: formula/adds/subtracts export pruning, direct subcomponent imputation.
#570 — March 2026 prevention measures (pre-upload validation, sanity test suite).
#578 — March 2026 employment-income postmortem.
#589 — QRF imputation of CPS-only variables on PUF clone half; closes the post-#554 source-level window.
policyengine-us#8040 — made social_security_retirement formula-backed.
policyengine-us#8263 — restored social_security_retirement as canonical input.
policyengine-us-data#960 — data-side fix and lock bump.
poverty-dashboard#2 — dashboard PR whose senior-poverty column surfaced the May regression.
First fixed Run Pipeline — 2026-05-12 17:02 UTC, commit 61a43e9.
Reproduction appendix, H5 sweep, results CSV, plots: https://gist.github.com/MaxGhenis/ddf033f52eb51e6082bade87564f7b83
