PreTrendsPower implementation audit (Roth 2022 — PR-B) by igerber · Pull Request #466 · igerber/diff-diff

igerber · 2026-05-19T00:41:33Z

Summary

Implements Roth (2022) Section II.A-B NIS box probability as the new primary pretest_form='nis' on PreTrendsPower, compute_pretrends_power, and compute_mdv; Wald noncentral-χ² form retained as pretest_form='wald' for shipped numerical compatibility (Propositions 1+3+4 still apply — Wald acceptance region is a convex ellipsoid).
Full Σ_22 routing on CS and SA event-study adapters: non-bootstrap CS / SA now consume event_study_vcov instead of diag(ses**2). SunAbrahamResults extended with event_study_vcov + event_study_vcov_index fields built via W @ vcov_cohort @ W.T (cohort-aggregation matrix); bootstrap and replicate-weight paths clear these to preserve analytical/bootstrap SE separation.
γ-unit MDV for violation_type='linear': _get_violation_weights('linear') now consumes actual pre-period relative-time labels and skips L2 normalization, so reported MDV equals Roth's γ exactly on irregular and anticipation-shifted grids.
Custom-violation persistence: new violation_weights field on PreTrendsPowerResults lifts the PR-A NotImplementedError guard for power_at(violation_type='custom') on fresh fits; the helper API (compute_pretrends_power, compute_mdv) now accepts violation_weights and pretest_form end-to-end.
Covariance-source provenance propagation to DiagnosticReport / BusinessReport: new covariance_source field on PreTrendsPowerResults records the actual extraction path at fit time; DiagnosticReport consumes it directly so the report layer no longer guesses provenance from result-type alone (fixed the silent downgrade where correctly-computed full-VCV CS/SA fits were being tagged as moderately_powered). Legacy fallback for results lacking the field preserves the PR-A conservative downgrade contract.
New tests/test_methodology_pretrends.py (8 classes, 33 tests including 3 R-parity stubs that skip without r_pretrends_golden.json — populated in PR-C). New benchmarks/R/generate_pretrends_golden.R script committed with placeholder commit reference (PR-C pins the audited pretrends revision and lands the JSON goldens).
METHODOLOGY_REVIEW.md PreTrendsPower row promoted: In Progress → Complete (R parity pending), Last Review: 2026-05-18. REGISTRY ## PreTrendsPower section rewritten with explicit NIS + Wald equation blocks and item-by-item Requirements checklist. REPORTING.md and docs/api/_autosummary/*.rst brought into alignment with the new surface.

Methodology references

Method name(s): PreTrendsPower (compute_pretrends_power, compute_mdv), with adapter routing through CallawaySantAnnaResults / SunAbrahamResults.event_study_vcov and MultiPeriodDiDResults.interaction_indices.
Paper / source link(s):
- Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. American Economic Review: Insights, 4(3), 305-322. DOI 10.1257/aeri.20210236.
- Sun, L. & Abraham, S. (2021) — event_study_vcov construction via the W-matrix aggregation W @ vcov_cohort @ W.T.
- Paper review on file: docs/methodology/papers/roth-2022-review.md (landed via PR PreTrendsPower: Roth (2022) paper review + silent-failure guard #463).
Any intentional deviations from the source (and why):
- Wald noncentral-χ² acceptance region retained as pretest_form='wald' (paper-supported alternative under Propositions 1+3+4); not the paper's primary analysis convention but stays available for shipped users.
- Linear-violation γ-unit convention applies only when relative_times is threaded through (via fit()); legacy callers that bypass fit() and pass only n_pre keep the previous count-based L2-normalized direction for backwards compatibility — explicitly documented in REGISTRY's ## PreTrendsPower Note.
- R pretrends package parity is deferred to PR-C (mirroring Bacon's PR BaconDecomposition methodology audit (Goodman-Bacon 2021) #454 → PR BaconDecomposition R parity goldens #457 cadence); generator script benchmarks/R/generate_pretrends_golden.R is committed in this PR with a <PR-C-PIN> placeholder commit reference. The 3-tier parity contract (NIS box probability at fixed γ; γ_p MDV at target 0.5 and 0.8; γ-unit MDV invariance) is documented in the script header and locked in tests/test_methodology_pretrends.py::TestPretrendsParityR (currently @pytest.mark.skipif against the missing JSON).

Validation

Tests added/updated:
- tests/test_methodology_pretrends.py (NEW, 924 lines, 8 classes / 33 tests covering Roth Section II.A-B paper-equation-numbered Verified Components: K=1 closed-form, NIS power vs MC simulation at K=2,3 with correlated Σ_22, MDV inversion round-trip, NIS-vs-Wald differentiation, linear-units γ-scale on irregular and anticipation-shifted grids, custom-weight persistence, CS/SA full-VCV adapter routing, helper API end-to-end, JSON-serializability of to_dict, R-parity stubs).
- tests/test_pretrends.py — bulk pin to pretest_form='wald' on existing tests so shipped numerical assertions are byte-identical (only 3 tests required content-level updates because most assertions are form-invariant).
- tests/test_diagnostic_report.py — added 3 new regression tests: persisted-full_pre_period_vcov-no-downgrade, legacy-missing-field-still-downgraded, persisted-label-takes-precedence-over-inference, and legacy-MPD-without-interaction_indices reports diag_fallback.
- tests/test_business_report.py — new test_full_vcov_path_no_downgrade_on_real_cs_fit with deterministic tier pins (ratio < 0.25 → well_powered, asserted unconditionally, plus positive prose contract on summary()).
Backtest / simulation / notebook evidence: NIS power validated against Monte Carlo simulation (50k draws) at K=2 diagonal and K=3 equicorrelated Σ_22 to within 0.01 power tolerance. SA event_study_vcov diagonal-entry sanity check verifies event_study_vcov[i,i] = se(e_i)² from _compute_iw_effects at atol=1e-10. Tutorial docs/tutorials/07_pretrends_power.ipynb is unchanged — PR-B does NOT re-render the notebook (a tutorial-rendering follow-up is tracked in TODO).
Pre-PR codex review: 6 local /ai-review-local --backend codex rounds (R1-R6) plus the R7 verdict ✅ Looks good with zero findings of any severity. See feedback_local_codex_review_before_pr_submission.md.

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

…helper API Implements Roth (2022) Section II.A-B NIS box probability as the new primary `pretest_form='nis'` default. Wald noncentral-χ² form retained as opt-in `pretest_form='wald'` for backwards-compat with shipped numerical baselines AND as a paper-supported alternative (Wald acceptance region is a convex ellipsoid, so Propositions 1+3+4 all apply). Changes: - `PreTrendsPower.__init__`: new `pretest_form: Literal['nis', 'wald']` parameter (default 'nis'); validated to one of the two enum values; threaded through `get_params()` / `set_params()`. - New private helpers `_compute_power_nis` + `_compute_mdv_nis`: * `_compute_power_nis` uses `scipy.stats.multivariate_normal.cdf` with `lower_limit=` for the centered-box rejection probability under H1: `δ_pre = M * weights`, `Y = β̂_pre - δ_pre ~ N(0, Σ_22)`, `power = 1 - P(Y_t ∈ [-z σ_t - δ_t, z σ_t - δ_t] for all t)`. Falls back to MC simulation (N=20000) when the analytical CDF returns NaN on degenerate Σ. * `_compute_mdv_nis` solves `power_nis(M) = target_power` via doubling expansion + `optimize.brentq` bisection; non-convergence cap at M_high=1000 returns `np.inf` (mirrors Wald path's existing 1000-cap). - Renamed existing `_compute_power` → `_compute_power_wald` and `_compute_mdv` → `_compute_mdv_wald`; the unsuffixed names are now dispatchers on `self.pretest_form`. Wald math is byte-identical. - `PreTrendsPowerResults` gains 3 new fields: * `pretest_form: Literal['nis', 'wald'] = 'wald'` — default 'wald' for backwards-compat with older serialized results. * `nis_box_probability: float = np.nan` — NIS-specific acceptance probability (always NaN for Wald fits, no ambiguity). * `violation_weights: Optional[np.ndarray]` — fitted weights persisted on the result, enabling `power_at()` to work for ALL violation types on fresh fits. - `fit()` populates all three new fields and dispatches. - `power_curve()` inherits dispatch through `_compute_power`. - `summary()` and `to_dict()` dispatch on `pretest_form` — NIS fits print "Box probability:" instead of "Non-centrality parameter:". - `PreTrendsPowerResults.power_at()` refactored: uses `self.violation_weights` directly when populated, falls back to reconstruction for old serialized results (with the PR-A NotImplementedError guard retained only for custom-fit serialized results with `violation_weights=None`). - `compute_pretrends_power` and `compute_mdv` helper signatures extended to accept `violation_weights` and `pretest_form`; helpers now forward both to the class. Closes the helper/class API gap from PR-A R18. Smoke-tested with K=2 and K=3 panels: - NIS power at M=0 with K=3 ≈ 0.138 (matches 1 - (1-α)^K = 0.143 for independent normals, with off-diagonal correlation pulling it down). - Wald power at M=0 with K=3 = 0.05 (exact size under H0). - NIS MDV(80%, K=3) = 0.59, Wald MDV(80%, K=3) = 0.71 (NIS is more powerful here because the rectangular acceptance region is tighter than the chi-squared ellipse along the linear-violation direction). Pre-existing pyright type-stub warnings on `optimize.brentq` and `stats.multivariate_normal.cdf` are not touched. Plan ref: /Users/igerber/.claude/plans/stateless-prancing-iverson.md Step 2 (NIS impl + dispatcher) + Step 5 (result-class field additions + power_at refactor) + Step 6 (helper API extension). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pre-PR-B default was implicitly Wald (noncentral-χ²); PR-B Step 2 flipped it to NIS (box probability). The vast majority of existing tests (64 of 66) assert form-invariant properties (positive, finite, monotone, hasattr, etc.) and pass under either default. Only 3 tests needed targeted fixes: - `TestPowerComputation::test_power_at_zero_equals_alpha`: pinned `pretest_form='wald'`. The size-at-null property "power(M=0) = alpha exactly" is a Wald-form property (noncentrality = 0 at H0 yields the chi-squared distribution evaluated at its critical value). Under NIS with K=3 independent normals, the joint rejection probability at H0 is 1 - (1 - alpha)^K ≈ 0.14, not 0.05. - `TestPreTrendsPowerResultsPowerAt::test_power_at_zero`: same pin for the same reason. - `TestPreTrendsPowerResults::test_power_at_raises_on_custom_violation_type`: inverted. The PR-A R18 silent-failure guard was lifted in PR-B Step 5 (violation_weights are now persisted on PreTrendsPowerResults, so the custom path works for fresh fits). Renamed to `test_power_at_works_for_custom_violation_type` and assert finite power in [0, 1]. Added a new companion test `test_power_at_raises_on_legacy_custom_result_without_weights` that simulates an old serialized result (violation_weights cleared to None) and confirms the backwards-compat NotImplementedError guard still fires for that case. Test count: 67 (was 66; net +1 from the legacy-guard companion test). All 67 pass. Adjacent suites (test_pretrends_event_study.py and the pretrends-tagged tests in test_diagnostic_report.py) also pass under the NIS default — 31 passed, 0 failed. This is much less test churn than the plan estimated (~101 bulk pins). The form-invariance of most existing assertions means the flip is substantially less disruptive than feared. Plan ref: Step 6 (test bulk pin convention; user-locked Decision 5 in plan mode). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…study_vcov Adds full event-study covariance matrix on SunAbrahamResults, enabling PreTrendsPower to consume Roth (2022) Σ_22 on the SA path instead of falling back to diag(ses^2). Before PR-B, the SA adapter in compute_pretrends_power was forced to diag because SunAbrahamResults did not expose any event-study-level covariance surface; PR-A flagged this as the SA branch of the diagonal-VCV deviation. Construction ------------ After _compute_iw_effects() returns event_study_effects + cohort_weights, we build the aggregation matrix W in fit() and compute event_study_vcov = W @ vcov_cohort @ W.T where W is the |event_times| × n_interactions sparse aggregation matrix: event_study_vcov_index = sorted(cohort_weights.keys()) W = np.zeros((n_event_times, n_interactions)) for i, e in enumerate(event_study_vcov_index): for g, w in cohort_weights[e].items(): if (g, e) in coef_index_map: W[i, coef_index_map[(g, e)]] = w This matches the existing per-event-time variance computation at sun_abraham.py:_compute_iw_effects (which already does weight_vec @ vcov_subset @ weight_vec per event time) but batched across all event times so the off-diagonals Cov(β̂_{e_i}, β̂_{e_k}) are also produced. Smoke-test verified diagonal[i, i] of event_study_vcov matches event_study_effects[e]['se']^2 at atol=1e-10 across all event times. Bootstrap / replicate clears ---------------------------- Mirrors the CS pattern at staggered.py:2032-2036. When bootstrap_results is not None OR _uses_replicate_sa is True, event_study_vcov and event_study_vcov_index are set to None before constructing the result. This prevents PreTrendsPower from silently mixing analytical VCV with bootstrap/replicate SE overrides downstream (which would produce mis-scaled MDV/power output). Regression ---------- - 39/39 tests/test_sun_abraham.py pass. - New fields default to None on the dataclass, so existing SunAbrahamResults consumers that don't read event_study_vcov see no change. Plan ref: Step 3 SA upstream surface extension (review CRITICAL #2 resolution with explicit W-matrix pseudo-code locked in plan mode). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the hard-coded ``vcov = np.diag(ses**2)`` fallback on both the CallawaySantAnnaResults and SunAbrahamResults branches of ``_extract_pre_period_params`` with a unified routing helper ``_extract_event_study_vcov_subblock`` that consumes the full event_study_vcov sub-block when available, falling back to diag otherwise. Helper logic ------------ - When ``results.event_study_vcov`` is not None AND ``results.event_study_vcov_index`` is not None, look up each filtered pre_period via ``.index()`` and extract the ``[np.ix_(indices, indices)]`` sub-block. - Defensive guard: if ``event_study_vcov_index`` is missing one of the pre-period labels, raise ValueError loudly rather than silently falling back to diag. - When the result type does not expose event_study_vcov, return ``np.diag(ses**2)`` (the legacy behavior preserved for bootstrap fits, replicate-weight survey fits, and any future result type). Impact on the three result types -------------------------------- - ``MultiPeriodDiDResults``: unchanged — already extracts a full sub-block via interaction_indices at lines 700-708. - ``CallawaySantAnnaResults``: non-bootstrap CS fits (event_study_vcov persisted at staggered_results.py:126-128) now consume the full Σ_22 instead of diag. Bootstrap CS fits (event_study_vcov cleared at staggered.py:2032-2036) keep falling through to diag. - ``SunAbrahamResults``: non-bootstrap SA fits (event_study_vcov built via W @ vcov_cohort @ W.T in the previous commit) now consume the full Σ_22 instead of diag. Bootstrap SA fits and replicate-weight survey fits (event_study_vcov cleared by the new PR-B Step 3 SA guard) keep falling through to diag. Regression ---------- - 67/67 tests/test_pretrends.py pass. - 27/27 tests/test_pretrends_event_study.py pass. - Total 94/94 across both suites. Plan ref: Step 3 CS+SA adapter routes (closes the Σ_22 fidelity gap documented in PR-A REGISTRY ## PreTrendsPower diagonal-VCV deviation Note for non-bootstrap CS + non-bootstrap SA paths). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nit MDV Threads actual pre-period relative-time labels through ``_get_violation_weights('linear')`` and ``_extract_pre_period_params`` so the reported MDV is in Roth's γ units on irregular and anticipation-shifted grids. Closes the PR-A REGISTRY ## PreTrendsPower "Note (deviation from paper — linear violation pattern)" deviation row for the canonical fit() path. Math ---- Pre-PR-B: `weights = [n_pre-1, ..., 1, 0] / ||·||_2` derived from `n_pre` count alone (ignored relative-time labels). Under irregular grids like {-5, -3, -1}, this treated the violation as if periods were {-3, -2, -1}. After L2 normalization, the reported MDV = γ · ||t||_2, not γ — wrong units. PR-B: when `relative_times` is provided AND `violation_type='linear'`, weights = |t| directly WITHOUT L2 normalization. Then δ_pre = M * |t| = γ · t_signed under δ_t = γ · t, so M = γ exactly. Reported MDV is in Roth's γ units (slope-per-period). Verified: - Regular grid [-3, -2, -1]: weights = [3, 2, 1] - Irregular grid [-5, -3, -1]: weights = [5, 3, 1] (irregular spacing reflected — previously would have been [2, 1, 0]/||·||_2) - Backwards-compat: callers that bypass fit() and pass only n_pre keep the legacy normalized [n_pre-1, ..., 0]/||·||_2 behavior (used by ~3 unit tests + any third-party direct-helper callers). Changes ------- - `_get_violation_weights(self, n_pre, relative_times=None)`: new optional kwarg. Linear path with `relative_times not None` uses `np.abs(relative_times)` directly + early-return (skip the normalize-at-end block). All other paths (constant, last_period, custom, linear-without-relative_times) unchanged — still L2-normalized. - `_extract_pre_period_params` return type expanded from 4-tuple to 5-tuple: now returns `(effects, ses, vcov, n_pre, relative_times)`. All three adapter branches (MultiPeriodDiD, CS, SA) populate `relative_times = np.asarray(sorted_pre_periods, dtype=float)` from their respective filtered pre-period list. - `fit()` and `power_curve()` consume the new 5-tuple and thread `relative_times` into `_get_violation_weights`. End-to-end smoke test: SA fit with regular K=3 grid + NIS pretest produces an MDV ~0.087 (Roth γ scale), confirming the unit conversion is wired correctly. Regression ---------- 94/94 tests/test_pretrends.py + tests/test_pretrends_event_study.py. The 3 tests pinned to pretest_form='wald' in the previous commit still hit the wald path and retain their exact numerical baseline; the wald path uses the legacy normalized weights internally (because fit() now threads relative_times for both forms, but the wald quadratic form is scale-invariant up to M's overall scale). Plan ref: Step 4 (review CRITICAL #1 resolution: skip L2 normalization for linear-with-relative_times, locked via plan-mode AskUserQuestion). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… flip + TODO + CHANGELOG + llms.txt Documents the four PR-A TODO rows that PR-B Steps 2-6 just resolved: - Σ_22 fidelity on CS/SA adapters (full event_study_vcov sub-block routing) - Helper API gap (compute_pretrends_power + compute_mdv accept violation_weights + pretest_form) - power_at(custom) silent-failure guard (PR-A R18 mitigation lifted on fresh fits via the new persisted violation_weights field) - Linear-units γ-scale (skip L2 norm for linear-with-relative_times) Step 8 — REGISTRY.md ## PreTrendsPower: - Wholesale replacement with NIS-framed entry. - Explicit equation blocks for both NIS box probability (primary, Roth 2022 Section II.A-B) and Wald noncentral-χ² (paper-supported alternative under Propositions 1+3+4). - Three updated Notes: * Wald-alternative paper-supported Note (NEW) * Linear-convention Note (replaces the PR-A deviation Note; γ-unit MDV with relative_times threaded through fit()) * Diagonal-VCV-fallback Note narrowed to bootstrap fits only (the non-bootstrap deviation is resolved by PR-B Step 3 CS/SA routing). - Backwards-compat addendum on power_at(custom) for legacy serialized results (replaces the PR-A silent-failure-guard Note). - Item-by-item Requirements checklist with PR-B-resolved checkboxes and a single deferred-to-PR-C item (R parity). - Removed the prior Wald-test headline equation block (now subsumed by the explicit dual-form equation section). Step 9 — METHODOLOGY_REVIEW.md flip: - PreTrendsPower row status: **In Progress** → **Complete (R parity pending)**. - Last Review: 2026-05-18. - Documentation-in-place + Verified Components (10 checkboxes) + narrowed Outstanding-for-promotion to a single R-parity-fixture bullet for PR-C. Step 10 — TODO.md cleanup: - Four of five PR-A PreTrendsPower rows removed (resolved in PR-B); pointer comment in place of the removed block. - R-package-pin row rewritten as a unified PR-C tracker: "PreTrendsPower R parity goldens (PR-C)" — covers pinning the commit, running the generator script, committing the JSON, activating TestPretrendsParityR, and flipping the tracker to fully Complete. Step 11 — CHANGELOG.md [Unreleased]: - Added: 6 new PreTrendsPower bullets covering NIS impl, CS/SA Σ_22 routing + SA upstream surface, result-class field additions, helper API extension, methodology test file forward-pointer, R generator script forward-pointer. - Changed: 2 new bullets covering default pretest_form flip (implicit-Wald → explicit-NIS, with shipped Wald baselines preserved via pretest_form='wald') and linear-violation γ-scale. - Fixed: NEW section with 1 bullet documenting the PR-A R18 silent- failure guard lift for power_at(custom) on fresh fits. llms.txt (agent-facing catalog): - PreTrendsPower one-line entry expanded to mention NIS as primary default, Wald as alternative, γ-unit MDV, and Σ_22 routing. Plan ref: Steps 8-12 (REGISTRY refresh + tracker flip + TODO cleanup + CHANGELOG + agent-facing catalog), per the locked plan at /Users/igerber/.claude/plans/stateless-prancing-iverson.md. Step 7 (methodology test file) and Step 12 (R generator script) ship in the next commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Roth (2022) Section II.A-B paper-equation-numbered methodology test file. Mirrors `tests/test_methodology_bacon.py`'s structure: 8 classes, 28 tests collected (23 active + 3 R-parity skip-as-expected + 2 @pytest.mark.slow deselected by default). Classes: - `TestPretrendsHandCalculation` (8 tests): z_{1-α/2} = 1.96 at α=0.05; NIS power at H0 with diag Σ matches `1 - (1-α)^K` (independent normals); Wald power at H0 matches exactly α (chi² size); NIS power matches Monte Carlo simulation at K=2 diag (atol=0.01); NIS power matches MC at K=3 with ρ=0.3 equicorrelation; MDV(target=0.8) round- trip — power(MDV) = 0.80; NIS power monotone in |M|; NIS MDV non- convergence cap returns np.inf. - `TestPretrendsPropositions` (2 @pytest.mark.slow tests): Proposition 1 (conditional mean) matches MC at atol=0.01; Proposition 4 (variance reduction under convex B_NIS) — conditional Var ≤ unconditional Var. - `TestPretrendsLinearGrid` (4 tests): regular grid [-3, -2, -1] → weights = [3, 2, 1]; irregular grid [-5, -3, -1] → weights = [5, 3, 1]; no L2 normalization (||weights||_2 ≠ 1); backwards-compat fallback produces the legacy [n-1, ..., 0] / ||·||_2 direction. - `TestPretrendsCustomWeightPersistence` (2 tests): custom weights persisted on PreTrendsPowerResults (L2-normalized); power_at(M) for custom matches fresh fit(M=M). - `TestPretrendsCovarianceSource` (3 tests): SunAbrahamResults event_study_vcov populated on non-bootstrap fits; diagonal matches per-event-time SEs at atol=1e-10; non-trivial off-diagonals (proves full sub-VCV path, not silent diag fallback). - `TestPretrendsHelperAPI` (3 tests): compute_pretrends_power and compute_mdv accept violation_weights for custom + pretest_form toggle. - `TestPretrendsNISvsWald` (3 tests): default pretest_form is 'nis'; Wald path preserves pre-PR-B output shape; NIS and Wald produce different power values under correlated Σ (constructed counter- example with ρ=0.6). - `TestPretrendsParityR` (3 tests, all skip when goldens missing): stubs for PR-C R `pretrends` package parity at atol=1e-6. Skip decorator checks for `benchmarks/data/r_pretrends_golden.json`. 26/26 collected tests pass (after deselecting the 2 @pytest.mark.slow Proposition simulation tests per the default pytest addopts). Plan ref: Step 7 (methodology test file). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…en.R R `pretrends` parity goldens generator script (PR-C deferred to land the JSON output). Mirrors the Bacon precedent (`generate_bacon_golden.R` ships in PR-B, JSON goldens deferred to PR-C following Bacon PR-C #457). Structure --------- Three fixtures matched to test_methodology_pretrends.py expectations: 1. `uniform_3_pre_periods_no_anticipation` — K=3 regular grid (t ∈ {-3, -2, -1}), never-treated control. Default-case parity baseline. 2. `irregular_pre_periods` — K=3 with relative_times = [-5, -3, -1]. Tests PR-B's γ-unit linear-pattern fix; pre-PR-B Python with normalized count-based weights would have silently reported MDV in non-γ units. R `slope_for_power()` always reports γ. 3. `anticipation_shifted` — K=4 with anticipation=1. Verifies the pre-period filtering logic in `_extract_pre_period_params`. Three-tier parity contract at atol=1e-6: 1. NIS box probability `P(β̂_pre ∈ B_NIS(Σ))` at fixed γ values on all 3 fixtures. 2. γ_p MDV (slope at target power 0.5 and 0.8) on regular and irregular grids. 3. γ-unit MDV invariance: Python's PR-B Step 4 "skip-L2-norm" path produces MDV in Roth's γ units exactly, matching R's `slope_for_power()` which also reports γ. PR-C TODO checklist (recorded at the bottom of the script for self-contained PR-C handoff): - Replace `<PR-C-PIN>` commit-hash placeholder with actual git SHA from https://github.com/jonathandroth/pretrends. - Replace the NA_real_ stubs in `extract_pretrends()` with actual `pretrends::pretrends()` / `slope_for_power()` calls (the package API surface is documented in the script header but not yet exercised — PR-C is when it gets installed and pinned). - Verify REGISTRY.md surface claims against the pinned revision. - Activate `tests/test_methodology_pretrends.py::TestPretrendsParityR` (currently skips via @pytest.mark.skipif when the JSON is missing). - Flip METHODOLOGY_REVIEW.md PreTrendsPower row to fully **Complete**. The script's R `pretrends` calls are stubbed in PR-B because the package is not installed on the audit machine; PR-C installs it, pins the audited commit, runs the script, captures the actual JSON output, and commits both the JSON and the updated R script with the real surface calls. Plan ref: Step 12 (R generator script + commit reference; goldens deferred to PR-C following the Bacon cadence). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Codex Round 1 verdict was ⛔ Blocker — 2 unmitigated P0 defects in the new default `pretest_form='nis'` path plus a P1 contract bug on the MPD branch and a P2 code-duplication issue. All four addressed. P0 #1 — `_compute_mdv_nis` low-target-power boundary bug: - Pre-fix: when `target_power ≤ NIS-size` (e.g., α=0.05, K=3 → null_size ≈ 0.143, request target=0.10), the bracketing loop saw `power(0) >= target` immediately, `brentq(0, 1)` raised ValueError on the non-bracketing bounds, and the except-fallback silently returned `M_high=1.0` instead of 0.0. - Post-fix: explicit short-circuit `if power_minus_target(0) >= 0: return 0.0` BEFORE the doubling loop. Regression test added at `TestPretrendsHandCalculation::test_mdv_nis_returns_zero_when_target_below_null_size`. P0 #2 — non-finite scipy MVN CDF propagation: - Pre-fix: `_compute_power_nis` and `PreTrendsPowerResults.power_at` only fell back to MC simulation on `ValueError` / `LinAlgError` exceptions; if scipy returned NaN directly (Genz internal cancellation on degenerate Σ), the NaN propagated through `np.clip` and into the MDV solver — silently producing a wrong-but-finite MDV via the brentq fallback path. - Post-fix: extracted module-level helper `_compute_nis_acceptance_prob` that does the analytical scipy CDF call AND falls back to MC on EITHER exception OR non-finite output. Both call sites (`_compute_power_nis` and `PreTrendsPowerResults.power_at`) now use the helper — eliminates duplication AND fixes the NaN-propagation hole. Regression test monkey-patches `scipy.stats.multivariate_normal.cdf` to return NaN and asserts MC fallback engages (`test_nis_power_handles_non_finite_cdf_via_mc_fallback`). P1 — MultiPeriodDiD raw period IDs treated as Roth relative times: - Pre-fix: `_extract_pre_period_params` MPD branch passed `np.asarray(estimated_pre_periods, dtype=float)` directly into `_get_violation_weights('linear')`. For the common MPD case `pre_periods=[0, 1, 2, 3], reference_period=4`, this produced linear weights `[0, 1, 2, 3]` (raw period IDs) instead of Roth-style `[4, 3, 2, 1]` (|t - reference|). Non-numeric period labels (string IDs, calendar dates) would have failed outright. - Post-fix: derive relative_times from `results.reference_period`: `[float(p) - float(ref) for p in estimated_pre_periods]`. When `reference_period` is None or non-numeric, fall back to legacy count-based path (returns `relative_times=None`, `_get_violation_weights` uses the normalized `[n_pre-1, ..., 0]/||·||_2` direction). Type signature of `_extract_pre_period_params` widened to `Tuple[..., Optional[np.ndarray]]`. Regression test at `TestPretrendsLinearGrid::test_mpd_calendar_period_ids_derive_relative_times_from_reference` verifies `pre_periods=[0,1,2,3], reference_period=4` → weights `[4, 3, 2, 1]` (and exercises the derivation math directly). P2 — code duplication of NIS box-probability logic: - Resolved structurally by the module-level helper extraction (above). - The two call sites are now thin wrappers; future contract changes to the box probability happen in one place. Regression ---------- - 23 active methodology tests pass (3 R-parity stubs still skip). - 67/67 test_pretrends.py + 27/27 test_pretrends_event_study.py unchanged. - Total 120/120 across the three suites (+ 3 expected R-parity skips). Plan ref: HARD GATE Step 13 Round 1. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R2 codex review findings on the R1-cleaned NIS implementation; all four items addressed in this single commit. **P0: NIS MDV cap-vs-target ambiguity** (pretrends.py _compute_mdv_nis) The pre-fix doubling loop exited on the `M_high < 1000` cap and immediately returned ∞ even when power(M_high) >= target — silently producing a wrong MDV=∞ result on finite-root cases. Codex's concrete counterexample: vcov=[[50000]] with target_power=0.8 has a finite root between M=512 and M=1024 (power(512)≈0.36, power(1024)≈0.997). Pre-fix the cap fired at M_high=1024 and returned ∞; brentq could have bracketed. Post-fix: evaluate power_minus_target(M_high) explicitly after the loop exits. Return ∞ only when power at the capped endpoint is still below target_power. Finite-root cases at the boundary now pass through to brentq. The genuine-unreachable case (vcov=[[1e8]], target=0.99) still returns ∞ as before. **P1: scipy version pin** (pyproject.toml) `scipy.stats.multivariate_normal.cdf(..., lower_limit=...)` — used by the new `_compute_nis_acceptance_prob` for the rectangular box probability — requires the `lower_limit` parameter introduced in scipy 1.10. Bump from `scipy>=1.7.0` to `scipy>=1.10` with an explanatory comment referencing the release-notes link. Without this bump callers on older scipy would hit a TypeError at the first NIS power call. **P1: pretest_form not propagated to PreTrendsPowerCurve** `PreTrendsPower.power_curve()` constructs a PreTrendsPowerCurve dataclass but did NOT pass through the form used to compute the grid — so a NIS fit's `result.power_curve(...)` returned a curve indistinguishable from a Wald curve at the dataclass surface. Fix: add a `pretest_form: Literal['nis', 'wald'] = 'wald'` field to the dataclass (default 'wald' for backwards-compat with old serialized curves); populate it from `self.pretest_form` in `power_curve()`; surface it on `to_dataframe()` as a new "pretest_form" column so downstream tooling that ingests the curve can disambiguate NIS vs Wald output. **P2: MPD relative-times regression test was manual arithmetic** The R1-fix added `test_mpd_calendar_period_ids_derive_relative_times_from_reference` but the test only checked Python's subtraction operator, never invoking the production `_extract_pre_period_params` MPD branch. Replace with an end-to-end test that constructs a real `MultiPeriodDiDResults` and calls the helper directly; assert that the returned `relative_times` is `[-4, -3, -2, -1]` for `pre_periods=[0,1,2,3]` + `reference_period=4`, and that the downstream `_get_violation_weights` produces `[4, 3, 2, 1]`. Also add a companion test that confirms the non-numeric-reference branch falls back to `relative_times=None` (preserves legacy direction). **R2 P0 regression test: finite-root MDV at doubling endpoint** New `test_mdv_nis_finite_root_at_doubling_endpoint` reproduces codex's concrete counterexample (vcov=[[50000]], target_power=0.8) and asserts the post-fix returns a finite MDV in (512, 1024) AND spot-checks that the brentq root achieves the target power within 1e-3. Locks the cap-vs-target contract against any future regression in the MDV solver. **Pyright-stub annotations** Coerce `z_alpha = float(stats.norm.ppf(...))` at both NIS call sites so the `_compute_nis_acceptance_prob(M, weights, vcov, z_alpha)` argument matches the helper's `z_alpha: float` signature; coerce the `_compute_power_nis` return tuple elements to floats explicitly; add a targeted `# type: ignore[arg-type]` on the `cov=vcov` kwarg in `stats.multivariate_normal.cdf` (scipy stub bug — `cov` typed `int`). These do not affect runtime; they preserve the new code's type-cleanliness against IDE-level Pyright. Tests: 122/122 pass (3 R-parity stubs skip; 2 slow tests deselected). SA upstream regression: 39/39 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R3 codex review flagged a cross-surface drift after R2 cleared: PR-B Step 3 routed CS / SA fits through the full Σ_22 sub-block at the estimator layer, but ``DiagnosticReport._infer_cov_source`` kept the pre-PR-B type-based inference. That returned ``"diag_fallback_available_full_vcov_unused"`` for any CS / SA fit with populated ``event_study_vcov`` — and ``_apply_diag_fallback_downgrade`` then conservatively downgraded ``well_powered`` to ``moderately_powered``. Net effect: correctly-computed full-VCV pre-trends power blocks were silently being downgraded across the entire DR / BR rendering surface. **P1 fix — provenance recorded at the estimator layer, consumed at the report layer:** - ``pretrends.py``: - ``_extract_event_study_vcov_subblock`` now returns ``(vcov, source)`` where ``source`` is ``"full_pre_period_vcov"`` when the full sub-block was actually used or ``"diag_fallback"`` when ``event_study_vcov`` was missing / cleared. - ``_extract_pre_period_params`` extended to a 6-tuple that includes the ``covariance_source`` label. MPD branch returns ``"full_pre_period_vcov"`` when ``interaction_indices`` is populated, ``"diag_fallback"`` otherwise; CS / SA branches forward the label from the sub-block helper. - ``PreTrendsPowerResults`` gains a new ``covariance_source: str`` field (default ``"unknown"`` for backwards-compat with old serialized results). ``fit()`` populates it from the extraction path; ``power_curve()`` discards it because the curve dataclass is independent of any one fit's provenance. - ``diagnostic_report.py``: - ``_check_pretrends_power`` and ``_format_precomputed_pretrends_power`` now prefer the persisted ``pp.covariance_source`` field and fall back to ``_infer_cov_source(source_fit)`` only when the field is missing or ``"unknown"`` (legacy results). - ``_infer_cov_source`` updated: CS / SA + populated ``event_study_vcov`` now correctly returns ``"full_pre_period_vcov"`` (since PR-B routes through the full matrix). The legacy ``"diag_fallback_available_full_vcov_unused"`` sentinel is no longer produced by any in-tree path. - ``_apply_diag_fallback_downgrade`` docstring updated to note that the rule is now effectively a no-op post-PR-B; the function is retained for backwards-compat with legacy serialized results carrying the old sentinel. This is the architectural fix the R3 codex reviewer called out: "stop inferring covariance provenance from result type alone — record it on ``PreTrendsPowerResults`` when the covariance is built". The result-type heuristic was a maintainability smell that drifted the moment the estimator-layer routing changed. **P3 fix — autosummary docs include new fields:** - ``docs/api/_autosummary/diff_diff.PreTrendsPowerResults.rst`` adds ``pretest_form``, ``nis_box_probability``, ``violation_weights``, ``covariance_source`` to the attribute autosummary so the published API page is complete after the default flip to NIS. - ``docs/api/_autosummary/diff_diff.PreTrendsPowerCurve.rst`` adds ``pretest_form``. - ``docs/api/pretrends.rst`` example code shows the NIS-default flow and demonstrates how ``pretest_form='wald'`` opts back into the pre-PR-B numerical output. **P3 fix — REPORTING.md aligned with the new estimator-layer reality:** - ``docs/methodology/REPORTING.md`` "Diagonal-covariance fallback for staggered-estimator power" note rewritten to describe the PR-B routing accurately. Non-bootstrap CS / SA fits now consume the full ``event_study_vcov`` sub-block; the PR-A conservative downgrade is effectively dead post-PR-B (still defended for legacy serialized results). Bootstrap and replicate-weight CS / SA, plus ImputationDiD / Stacked / EfficientDiD / TwoStageDiD, still fall through to diag because nothing better is available on those result types yet. **Tests:** - Rewrote ``test_precomputed_pretrends_power_downgrades_when_full_vcov_unused`` as ``test_precomputed_pretrends_power_full_vcov_yields_no_downgrade``: the same CS-shaped stub that used to be downgraded now correctly resolves to ``covariance_source='full_pre_period_vcov'`` and keeps the ``well_powered`` tier. - New ``test_precomputed_pretrends_power_consumes_persisted_cov_source`` explicitly constructs a ``PreTrendsPowerResults`` with the new field set and verifies the report adapter reads it directly (locks the architectural fix). - Updated the two MPD-branch tests + the SA sub-block test in ``test_methodology_pretrends.py`` to unpack the new 6-tuple and assert the expected ``covariance_source`` label. Tests: 583 pass across pretrends + DR + BR + SA + staggered. 4 skipped (R-parity stubs + 1 unrelated). 0 regressions. CHANGELOG ``Fixed`` entry added under ``[Unreleased]``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R4 codex review sharpened a backwards-compat hole left open by R3: **P1 — legacy missing-field fallback was upgrading silently** R3 changed ``DiagnosticReport._infer_cov_source`` so CS / SA fits with populated ``event_study_vcov`` returned ``"full_pre_period_vcov"`` on the (now-architected) assumption that the field's presence implies PR-B routing was used. That assumption is wrong for LEGACY ``PreTrendsPowerResults`` objects pre-PR-B: those results were computed from ``diag(ses**2)`` even though the source fit's ``event_study_vcov`` was populated (PR-A behavior). Without the persisted ``covariance_source`` label we cannot distinguish a pre-PR-B fit from a post-PR-B fit, so the legacy ambiguous case must keep the conservative downgrade. Fix: ``_infer_cov_source`` reinstates the ``"diag_fallback_available_full_vcov_unused"`` sentinel for legacy CS / SA results with populated ``event_study_vcov``. New PR-B fits set ``PreTrendsPowerResults.covariance_source`` directly and bypass this fallback entirely, so the sentinel only fires for legacy serialized results — preserving the PR-A downgrade contract for backwards-compat while keeping the new full-VCV no-downgrade contract intact for fresh fits. ``_apply_diag_fallback_downgrade`` docstring updated. Tests rewritten in ``tests/test_diagnostic_report.py``: - ``test_precomputed_pretrends_power_persisted_full_vcov_no_downgrade``: constructs a real ``PreTrendsPowerResults`` with the new field set to ``full_pre_period_vcov`` (the value PR-B records on fresh fits) and verifies no downgrade. - ``test_precomputed_pretrends_power_legacy_missing_field_still_downgraded`` (NEW): legacy ``_PPStub`` without the field exercises the fallback path; the report adapter MUST emit the conservative sentinel + downgrade to ``moderately_powered``. - ``test_precomputed_pretrends_power_consumes_persisted_cov_source`` (rewritten): the persisted ``full_pre_period_vcov`` label takes precedence over the legacy fallback (which would now produce the sentinel). Locks the architectural fix. **P3 — to_dict() / to_dataframe() drop violation_weights and covariance_source** The PR-B-added public fields are missing from ``PreTrendsPowerResults`` serialization surfaces. Any caller that round-trips through dict / dataframe loses the very provenance the reporting layer reads off the dataclass. Fix: include both fields in ``to_dict()`` (and therefore ``to_dataframe()`` via the existing single-source-of-truth path). **P3 — BR-level live regression for the new full-VCV no-downgrade path** The existing ``TestDiagFallbackDowngradeAppliedCentrally`` class only locked the downgrade fires on the diag fallback. The new full-VCV no-downgrade path was uncovered at the user-facing prose surface. New ``test_full_vcov_path_no_downgrade_on_real_cs_fit`` exercises the live CS path: when ``pretrends.py`` records ``covariance_source='full_pre_period_vcov'`` on a real fit and the headline tier is ``well_powered``, ``BusinessReport.full_report()`` must NOT contain the moderately-informative phrasing. Tests: 400 across pretrends + DR + BR pass. 4 skipped (R-parity stubs + 1 fixture skip). SA + CS regression: 185 pass (no change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R5 codex review on the R4 cleanup. Verdict was ✅ "Looks good" but with three actionable items (1 P2 + 2 P3); polishing all three to land a clean verdict per the HARD GATE rule "informational items only". **P2 — `to_dict()` was not actually JSON-serializable** R3 added the docstring "suitable for serialization or downstream transport" on ``PreTrendsPowerResults.to_dict()`` but the body kept emitting ``violation_weights`` as a raw ``np.ndarray``, so ``json.dumps(result.to_dict())`` raised ``TypeError``. Fix: coerce ``violation_weights`` to ``list[float]`` (or ``None``) inside ``to_dict``; the ndarray remains available on the dataclass attribute directly for callers that need it. New regression in ``tests/test_methodology_pretrends.py``: ``test_to_dict_is_json_serializable`` asserts both the type contract and an end-to-end ``json.dumps`` round-trip on a real fit, with ``allow_nan=True`` (NIS box probability returns floats; Wald noncentrality returns floats; both can be finite or NaN). **P3 — BR live regression was checking the wrong surface** R4's ``test_full_vcov_path_no_downgrade_on_real_cs_fit`` checked ``BusinessReport.full_report()`` for the absence of the moderately-informative phrasing. But ``business_report.py`` actually renders the well-powered phrasing on ``summary()`` (the in-sample prose surface), not ``full_report()`` — so the prior assertion silently passed on text that did not contain the well-powered prose in the first place. Fix: assert positively that ``summary()`` contains ``"well-powered"`` and lacks ``"moderately informative"``; retain the ``full_report()`` negative check as a secondary defensive assertion. **P3 — legacy MPD fallback was overstating provenance** R4's ``_infer_cov_source`` unconditionally returned ``"full_pre_period_vcov"`` for non-event-study types — which silently included ``MultiPeriodDiDResults`` fits where ``interaction_indices`` is ``None``. In that case ``pretrends.py:_extract_pre_period_params`` falls through to ``np.diag(ses**2)`` (a genuine fallback, not the "available but unused" sentinel), so the legacy-fallback label was wrong. Fix: ``_infer_cov_source`` special-cases ``MultiPeriodDiDResults``. When ``vcov`` and ``interaction_indices`` are both populated, returns ``"full_pre_period_vcov"``; otherwise returns ``"diag_fallback"``. The ``diag_fallback`` label does NOT trigger ``_apply_diag_fallback_downgrade`` (only the ``diag_fallback_available_full_vcov_unused`` sentinel does), so the tier is unchanged for the MPD legacy case — this fix is provenance accuracy, not tier behavior. New regression in ``tests/test_diagnostic_report.py``: ``test_precomputed_pretrends_power_legacy_mpd_without_interaction_indices_reports_diag`` constructs a legacy-shaped MPD stub without ``interaction_indices`` / ``vcov`` and asserts ``covariance_source == "diag_fallback"`` (was ``"full_pre_period_vcov"`` pre-fix) and that the tier stays at ``well_powered`` (no downgrade applies on ``diag_fallback``). Tests: 402 pass across pretrends + DR + BR. 4 skipped (R-parity stubs + 1 fixture skip). SA + CS regression: 185 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R6 codex verdict ✅ "Looks good" but with two P3 polish items. **P3 — `_infer_cov_source` docstring drifted from new MPD special-case** R5 added an explicit MPD branch to ``_infer_cov_source`` that returns ``"diag_fallback"`` when ``interaction_indices`` is absent, but the docstring's ``"full_pre_period_vcov"`` bullet still claimed all non-event-study types (including MPD) "always" expose full pre-period covariance. Fix: update the docstring so the ``"full_pre_period_vcov"`` bullet excludes MPD (with a forward pointer to the explicit MPD branch below), and the ``"diag_fallback"`` bullet enumerates the MPD-without- ``interaction_indices`` case. **P3 — BR no-downgrade live regression was conditionally bypassed** The R5-fixed ``test_full_vcov_path_no_downgrade_on_real_cs_fit`` gated the well-powered phrasing assertions on ``if block["tier"] == "well_powered"``, which silently skipped the key prose assertion if a future regression reintroduced the conservative downgrade (the test then passes trivially). Fix: pin the expected tier deterministically on the ``cs_fit`` fixture, which produces ``mdv/|att| ≈ 0.053`` (well under the ``0.25`` well_powered threshold) on ``seed=7`` + ``treatment_effect=1.5``. New assertions: - ``block["covariance_source"] == "full_pre_period_vcov"`` (asserted, not guarded) - ``block["mdv_share_of_att"] < 0.25`` (asserts the raw ratio is in the well_powered range so the no-downgrade assertion below is meaningful) - ``block["tier"] == "well_powered"`` (locks the no-downgrade contract — a regression reintroducing the downgrade would fail here, not silently bypass) The well-powered / moderately-informative prose contracts on ``summary()`` and ``full_report()`` are now also unconditionally asserted. Tests: 125 pass on the impacted classes (BR centralized-downgrade + all methodology + all DR). No regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T00:51:15Z

Overall Assessment

⚠️ Needs changes

Executive Summary

PreTrendsPower is the affected method, and the main unmitigated issue is methodological: the new linear γ-unit path still silently falls back to the legacy count-based direction for MultiPeriodDiDResults with non-numeric period labels.
That fallback is now codified in a new test, but it is not documented in REGISTRY.md, while REGISTRY.md and METHODOLOGY_REVIEW.md now say the γ-unit deviation is resolved.
The new full-Σ_22 routing for non-bootstrap CS/SA fits and the report-layer covariance provenance fix look internally consistent; I did not find a new NaN/inference anti-pattern in the changed statistical code.
The R parity deferral is properly treated as deferred work, but the committed generator is still a placeholder script.
The edited API page still documents the wrong custom-violation parameter name.

Methodology

P1 diff_diff/pretrends.py:L914-L938, tests/test_methodology_pretrends.py:L530-L564, docs/methodology/REGISTRY.md:L2820-L2820, METHODOLOGY_REVIEW.md:L1054-L1066: the new linear-weight implementation only produces Roth γ-units when it can coerce MPD labels to floats. For supported MPD inputs with non-numeric period labels, it silently returns to the old count-based normalized direction, but the registry/review now describe the γ-unit deviation as resolved. Impact: silently mis-scaled MDV/power output and an undocumented methodology deviation on a changed estimator path. Concrete fix: either derive relative times for datetime/Period labels and add an explicit override for arbitrary labels, or raise/warn on unsupported labels and document that exception in REGISTRY.md/METHODOLOGY_REVIEW.md instead of marking the γ-unit path complete.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

P3 benchmarks/R/generate_pretrends_golden.R:L105-L123, TODO.md:L97-L98: the R parity generator is still a placeholder (NA_real_ stubs and <PR-C-PIN>), but this is explicitly deferred and tracked. Impact: not a blocker for this PR because parity tests are skip-gated and the follow-up is recorded. Concrete fix: leave it for PR-C as tracked, or replace the stubs now if you want the generator to be runnable immediately.

Security

No findings.

Documentation/Tests

P3 docs/api/pretrends.rst:L135-L136: the edited API page still tells users to pass custom_delta, but the public API uses violation_weights. Impact: copy-pasted documentation for the custom path is wrong on the exact surface this PR changed. Concrete fix: rename the parameter to violation_weights and mention that both PreTrendsPower and the helper functions now accept it.
Review note: I could not execute pytest here because this environment is missing pytest, numpy, and scipy.

Path to Approval

Remove the silent MPD fallback as an unqualified “resolved” path: support real relative-time extraction for non-numeric time types where distances are well-defined, and otherwise warn or raise instead of silently reporting a Roth γ-unit MDV.
Add regression coverage for that contract: one MPD fixture with datetime/Period labels that preserves γ-units, and one unsupported-label fixture that warns/errors; then update REGISTRY.md and METHODOLOGY_REVIEW.md to match the implemented behavior exactly.

R8 CI codex caught a P1 my local R7 reviewer missed — exactly the `feedback_local_codex_vs_ci_codex_divergence.md` pattern. **P1 — MPD non-numeric labels silently fell back to count-based, undocumented as a deviation in REGISTRY** R3's MPD branch returned `relative_times=None` for non-numeric `reference_period` values (string period IDs, etc.), silently using the legacy count-based normalized direction — but the REGISTRY note described the γ-unit deviation as "resolved" without qualifying that exception. Two-part fix: 1. **Better coercion** for datetime-like labels: new module-level helper `_coerce_relative_times_from_reference` (`pretrends.py:92`) handles three regimes: - Numeric (`int` / `float` / `np.int64`) — direct `float()` - `pandas.Period` / `Timestamp` / `np.datetime64` — subtraction- based offset arithmetic (`.n` for Period, `.days` for Timedelta, fall through to `/ np.timedelta64(1, 'D')`) - Genuinely non-numeric (string period IDs, unranked categoricals) — emits an explicit `UserWarning` documenting that the reported MDV is NOT in Roth's γ units under this fallback, and recommends re-fitting with numeric labels. 2. **Documentation alignment**: REGISTRY `## PreTrendsPower` convention note and METHODOLOGY_REVIEW.md `## PreTrendsPower` Verified Components checklist both enumerate the supported label types (numeric + pandas.Period + Timestamp + datetime64) and explicitly call out the non-numeric warn-and-fallback behavior as a documented edge case (not a "resolved" deviation). **P3 — `docs/api/pretrends.rst` still referenced removed `custom_delta` parameter name** The custom-violation entry in the violation-types section used the parameter name `custom_delta`, but the actual API exposes `violation_weights` (both on `PreTrendsPower` and on the helper functions per PR-B Step 6). Fix: rename in docs and add a one-line note that both the class and the helpers accept the kwarg. **Tests** (`tests/test_methodology_pretrends.py::TestPretrendsLinearGrid`): - `test_mpd_non_numeric_reference_falls_back_to_legacy_weights` renamed to `..._warns_and_falls_back...` and now asserts the explicit `UserWarning` is emitted (mentioning "γ units"). - NEW `test_mpd_pandas_period_reference_yields_numeric_relative_times`: constructs a `MultiPeriodDiDResults` with `pd.Period('2019Q1..Q3')` pre-periods and `reference_period=pd.Period('2019Q4')`, asserts the derived `relative_times == [-3, -2, -1]` (quarters) and linear weights = `[3, 2, 1]` in γ units. Locks the Period-arithmetic path the codex specifically flagged. The P3 R-parity-script placeholder is deferred to PR-C per the existing TODO row (codex labeled it informational / non-blocker). Tests: 403 pass across pretrends + DR + BR. 4 skipped (R-parity stubs + 1 fixture skip). No regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T01:23:07Z

🔁 AI review rerun (requested by @igerber)

Head SHA: cfb3200d1b4d6b918c0aaabeded3e4be6c5603ab

Overall Assessment

✅ Looks good

Executive Summary

This re-review resolves the prior P1: PreTrendsPower now handles non-numeric MultiPeriodDiDResults period labels by warning and falling back explicitly, and the registry/review docs now describe that fallback consistently (diff_diff/pretrends.py:L92-L172, docs/methodology/REGISTRY.md:L2818-L2820, METHODOLOGY_REVIEW.md:L1059-L1067).
Affected methods are PreTrendsPower, compute_pretrends_power, compute_mdv, plus the CS/SA covariance adapters consumed by compute_pretrends_power.
I did not find any new P0/P1 issues in the changed estimator math, Σ_22 routing, variance handling, or report-layer provenance logic.
The NIS default, retained Wald alternative, custom-weight persistence, and non-bootstrap CS/SA full-Σ_22 routing are all aligned with the updated Methodology Registry.
R-package parity is still intentionally deferred and properly tracked in TODO.md; that is informational only.
I could not run the test suite in this environment because numpy/scipy are unavailable, so the verification here is source-level plus review of the added regression coverage.

Methodology

No findings. The changed methodology surfaces are PreTrendsPower, compute_pretrends_power, compute_mdv, and the CS/SA event-study covariance extraction path. The new NIS default and the retained Wald form match the updated registry contract, and the prior undocumented MPD-label fallback is now both explicit in code and documented (diff_diff/pretrends.py:L39-L89, diff_diff/pretrends.py:L175-L237, diff_diff/pretrends.py:L820-L1125, docs/methodology/REGISTRY.md:L2783-L2856).

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: R parity is still deferred; the committed generator and parity tests are intentionally placeholder/skip-gated, so this is not a blocker because it is tracked in TODO.md. Concrete fix: in PR-C, pin the R pretrends revision, replace the NA_real_ stubs, commit benchmarks/data/r_pretrends_golden.json, and activate the parity assertions (benchmarks/R/generate_pretrends_golden.R:L1-L24, benchmarks/R/generate_pretrends_golden.R:L107-L123, benchmarks/R/generate_pretrends_golden.R:L189-L223, TODO.md:L97-L98).

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the release/reporting docs overstate the legacy covariance-provenance behavior. They say the old sentinel path is gone / that legacy inference now labels CS/SA + event_study_vcov as full_pre_period_vcov, but the implementation intentionally still emits diag_fallback_available_full_vcov_unused for legacy precomputed results without persisted provenance. The code and tests are correct; the docs are the inconsistent piece. Concrete fix: reword the docs to distinguish “new fits with persisted covariance_source” from “legacy serialized results using the conservative sentinel fallback” (CHANGELOG.md:L39-L39, docs/methodology/REPORTING.md:L324-L339, diff_diff/diagnostic_report.py:L1535-L1544, tests/test_diagnostic_report.py:L434-L479).
Note: I could not execute pytest here because this environment is missing runtime dependencies (numpy/scipy); static AST parsing of the touched Python files succeeded.

R9 verdict ✅ Looks good with one actionable P3: CHANGELOG.md L39 and REPORTING.md L324-342 overstated the legacy covariance-source path. Both said the `"diag_fallback_available_full_vcov_unused"` sentinel is gone / that legacy inference labels CS / SA + `event_study_vcov` as `"full_pre_period_vcov"`, but the R4 fix RESTORED the conservative sentinel for legacy precomputed results that lack the persisted `covariance_source` field — because without it we cannot distinguish a pre-PR-B fit (used diag) from a post-PR-B fit (used full Σ_22). Code and tests are correct; docs were the inconsistent piece. Fix: reword both surfaces to distinguish two paths explicitly: - **New fits** (post-PR-B): persisted `covariance_source` is read directly, non-bootstrap CS / SA report `"full_pre_period_vcov"` and are NOT downgraded. - **Legacy serialized results** (pre-PR-B, no field): legacy type-based inference still emits the conservative sentinel for CS / SA + populated `event_study_vcov`, and the `well_powered → moderately_powered` downgrade still applies. For legacy `MultiPeriodDiDResults` without `interaction_indices`, the fallback reports `"diag_fallback"` (genuine fallback, no downgrade). CHANGELOG entry expanded to list all four covariance-source DR regression tests by name; REPORTING.md "Pre-period covariance routing for staggered-estimator power" Note rewritten with the two-path structure. R-parity P3 deferred to PR-C per the existing TODO row (codex labeled it informational / non-blocker). No source changes; 309 tests across pretrends + DR + BR continue to pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T10:31:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 02b74a8ba4cd43d11458259f4f7bd88a15ed9c62

Overall Assessment

✅ Looks good

Executive Summary

The prior re-review item on covariance-provenance reporting appears resolved: DiagnosticReport / BusinessReport now prefer persisted covariance_source, and the legacy fallback is documented consistently (diff_diff/diagnostic_report.py:L1442-L1497, docs/methodology/REPORTING.md:L324-L350, tests/test_diagnostic_report.py:L430-L480, tests/test_business_report.py:L2423-L2498).
I did not find any unmitigated P0/P1 issues in the changed estimator math, Σ_22 routing, MDV/power dispatch, or report-layer provenance handling.
The methodology-affecting surfaces are PreTrendsPower, compute_pretrends_power, compute_mdv, _extract_pre_period_params, and the CS/SA covariance adapters.
One P3 documentation/tests issue remains: several touched docstrings/changelog lines still describe pre-PR-B behavior.
R-package parity is still deferred to PR-C and is properly tracked in TODO.md.
I could not run pytest here; this review is source-level plus AST parsing of the touched Python files.

Methodology

No findings. The changed methodology surfaces align with the updated registry and the cited Roth paper: NIS is now the primary box-pretest form, detectable-slope reporting is in the paper’s γ_p framing, and the retained Wald branch is explicitly treated in the registry as the documented convex-region alternative rather than an undocumented deviation (diff_diff/pretrends.py:L39-L89, diff_diff/pretrends.py:L905-L1125, diff_diff/pretrends.py:L1209-L1504, diff_diff/pretrends.py:L1664-L1775, diff_diff/sun_abraham.py:L94-L104, diff_diff/sun_abraham.py:L782-L810, diff_diff/sun_abraham.py:L948-L955, docs/methodology/REGISTRY.md:L2783-L2854). citeturn3view0turn3view2

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: R pretrends parity is still deferred, so the R-parity claims remain provisional until the goldens are committed; this is already tracked in project debt and skip-gated in the new tests, so it is informational rather than blocking. Concrete fix: in PR-C, pin the audited pretrends revision, commit benchmarks/data/r_pretrends_golden.json, and enable TestPretrendsParityR (TODO.md:L97-L98, benchmarks/R/generate_pretrends_golden.R:L1-L22, tests/test_methodology_pretrends.py:L920-L940).

Security

No findings.

Documentation/Tests

Severity: P3. Impact: Several touched explanatory texts still lag the shipped PR-B contract: helper tests at tests/test_pretrends.py:L610-L635 still describe custom weights as unsupported from helper APIs, PreTrendsPowerResults.violation_weights is documented as always “normalized” even though fresh linear fits intentionally persist unnormalized |t|, and the changelog still says the methodology test/script are “coming in the next commit” plus describes _extract_pre_period_params as a 5-tuple instead of the current 6-tuple (diff_diff/pretrends.py:L289-L294, CHANGELOG.md:L16-L17, CHANGELOG.md:L29-L30). Concrete fix: update those docstrings/changelog lines to reflect the current helper API, the linear-weight persistence contract, and the extra covariance_source return value.

Verification note: I could not execute the numerical test suite in this environment because the project test dependencies are unavailable; I did verify AST parsing on the touched Python files.

R10 verdict ✅ Looks good with one P3 actionable: four stale explanatory texts still describe pre-PR-B / pre-Step-6 behavior. **1. `PreTrendsPowerResults.violation_weights` docstring** (`pretrends.py:289-294`): said weights are "normalized" without qualification, but PR-B Step 4's linear-with-`relative_times` path intentionally persists the **unnormalized** `|t|` direction so that `δ_pre = M · |t|` and the reported MDV equals Roth's γ exactly. Reworded to enumerate the normalization regime per violation_type: constant / last_period / custom / linear-without-relative_times are L2-normalized; linear-with-relative_times is unnormalized γ-unit. **2. `tests/test_pretrends.py` helper-rejection tests** (`test_compute_pretrends_power_rejects_custom_violation_type` and `test_compute_mdv_rejects_custom_violation_type`, lines 610-635): docstrings said the helper "does not accept ``violation_weights``, so the custom path is unusable from the helper" — but PR-B Step 6 added the kwarg to both helpers. Reworded to scope the rejection contract correctly: it's the unsupplied-`violation_weights` case (loud-fail) that still raises, not the entire custom path. **3. CHANGELOG.md (Added) "Coming in the next commit"** (lines 16-17): both bullets said the test file and R script were "coming in the next commit". Removed — they were committed in PR-B Step 7 (test file) and Step 12 (R script) respectively. **4. CHANGELOG.md (Changed) "5-tuple"** (line 30): described `_extract_pre_period_params` as widening from a 4-tuple to a 5-tuple, but the R3 covariance-source-propagation fix later widened it to a 6-tuple `(effects, ses, vcov, n_pre, relative_times, covariance_source)`. Updated the bullet to describe the current 6-tuple shape and added a forward reference to the MPD pandas.Period / Timestamp / datetime64 coercion path (R8 fix) and the covariance_source provenance contract (R3 fix). The R-parity P3 remains deferred to PR-C per the existing TODO row (codex labeled it informational / non-blocker). 97 pretrends tests + DR + BR tests pass; no source-logic changes in this commit (docs + comments only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T10:43:26Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b00782e3a1ae0e79846742725195b15c9668a1e8

Overall Assessment
✅ Looks good

Executive Summary

The prior re-review items on covariance provenance and custom-weight persistence appear resolved: PreTrendsPowerResults now carries covariance_source, pretest_form, and violation_weights, and DiagnosticReport / BusinessReport consume the persisted provenance instead of re-inferring it from result type alone.
I did not find an unmitigated P0/P1 in the touched methodology surfaces: diff_diff/pretrends.py, diff_diff/sun_abraham.py, and diff_diff/diagnostic_report.py.
The NIS implementation, full-Σ_22 CS/SA routing, γ-unit linear-weight change, and legacy downgrade behavior all line up with the updated methodology registry.
One P3 docs/tests issue remains: some touched wording still says pretest_form='wald' reproduces pre-PR-B output, which is no longer true for fitted linear results after the new relative_times weight construction.
R pretrends parity is still deferred, but it is explicitly tracked in TODO.md and the parity tests are skip-gated, so it is informational rather than blocking.
Verification note: this was a source-level review only. I could not execute the touched tests because this environment lacks pytest and the runtime deps (numpy, scipy, pandas).

Methodology
No findings. The changed estimator math and assumptions in diff_diff/pretrends.py / diff_diff/sun_abraham.py match the updated docs/methodology/REGISTRY.md contract for NIS, Wald-as-documented-alternative, full-Σ_22 routing, and γ-unit linear MDV.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt

Severity: P3. Impact: R pretrends parity is still provisional because the committed generator is intentionally a PR-C placeholder and TestPretrendsParityR remains skip-gated. This is already tracked and therefore non-blocking. References: TODO.md:L97-L97, METHODOLOGY_REVIEW.md:L1050-L1072, benchmarks/R/generate_pretrends_golden.R:L1-L223, tests/test_methodology_pretrends.py:L920-L989. Concrete fix: In PR-C, pin the audited R revision, replace the NA_real_ stubs, commit benchmarks/data/r_pretrends_golden.json, and unskip the parity tests.

Security
No findings.

Documentation/Tests

Severity: P3. Impact: The backward-compat wording still overstates what pretest_form='wald' preserves. CHANGELOG.md:L29-L30 and the comments/tests at tests/test_methodology_pretrends.py:L840-L875 say the Wald path reproduces pre-PR-B output, but the fitted linear path now intentionally uses actual relative_times and unnormalized |t| weights at diff_diff/pretrends.py:L867-L891. The authoritative registry already describes that changed weight contract at docs/methodology/REGISTRY.md:L2818-L2820, so the changelog/test wording is now the inconsistent part. Concrete fix: Reword those lines to say Wald preserves the acceptance-region form, and only preserves the old numerics on the legacy relative_times=None path if you want to keep a backwards-compat statement at all.

R11 verdict ✅ with one P3 actionable: the Wald backward-compat wording was overstated. CHANGELOG.md:L29-L30 and the methodology test docstrings at L840-L875 said `pretest_form='wald'` preserves pre-PR-B output bit-identically, but PR-B Step 4's `relative_times` threading changed the linear-weight contract for BOTH NIS and Wald paths — a Wald fit on an irregular grid `{-5, -3, -1}` now produces γ-unit MDV (not the pre-PR-B count-based L2-normalized MDV). Pre-PR-B Wald numerics are bit-identical to post-PR-B Wald output only on (a) the legacy `relative_times=None` callable path (`_get_violation_weights(n_pre)` invoked directly without threading), and (b) regular-grid fits where `|t| ∝ [n_pre-1, ..., 0]`. Fix: - **CHANGELOG.md (Changed) L29**: reword to scope the Wald backwards-compat claim to the **acceptance-region form** (the noncentral-χ² on δ' Σ_22^{-1} δ, unchanged), and add an explicit caveat that the linear-weight contract changed independently in Step 4. Bit-identity holds only on the legacy callable path and the regular-grid case. Removed the stale "tutorial 07 to be re-rendered TBD" qualifier; tracked as a follow-up. - **`tests/test_compute_pretrends_power_accepts_pretest_form_wald`** docstring: clarify that the `'wald'` selector picks the acceptance-region form, not bit-identical numerical output on fitted results. - **`test_wald_path_preserves_pre_pr_b_output`** renamed to `test_wald_path_preserves_pre_pr_b_acceptance_region_form`; docstring expanded with the two-regime backwards-compat scope (legacy callable path: bit-identical; new fit() path: γ-unit MDV on irregular grids). Added a positive NaN-check on `nis_box_probability` to lock the Wald-only field contract. The R-parity P3 remains deferred to PR-C per the existing TODO row (codex labeled it informational / non-blocker). 30 methodology + 100 baseline pretrends tests pass; no source-logic changes (docs / docstrings only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T10:53:33Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 84e94d9ca793d507ae47b721021756c8f8f790e0

Overall Assessment
⚠️ Needs changes

Executive Summary

The estimator-side methodology changes in diff_diff/pretrends.py and diff_diff/sun_abraham.py mostly line up with the updated registry: NIS is now the default, CS/SA now route through full Σ_22 when available, custom weights are persisted, and covariance provenance is propagated correctly.
The prior P3 wording issue is mostly addressed, but one API example still overstates Wald backward compatibility.
Severity P1 [Newly identified]: after PR-B changed linear MDV to Roth’s γ units, downstream heuristics still treat raw mdv as a level-scale effect. DiagnosticReport/BusinessReport now divide a slope by a level ATT and use it to tier studies as well/moderately/underpowered.
R parity remains intentionally deferred and tracked in TODO.md; that part is non-blocking.
This was a source review only. I did not execute the touched tests here because this workspace does not have the runtime deps (numpy, scipy) installed.

Methodology

Severity: P1 [Newly identified]. Impact: PreTrendsPower now explicitly returns linear MDV in Roth’s γ units when relative_times are available (diff_diff/pretrends.py:L842-L855, docs/methodology/REGISTRY.md:L2818-L2820), but downstream code still compares raw mdv to coefficient-scale quantities. is_informative uses mdv < 2 * max(pre_period_ses) (diff_diff/pretrends.py:L335-L345), sensitivity_to_honest_did() reports mdv / max(SE) (diff_diff/pretrends.py:L1623-L1660), and the report path hard-codes violation_type="linear" then computes mdv_share_of_att = mdv / abs(att) for tiering (diff_diff/diagnostic_report.py:L1416-L1458, docs/methodology/REPORTING.md:L284-L317, user-facing prose at diff_diff/business_report.py:L2158-L2163). That now mixes a slope parameter with level effects/SEs, so the PR can silently over- or under-classify the same fit as well_powered/moderately_powered/underpowered; the new regression in tests/test_business_report.py:L2428-L2475 currently pins that unit-mismatched ratio. Concrete fix: derive a coefficient-scale summary from the fitted violation vector, e.g. max_abs_pre_violation = mdv * max(abs(violation_weights)), and use that instead of raw mdv everywhere the code compares against ATT or pre-period SEs (mdv_share_of_att, is_informative, mdv_in_ses). If the project prefers a different scalar summary, document it explicitly, but stop comparing raw γ to level-scale quantities.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt

Severity: P3. Impact: R pretrends parity is still intentionally deferred. The new generator still contains <PR-C-PIN> and NA_real_ stubs (benchmarks/R/generate_pretrends_golden.R:L1-L40, benchmarks/R/generate_pretrends_golden.R:L105-L124, benchmarks/R/generate_pretrends_golden.R:L185-L223), and the tracker still marks PreTrendsPower as “Complete (R parity pending)” (TODO.md:L97, METHODOLOGY_REVIEW.md:L83, METHODOLOGY_REVIEW.md:L1050-L1072). This is already tracked, so it is non-blocking. Concrete fix: land PR-C as planned: pin the audited R revision, generate benchmarks/data/r_pretrends_golden.json, and unskip TestPretrendsParityR.

Security
No findings.

Documentation/Tests

Severity: P3. Impact: the API docs still say users can “Opt back into the pre-PR-B Wald … form for backwards-compatible numerical output” (docs/api/pretrends.rst:L66-L69). That is still too broad after the γ-unit linear-weight change: fitted irregular-grid linear results can differ from pre-PR-B Wald numerics, as the registry now explains (docs/methodology/REGISTRY.md:L2818-L2820). Concrete fix: reword that example to say Wald preserves the acceptance-region form, and only preserves legacy numerics on the legacy relative_times=None path or regular-grid cases.

Path to Approval

Replace the report/informativeness heuristics that use raw mdv (mdv_share_of_att, is_informative, mdv_in_ses) with a coefficient-scale quantity derived from the fitted violation vector, and add a regression on an irregular-grid linear fit so the ratio/tier cannot regress back to raw γ.
Re-pin the new report-layer regression tests to the corrected scalar; tests/test_business_report.py:L2428-L2475 currently locks in the wrong mdv / |att| contract.
Fix the remaining Wald backward-compat wording in docs/api/pretrends.rst.

R12 CI codex caught a real unit-mismatch bug the doc-rewording trail uncovered. After PR-B Step 4 made linear `mdv` report Roth's γ units (a slope on relative time), downstream code still divided it by level-scale quantities — `DiagnosticReport.pretrends_power` computed `mdv_share_of_att = mdv / abs(att)`, `is_informative` checked `mdv < 2 * max(pre_period_ses)`, and `sensitivity_to_honest_did` reported `mdv_in_ses = mdv / max_pre_se`. On an irregular grid `[-5, -3, -1]`, the level-scale max pre-period violation under the MDV is `mdv * 5`, but the raw `mdv` (γ) is what was being compared to `|att|` / SE — silently mis-tiering the same fit. This is the holistic-fix-on-repeated-doc-findings pattern: each prior round (R5/R9/R10/R11) the codex flagged Wald backwards-compat wording as "overstated" because the LINEAR-WEIGHT change bled into surfaces the doc claimed were unaffected — but the wording wasn't wrong, the IMPLEMENTATION was incomplete. **Holistic fix — new level-scale property `max_abs_pre_violation`** `PreTrendsPowerResults.max_abs_pre_violation` returns `mdv * max(|violation_weights|)` — the largest level-scale pre-period deviation under the MDV. This is the right unit-consistent scalar for comparison against `|att|`, per-period SEs, and HonestDiD's M. - For `linear` with `relative_times=[-T, ..., -1]`, weights = `|t|`, so `max_abs_pre_violation = mdv * T_max`. - For `constant` with normalized `[1, ..., 1]/√K`, weights ~ 1/√K, so `max_abs_pre_violation = mdv / √K`. - For `last_period` with `[0, ..., 0, 1]`, weights have max=1, so `max_abs_pre_violation = mdv`. - For `custom`, depends on user-supplied vector. - Legacy serialized results without `violation_weights` fall back to raw `mdv` (pre-PR-B count-based L2-normalized linear was already roughly level-scale, so the fallback gives the right magnitude). **Wired through 3 surfaces:** 1. `PreTrendsPowerResults.is_informative`: uses `max_abs_pre_violation < 2 * max_se` instead of `mdv < 2 * max_se`. 2. `PreTrendsPower.sensitivity_to_honest_did`: reports `mdv_in_ses = max_abs_pre_violation / max_pre_se`, and surfaces `max_abs_pre_violation` as a new dict key. 3. `DiagnosticReport._check_pretrends_power` and `_format_precomputed_pretrends_power`: `mdv_share_of_att = max_abs_pre_violation / abs(att)`. Schema also surfaces the new `max_abs_pre_violation` field. On `cs_fit` (`base_period='universal'`, seed=7, treatment_effect=1.5): - pre_periods = `[-4, -3, -2]`, `max(|t|) = 4` - `mdv` (γ) = 0.0937, `max_abs_pre_violation` = 0.375, `|att|` = 1.779 - pre-fix `mdv / |att|` = 0.053 (slope/level mismatch) - post-fix `max_abs_pre_violation / |att|` = 0.211 (level/level) - Tier: still `well_powered` (0.211 < 0.25), now interpretable. **Documentation updated** to match: - `docs/methodology/REPORTING.md` "Power-aware phrasing" Note: ratio definition changed from `mdv / abs(att)` to `max_abs_pre_violation / abs(att)`; rationale added inline. - `docs/api/pretrends.rst` Wald example: rewrote the "backwards- compatible numerical output" wording to scope bit-identity to regular grids / legacy `relative_times=None` path. PR-B Step 4's `relative_times` threading applies to BOTH NIS and Wald, so a Wald fit on an irregular grid also produces γ-unit MDV — not pre-PR-B numerics. - `CHANGELOG.md` (Fixed): new entry documenting the fix and the numerical-output change for downstream consumers. **New regressions:** - `test_max_abs_pre_violation_uses_weight_scale_on_irregular_grid`: constructs an irregular-grid `[-5, -3, -1]` fit and asserts `max_abs_pre_violation = mdv * 5`, with a guard that the value is > 4x the raw mdv to prevent any future revert. - `test_is_informative_uses_level_scale_not_raw_gamma`: constructs a fit where raw `mdv < 2*SE` (would say "informative") but `max_abs_pre_violation > 2*SE` (says "not informative"); asserts the level-scale check wins. - Updated `test_full_vcov_path_no_downgrade_on_real_cs_fit` (BR level): now pins `0.35 < block["max_abs_pre_violation"] < 0.40` and asserts `mdv_share_of_att < 0.25` against the new level-scale definition. 590 tests pass across pretrends + DR + BR + SA + staggered. 4 skipped (R-parity stubs + 1 fixture skip). No regressions outside the targeted ratio definition change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T11:09:34Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b053faac637405b59b2886e74a6b114b14a8a0bd

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior rerun P1 on unit-mismatched linear MDV tiering looks resolved across the estimator, report layer, docs, and regressions (diff_diff/pretrends.py:L335-L385, diff_diff/pretrends.py:L1663-L1714, diff_diff/diagnostic_report.py:L1428-L1527, docs/methodology/REPORTING.md:L301-L309).
The PR’s main methodology changes otherwise line up with the updated registry: NIS is the default, CS/SA now route through full Σ_22 when available, custom weights are persisted, and covariance provenance is carried through to reporting (diff_diff/pretrends.py:L1116-L1171, diff_diff/sun_abraham.py:L782-L810, diff_diff/diagnostic_report.py:L1451-L1614).
Severity: P1 [Newly identified]. violation_type="constant" is now documented as δ_t = c / δ_t = M in the public methodology/docs, but the implementation still L2-normalizes the constant direction. That means each pre-period shift is actually M / √K, not M, so constant-pattern power/MDV remain on a different scale than the PR now claims.
R parity is still intentionally deferred and tracked; the generator and parity tests are PR-C stubs, which is non-blocking because the TODO tracking is explicit (TODO.md:L97, benchmarks/R/generate_pretrends_golden.R:L16-L24, tests/test_methodology_pretrends.py:L1057-L1126).
I did not execute the touched tests here; this workspace is missing runtime deps (numpy, scipy).

Methodology

Severity: P1 [Newly identified]. Impact: the PR now states that the constant violation is a per-period level shift, δ_t = c / δ_t = M, in docs/methodology/REGISTRY.md:L2812-L2816 and docs/api/pretrends.rst:L132-L135, but the implementation still normalizes constant weights to unit norm in diff_diff/pretrends.py:L932-L947, and power_at() preserves the same normalized contract in diff_diff/pretrends.py:L562-L579. The in-code docstring even reflects the normalized interpretation (diff_diff/pretrends.py:L370-L373). So for K pre-periods, the library currently evaluates δ_t = M / √K, not δ_t = M. That is an undocumented methodology mismatch on a public estimator surface. Concrete fix: either stop normalizing the constant direction so M truly is the per-period level shift, or add an explicit REGISTRY/API note that M is the magnitude along the normalized constant direction and update the public wording away from δ_t = M. Add an end-to-end regression that locks the chosen constant-pattern contract.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: R-package parity remains deferred. The committed generator still contains <PR-C-PIN> / NA_real_ placeholders (benchmarks/R/generate_pretrends_golden.R:L16-L24, benchmarks/R/generate_pretrends_golden.R:L105-L123, benchmarks/R/generate_pretrends_golden.R:L189-L223), and the parity tests are still skip-gated stubs (tests/test_methodology_pretrends.py:L1057-L1126). This is already tracked in TODO.md:L97 and METHODOLOGY_REVIEW.md:L1050-L1072, so it is non-blocking. Concrete fix: land PR-C as planned.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the reporting math was corrected to use max_abs_pre_violation / |ATT|, but BusinessReport still labels that number as MDV / |ATT| in the rendered report (diff_diff/business_report.py:L2469-L2474), while the new definition lives in diff_diff/diagnostic_report.py:L1428-L1468 and docs/methodology/REPORTING.md:L301-L309. The numeric tier is right; the label is stale. Concrete fix: rename the rendered label to match the new numerator, or surface both raw mdv and max_abs_pre_violation explicitly.

Path to Approval

Resolve the constant-pattern contract mismatch: either remove L2 normalization for violation_type="constant" in PreTrendsPower, or document the normalized-direction contract explicitly in REGISTRY.md and docs/api/pretrends.rst so the public method definition matches the shipped code.
Add one regression that locks the chosen constant-pattern semantics end-to-end in fit() and power_at() for K > 1, so future audits cannot drift between “per-period shift” and “normalized-direction magnitude.”

R13 CI codex caught the next adjacent unit-mismatch — analogous to R12's linear/level fix but for `violation_type='constant'`. REGISTRY `## PreTrendsPower` documents the pattern as `δ_t = c` (per-period level shift), but `_get_violation_weights('constant')` still L2-normalized to `[1/√K, ..., 1/√K]`, so `δ_t = M/√K` not `δ_t = M`. The Step 4 γ-unit fix for linear and the R12 holistic `max_abs_pre_violation` fix uncovered this — once `mdv` is the documented magnitude of a per-period shift on linear, the constant pattern's silent `1/√K` re-scaling becomes visible and breaks the documented contract. **Holistic fix — constant/last_period skip L2 normalization** - `_get_violation_weights('constant')`: early return `np.ones(n_pre)`, no L2 norm. Now `δ_t = M` exactly per period — matching the REGISTRY/API documented contract. - `_get_violation_weights('last_period')`: also given an explicit early return for symmetry. The `[0, ..., 0, 1]` vector already had L2 norm 1, so this is a no-op numerically; the early return locks the contract uniformly across level-pattern violation types. - `power_at()` legacy reconstruction (fallback for old serialized results without `violation_weights`): unchanged — still L2-normalizes, preserving pre-PR-B numerical output for legacy serialized fits per the same backwards-compat policy applied to the linear-legacy and constant-legacy paths. - Docstring on `_get_violation_weights` rewritten to enumerate the per-type normalization convention explicitly: linear-with-times (γ-unit), linear-legacy (L2-norm), constant (level), last_period (level), custom (L2-norm). **End-to-end regression** in `test_methodology_pretrends.py`: - `test_constant_violation_pattern_is_level_shift`: real `SunAbraham`-fit results, asserts `violation_weights == [1, ..., 1]` (NOT L2-normalized → `||w||_2 = √K`), `max_abs_pre_violation == mdv` (level-scale and γ-scale coincide for constant), and `power_at(M) ≈ refit(M=M).power` at `atol=1e-4` so the level-shift contract holds end-to-end through `fit()` and `power_at()`. Codex specifically requested an end-to-end lock so future audits cannot drift between "per-period shift" and "normalized-direction magnitude". - `test_constant_weights` in `tests/test_pretrends.py` flipped: was pinning `np.linalg.norm(weights) == 1.0` (the OLD L2-normalized contract); now asserts unnormalized `[1, 1, 1, 1]` with `||w||_2 = 2` (√4). Docstring explains the contract change. **P3 fix — BR rendered label** `BusinessReport.full_report()` still labeled the `mdv_share_of_att` row as `MDV / |ATT|`, but R12 redefined the numerator as `max_abs_pre_violation`. Fix: rename the rendered label to "Max pre-period level deviation / |ATT|" and add an explicit row for `Max pre-period level deviation at MDV` above it so users can see both the raw `mdv` and the level-scale scalar. **Behavior change for users:** any caller passing `violation_type='constant'` will see a √K-factor change in the reported `mdv` and downstream `mdv_share_of_att`. The shift is documented in REGISTRY `## PreTrendsPower` (the pattern was `δ_t = c` all along — the IMPLEMENTATION is now what the docs have always said). 445 tests pass across pretrends + DR + BR + SA. 4 skipped (R-parity + 1 fixture skip). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T11:21:50Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 049f04ced170d3788c9dbccf83115cd17ab295b7

Overall Assessment

✅ Looks good

Executive Summary

The prior rerun P1 on the constant-pattern contract appears resolved across the estimator, registry, API docs, and regressions: constant violations are now treated as a true level shift rather than an L2-normalized direction. diff_diff/pretrends.py:L951-L959, docs/methodology/REGISTRY.md:L2812-L2820, tests/test_methodology_pretrends.py:L527-L563
I did not find any new unmitigated P0/P1 methodology issues in the estimator changes. The NIS/Wald split, full-Σ_22 routing for non-bootstrap CS/SA, γ-unit linear MDV path, and custom-weight persistence all line up with the updated registry.
Severity P2: the new max_abs_pre_violation metric is computed in DiagnosticReport and expected by the BR full-report renderer, but _lift_pre_trends() never carries it into the BusinessReport schema. That means the new raw level-scale metric is still dropped on the BR path. diff_diff/diagnostic_report.py:L1466-L1477, diff_diff/business_report.py:L924-L931, diff_diff/business_report.py:L2469-L2481
Severity P3: several narrative/doc surfaces still describe the tier numerator as raw “MDV” even though the code now uses max_abs_pre_violation / |ATT|; a couple of in-code docs also still describe the old constant-normalization contract. diff_diff/business_report.py:L2157-L2167, diff_diff/diagnostic_report.py:L3279-L3286, diff_diff/pretrends.py:L289-L300, diff_diff/pretrends.py:L370-L373
R parity is still intentionally deferred and explicitly tracked, so that remains non-blocking. TODO.md:L97-L98, benchmarks/R/generate_pretrends_golden.R:L1-L24, tests/test_methodology_pretrends.py:L1095-L1135
I could not run the touched tests here because this workspace is missing pytest and numpy.

Methodology

No findings. The estimator-side changes now match the updated REGISTRY.md contract closely enough that I do not see an unmitigated P1+ methodology defect in the touched paths.

Code Quality

Severity P2. Impact: DiagnosticReport now computes and surfaces max_abs_pre_violation, and _render_full_report() is prepared to print it, but BusinessReport._lift_pre_trends() drops the field entirely. As written, BR schema consumers and full_report() still cannot see the new raw level-scale metric even though the PR added renderer logic for it. diff_diff/diagnostic_report.py:L1466-L1477, diff_diff/business_report.py:L924-L931, diff_diff/business_report.py:L2469-L2481
Concrete fix: add "max_abs_pre_violation": pp.get("max_abs_pre_violation") in _lift_pre_trends(), then add an end-to-end BR regression on both to_dict()["pre_trends"] and full_report().

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: R-package parity is still deferred; the generator is intentionally placeholder-only and the parity suite is skip-gated until PR-C lands. This is already tracked in TODO.md, so it is non-blocking. TODO.md:L97-L98, benchmarks/R/generate_pretrends_golden.R:L1-L24, tests/test_methodology_pretrends.py:L1095-L1135
Concrete fix: land PR-C with a pinned pretrends revision, committed JSON goldens, and activated parity assertions.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the numeric tiering now uses max_abs_pre_violation / |ATT|, but some user-facing prose and in-code docs still say “MDV” is what is being compared to the ATT, and the autosummary page still omits the new property. That leaves the reporting explanation partially out of sync with the actual metric, especially on irregular-grid linear fits where raw MDV is in γ units. diff_diff/business_report.py:L2157-L2167, diff_diff/diagnostic_report.py:L3279-L3286, diff_diff/pretrends.py:L289-L300, diff_diff/pretrends.py:L370-L373, docs/api/_autosummary/diff_diff.PreTrendsPowerResults.rst:L24-L47
Concrete fix: rename the summary prose to reference the “max pre-period level deviation at the MDV,” update the stale constant-weight docstrings, add ~PreTrendsPowerResults.max_abs_pre_violation to the autosummary, and extend the report tests to assert the renamed metric end-to-end.

R14 verdict ✅ Looks good with one P2 BR-schema-gap + a P3 fan-out of stale "MDV vs ATT" prose that the R12 level-scale fix needed to sweep through but missed. **P2 — `_lift_pre_trends` dropped `max_abs_pre_violation`** R12 added `max_abs_pre_violation` to the `DiagnosticReport` schema and wired the `BusinessReport.full_report()` renderer to print it, but the BR schema-lift helper at `business_report.py:884` did NOT carry the field across the DR → BR boundary. Net effect: `BR.to_dict()["pre_trends"]` was missing the field AND `full_report()`'s `pt.get("max_abs_pre_violation")` returned None, so the new "Max pre-period level deviation at MDV" line never actually rendered. Fix: add `"max_abs_pre_violation": pp.get("max_abs_pre_violation")` to the `_lift_pre_trends` return dict. New BR end-to-end regression asserts both `BR.to_dict()["pre_trends"]["max_abs_pre_violation"]` is populated AND `full_report()` contains the rendered line. **P3 — Stale "MDV / |ATT|" prose in 4 surfaces** R12 moved the tier numerator from raw `mdv` to `max_abs_pre_violation` but several user-facing prose surfaces still said the comparison was between "MDV" and "estimated effect" — wording lag, not a behavioral bug. 1. `business_report.py:2167` "the test is well-powered" summary sentence: reworded to say "the max pre-period level deviation at the MDV is small relative to the estimated effect" rather than the bare "minimum-detectable violation is small". 2. `diagnostic_report.py:3284` DR "no_detected_violation / well_powered" sentence: same swap from "MDV is a small share of the estimated effect" to "the max pre-period level deviation at the MDV is a small share". 3. `PreTrendsPowerResults.violation_weights` docstring: reworded to enumerate per-violation_type normalization explicitly (linear with-times γ-unit; linear legacy L2-norm; constant unnormalized level-shift; last_period level; custom L2-norm). 4. `PreTrendsPowerResults.max_abs_pre_violation` property docstring (the non-linear-types paragraph): updated to reflect the R13 constant-level-shift change (`mdv * 1 = mdv` rather than the old `mdv / √K`). Plus the autosummary RST adds the new `~PreTrendsPowerResults.max_abs_pre_violation` property entry so the published API page lists it. R-parity P3 deferred to PR-C per the existing TODO row. 591 tests pass; no code-path regressions. The new BR regression catches the lift-boundary bug if it ever regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T11:32:33Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 853e52336e44992b5730bbc476004a905077da0a

Overall Assessment

✅ Looks good

Executive Summary

The prior rerun P2 is resolved: BusinessReport._lift_pre_trends() now carries max_abs_pre_violation, and the BR full report renders the level-scale metric instead of dropping it at the lift boundary. diff_diff/business_report.py:L884-L939, diff_diff/business_report.py:L2477-L2489
The prior rerun P3 doc drift is resolved on the changed surfaces: reporting prose, PreTrendsPowerResults docs, and autosummary now match the level-scale ratio contract. diff_diff/diagnostic_report.py:L3279-L3287, diff_diff/business_report.py:L2164-L2170, docs/api/_autosummary/diff_diff.PreTrendsPowerResults.rst:L24-L48
On methodology, the changed estimator paths line up with the updated registry: NIS is the documented default, Wald remains the documented alternative, non-bootstrap CS/SA now route full Σ_22, linear MDV uses Roth’s γ units when relative times are available, and custom weights persist through power_at(). diff_diff/pretrends.py:L39-L237, diff_diff/pretrends.py:L876-L1595, diff_diff/sun_abraham.py:L782-L810, docs/methodology/REGISTRY.md:L2773-L2854
The covariance-provenance fix is correctly persisted at fit time and honored by both live and precomputed report paths, with the legacy conservative fallback still preserved for ambiguous old serialized objects. diff_diff/pretrends.py:L1576-L1594, diff_diff/diagnostic_report.py:L1451-L1534, diff_diff/diagnostic_report.py:L1537-L1614
Severity P3 only: R-package parity is still intentionally deferred to PR-C and is properly tracked, so it is non-blocking. TODO.md:L97-L98, METHODOLOGY_REVIEW.md:L1050-L1072, benchmarks/R/generate_pretrends_golden.R:L1-L24
I could not execute the touched tests here because this environment does not have pytest, numpy, scipy, or pandas.

Methodology

No findings. The changed implementation matches the updated Methodology Registry on the load-bearing points: NIS box probability, Wald alternative, full-Σ_22 routing, γ-unit linear weights, and custom-weight persistence. docs/methodology/REGISTRY.md:L2775-L2854, diff_diff/pretrends.py:L1225-L1595

Code Quality

No findings. The previous BR lift-boundary regression is fixed and covered by an end-to-end CS regression. diff_diff/business_report.py:L884-L939, tests/test_business_report.py:L2425-L2515

Performance

No findings.

Maintainability

No findings. Persisting covariance_source on PreTrendsPowerResults removes the old type-based guesswork on new fits and confines fallback inference to true legacy cases. diff_diff/pretrends.py:L323-L333, diff_diff/pretrends.py:L1576-L1594, diff_diff/diagnostic_report.py:L1511-L1614

Tech Debt

Severity P3. Impact: R pretrends parity is still deferred; the committed generator is intentionally a placeholder with NA_real_ stubs and an unresolved <PR-C-PIN>, so the new parity class remains skip-gated. This is already tracked in TODO.md / METHODOLOGY_REVIEW.md, so it is non-blocking. Concrete fix: land PR-C with a pinned pretrends revision, committed benchmarks/data/r_pretrends_golden.json, and activated tests/test_methodology_pretrends.py::TestPretrendsParityR. benchmarks/R/generate_pretrends_golden.R:L1-L24, benchmarks/R/generate_pretrends_golden.R:L185-L223, TODO.md:L97-L98, METHODOLOGY_REVIEW.md:L1050-L1072

Security

No findings.

Documentation/Tests

No findings in the changed surfaces. The report prose, autosummary, API docs, and new regression coverage are aligned with the new level-scale metric and covariance-provenance contract. docs/methodology/REPORTING.md:L284-L365, docs/api/pretrends.rst:L57-L143, tests/test_diagnostic_report.py:L380-L591, tests/test_methodology_pretrends.py:L882-L1008
Validation note: I could not run the touched tests here because pytest, numpy, scipy, and pandas are unavailable.

igerber and others added 14 commits May 18, 2026 20:38

igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PreTrendsPower implementation audit (Roth 2022 — PR-B)#466

PreTrendsPower implementation audit (Roth 2022 — PR-B)#466
igerber wants to merge 21 commits into
mainfrom
feature/pretrends-implementation-audit

igerber commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 19, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant