Merge pull request #463 from igerber/feature/roth-2022-paper-review

igerber · web-flow · commit f6499db51f7b · 2026-05-17T12:55:22.000-04:00
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
@@ -1053,11 +1053,11 @@ and covariate-adjusted specifications.)
 **Documentation in place:**
 - REGISTRY.md section: `## PreTrendsPower` (MDV at target power, four violation types — linear/constant/last_period/custom, power curve plotting, HonestDiD integration)
 - Implementation: `tests/test_pretrends.py` (point-estimator, MDV, power curve, sensitivity) plus event-study coverage in `tests/test_pretrends_event_study.py`
+- Paper review on file: `docs/methodology/papers/roth-2022-review.md` (added 2026-05-17; non-authoritative source audit — registry entry remains authoritative until the follow-up audit PR)
 
 **Outstanding for promotion:**
-- Paper review under `docs/methodology/papers/roth-2022-review.md`
 - Dedicated `tests/test_methodology_pretrends.py` with paper-equation-numbered Verified Components walk-through
-- R parity fixture against the `pretrends` R package (the four power calculations: linear, constant, last-period, custom)
+- R parity fixture against the `pretrends` R package at a **pinned revision** (TODO.md tracks the revision-pin follow-up; until that lands, the R-package surface claims in `docs/methodology/papers/roth-2022-review.md` are provisional). Covers the four power calculations: linear, constant, last-period, custom. Note that `compute_pretrends_power` does not accept `violation_weights` today, so `"custom"` parity has to run through `PreTrendsPower(..., violation_weights=...)` directly until the helper is extended (TODO.md tracks the helper-extension follow-up); helper-only parity is limited to `linear` / `constant` / `last_period`.
 - Verify the REGISTRY Implementation Checklist (all four items currently unchecked)
 
 ---
diff --git a/TODO.md b/TODO.md
@@ -94,6 +94,11 @@ Deferred items from PR reviews that were not addressed before merge.
 | WooldridgeDiD: aggregation weights use cell-level n_{g,t} counts. Paper (W2025 Eqs. 7.2-7.4) defines cohort-share weights. Add optional `weights="cohort_share"` parameter to `aggregate()`. | `wooldridge_results.py` | #216 | Medium |
 | WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low |
 | WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium |
+| PreTrendsPower: `compute_pretrends_power` adapter uses `diag(ses^2)` instead of the full pre-period covariance block Σ_22 for `CallawaySantAnnaResults` (deliberate — non-bootstrap CS persists `event_study_vcov`; bootstrap CS fits clear it at `staggered.py:2032-2036`) and `SunAbrahamResults` (forced — SA does not expose an event-study/cohort VCV at all). Roth (2022)'s NIS box probability and the library's Wald object both depend on Σ_22 off-diagonals; diag fallback is not provably conservative. For non-bootstrap CS fits, route through `event_study_vcov`; for bootstrap CS fits the diag fallback is the only path. For SA, extend `SunAbrahamResults` to persist a cohort/event-study VCV (then route the adapter likewise). Or formally retain the diag fallback with explicit miscalibration framing. See REGISTRY.md `## PreTrendsPower` Note (deviation from paper) + `docs/methodology/papers/roth-2022-review.md`. | `diff_diff/pretrends.py:609-687`, `diff_diff/sun_abraham.py:30-88`, `docs/methodology/REGISTRY.md`, `docs/methodology/papers/roth-2022-review.md` | PR-A (Roth paper review, 2026-05-17) | Medium |
+| PreTrendsPower: pin the R `pretrends` package commit/release before building the R-parity fixture. The paper review's R-package surface claims (`pretrends()`, `slope_for_power()`, NIS-only API, no joint-Wald target) are provisional pending a pinned revision; the audited revision should be recorded either in the review file's Gaps section or in this TODO row before any parity assertions are committed. | `docs/methodology/papers/roth-2022-review.md`, `METHODOLOGY_REVIEW.md` (PreTrendsPower row) | PR-A (Roth paper review, 2026-05-17) | Low |
+| PreTrendsPower: helper `compute_pretrends_power(results, M, alpha, target_power, violation_type, pre_periods)` does NOT accept `violation_weights`, so `violation_type="custom"` is unusable from the helper (class-only today via `PreTrendsPower(..., violation_weights=...)`). Either add `violation_weights` to the helper signature and forward to the class, or document the helper as supporting only `linear` / `constant` / `last_period`. | `diff_diff/pretrends.py:1048-1095, 442-466` | PR-A (Roth paper review, 2026-05-17) | Low |
+| PreTrendsPower: `PreTrendsPowerResults.power_at()` does not yet support `violation_type="custom"`. **Silent-failure path was mitigated** in PR-A (2026-05-17, R18 of the codex review): `power_at()` now raises `NotImplementedError` for custom fits rather than returning equal-weights output, locked in by `test_power_at_raises_on_custom_violation_type`. Remaining follow-up: persist the normalized fitted `violation_weights` on `PreTrendsPowerResults` (currently absent at `pretrends.py:77-90`) and re-enable `power_at()` for custom fits, with a parity test comparing `results.power_at(M)` to a fresh `PreTrendsPower(...).fit(..., M=M).power` on a custom-weights fixture. | `diff_diff/pretrends.py:77-90, ~196-235, ~878-892` | PR-A (Roth paper review, 2026-05-17) | Medium |
+| PreTrendsPower: `linear` violation pattern does NOT implement Roth's δ_t = γ·t. `_get_violation_weights(violation_type="linear")` constructs a shifted, normalized `[n-1, ..., 1, 0]` direction from `n_pre` only (`pretrends.py:510-515`), and `fit()` never threads actual relative-time labels into that construction (`pretrends.py:862-866`). For irregular pre-period grids (e.g., anticipation-shifted `t ∈ {-5, -3, -1}`) this means the slope reported as MDV is not in Roth's γ units. Fix: build linear weights from the sorted actual relative-time values used in the fit, define the exposed parameter in γ units, persist any normalization separately, and add a regression test using anticipation-shifted / irregular pre-periods. If the shifted convention is intentional, add a `**Note (deviation from paper):**` to REGISTRY.md and convert reported MDV back to Roth's slope scale before exposing it. | `diff_diff/pretrends.py:488-531, 862-866`, `docs/methodology/REGISTRY.md:2786-2789` | PR-A (Roth paper review, 2026-05-17; surfaced by R17 of the iterative codex review on the paper review file) | **High** |
 | Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose `cluster=`: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added `vcov_type` to the `DifferenceInDifferences` inheritance chain only. | multiple | Phase 1a | Medium |
 | Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium |
 | HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits — REMAINING sub-gate: `TwoWayFixedEffects` (`twfe.py:154` rejects unconditionally). The DiD sub-gate and the MultiPeriodDiD sub-gate were both lifted via auto-route to `fixed_effects=` internally (DiD: PR #458, ~1e-10 vs clubSandwich; MPD: this release, ~1e-10 vs sandwich::vcovHC and clubSandwich::vcovCR). TWFE has no equivalent `fixed_effects=` code path (always within-transforms), so the same auto-route surgery is not directly applicable — lifting requires either building the full-dummy design inline or refactoring TWFE to delegate to DiD. Within-transformation preserves coefficients and residuals under FWL but not the hat matrix; HC1/CR1 are unaffected (no leverage term). | `twfe.py::fit` | follow-up | Medium |
diff --git a/diff_diff/pretrends.py b/diff_diff/pretrends.py
@@ -209,9 +209,31 @@ def power_at(self, M: float) -> float:
         -------
         float
             Power to detect violation of magnitude M.
+
+        Raises
+        ------
+        NotImplementedError
+            If the fit was made with ``violation_type="custom"``. The
+            ``PreTrendsPowerResults`` dataclass does not currently persist
+            the fitted ``violation_weights``, so this method cannot
+            reconstruct the custom weights. Refit
+            ``PreTrendsPower(violation_type="custom", violation_weights=...)``
+            with the new ``M`` instead. Tracked in TODO.md as a planned
+            follow-up to persist the fitted weights.
         """
         from scipy import stats
 
+        if self.violation_type == "custom":
+            raise NotImplementedError(
+                "PreTrendsPowerResults.power_at() does not support "
+                "violation_type='custom': fitted violation_weights are "
+                "not persisted on the result object, so the custom weights "
+                "cannot be reconstructed. Refit "
+                "PreTrendsPower(violation_type='custom', "
+                "violation_weights=...) with the new M instead. "
+                "See TODO.md (PreTrendsPower power_at custom path)."
+            )
+
         n_pre = self.n_pre_periods
 
         # Reconstruct violation weights based on violation type
@@ -227,8 +249,14 @@ def power_at(self, M: float) -> float:
             weights = np.zeros(n_pre)
             weights[-1] = 1.0
         else:
-            # For custom, we can't reconstruct - use equal weights as fallback
-            weights = np.ones(n_pre)
+            # Fail loud on unknown violation_type values. Mirrors the raise
+            # at the end of _get_violation_weights(); prevents silent
+            # equal-weights output if a future violation_type is added to
+            # fit() but not threaded through power_at().
+            raise ValueError(
+                f"Unknown violation_type: {self.violation_type!r}. "
+                f"Expected one of: 'linear', 'constant', 'last_period', 'custom'."
+            )
 
         # Normalize weights to unit L2 norm
         norm = np.linalg.norm(weights)
@@ -1067,7 +1095,18 @@ def compute_pretrends_power(
     target_power : float, default=0.80
         Target power for MDV calculation.
     violation_type : str, default='linear'
-        Type of violation pattern.
+        Type of violation pattern. This convenience helper supports
+        ``linear`` / ``constant`` / ``last_period`` only and does NOT
+        accept ``violation_weights``, so passing
+        ``violation_type='custom'`` will raise ``ValueError`` from the
+        underlying ``PreTrendsPower`` constructor (which requires
+        ``violation_weights`` when ``violation_type='custom'``). To use a
+        custom violation pattern, instantiate ``PreTrendsPower(...,
+        violation_weights=...)`` directly. Note that
+        ``PreTrendsPowerResults.power_at()`` on such a fit raises
+        ``NotImplementedError`` because fitted weights are not yet
+        persisted on the result object; refit with the new ``M`` instead.
+        Both gaps are tracked in TODO.md until the follow-up audit lands.
     pre_periods : list of int, optional
         Explicit list of pre-treatment periods. If None, attempts to infer
         from results. Use when you've estimated all periods as post_periods.
@@ -1114,7 +1153,18 @@ def compute_mdv(
     target_power : float, default=0.80
         Target power for MDV calculation.
     violation_type : str, default='linear'
-        Type of violation pattern.
+        Type of violation pattern. This convenience helper supports
+        ``linear`` / ``constant`` / ``last_period`` only and does NOT
+        accept ``violation_weights``, so passing
+        ``violation_type='custom'`` will raise ``ValueError`` from the
+        underlying ``PreTrendsPower`` constructor (which requires
+        ``violation_weights`` when ``violation_type='custom'``). To use a
+        custom violation pattern, instantiate ``PreTrendsPower(...,
+        violation_weights=...)`` directly. Note that
+        ``PreTrendsPowerResults.power_at()`` on such a fit raises
+        ``NotImplementedError`` because fitted weights are not yet
+        persisted on the result object; refit with the new ``M`` instead.
+        Both gaps are tracked in TODO.md until the follow-up audit lands.
     pre_periods : list of int, optional
         Explicit list of pre-treatment periods. If None, attempts to infer
         from results. Use when you've estimated all periods as post_periods.
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2761,7 +2761,7 @@ CRITICAL: δ_pre = β_pre pins pre-treatment violations to observed coefficients
 
 ## PreTrendsPower
 
-**Primary source:** [Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. *American Economic Review: Insights*, 4(3), 305-322.](https://doi.org/10.1257/aeri.20210236)
+**Primary source:** [Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. *American Economic Review: Insights*, 4(3), 305-322.](https://doi.org/10.1257/aeri.20210236). Paper review on file: `docs/methodology/papers/roth-2022-review.md` (non-authoritative source audit; this REGISTRY entry remains the authoritative methodology contract).
 
 **Key implementation requirements:**
 
@@ -2793,6 +2793,10 @@ Violation types:
 - **Last period**: δ_{-1} = c, others zero
 - **Custom**: user-specified pattern
 
+- **Note (deviation from paper — `linear` violation pattern):** the shipped `PreTrendsPower._get_violation_weights("linear")` constructs `[n_pre-1, ..., 1, 0]` from `n_pre` alone and `PreTrendsPower.fit()` never threads the actual relative-time labels into that construction (`pretrends.py:488-531`, `pretrends.py:862-866`). For irregular or anticipation-shifted pre-period grids (e.g., `t ∈ {-5, -3, -1}`), this means the slope reported as MDV is NOT in Roth's `γ` units — the shifted/normalized direction effectively assumes contiguous relative times `{-(n_pre-1), ..., -1}`. The follow-up audit (tracked in TODO.md) will either rebuild `linear` weights from the sorted actual relative-time values and expose the parameter in Roth's `γ` units, or formally retain the current shifted/normalized contract with this Note as the deviation record.
+
+- **Note (silent-failure guard — `power_at()` with `violation_type="custom"`):** `PreTrendsPowerResults` does not currently persist the fitted `violation_weights`, so `power_at(M)` cannot reconstruct the custom direction. As of this commit, `PreTrendsPowerResults.power_at()` raises `NotImplementedError` for `violation_type="custom"` rather than silently returning equal-weights output. To compute power at a new `M` for a custom fit, refit `PreTrendsPower(violation_type="custom", violation_weights=...)` with the new `M`. Tracked in TODO.md as a planned follow-up to persist the fitted weights and lift the guard.
+
 *Standard errors:*
 - Power calculations are exact (no sampling variability)
 - Uncertainty comes from estimated Σ
@@ -2802,6 +2806,13 @@ Violation types:
 - Single pre-period: power calculation trivial
 - Very high power: MDV approaches zero
 
+- **Note (deviation from paper — diagonal pre-period VCV fallback):** Roth (2022)'s power and bias objects (both the paper-analyzed NIS box probability and the library's Wald / noncentral-χ² form) operate on the full pre-period covariance block Σ_22. The shipped `compute_pretrends_power` adapter currently uses different sources for the pre-period covariance by result type:
+  - `MultiPeriodDiDResults` (`pretrends.py:592-601`): extracts the full pre-period sub-block from `results.vcov` when `interaction_indices` is populated; falls back to `diag(ses^2)` otherwise.
+  - `CallawaySantAnnaResults` (`pretrends.py:609-652`): hard-codes `vcov = diag(ses^2)`. Non-bootstrap CS fits persist a full `event_study_vcov` matrix (`staggered_results.py:126-128`), so the diag fallback is a deliberate choice in that path. Bootstrap CS fits clear `event_study_vcov` before storing results (`staggered.py:2032-2036`) to prevent mixing analytical VCV with bootstrap SEs, so the full-Σ22 route is not available for bootstrap fits at all.
+  - `SunAbrahamResults` (`pretrends.py:660-687`): hard-codes `vcov = diag(ses^2)`; the diag fallback is *forced* because `SunAbrahamResults` does not currently expose an event-study or cohort covariance matrix.
+
+  Dropping the off-diagonals is NOT a paper-supported numerical choice and is NOT guaranteed to be conservative for MDV/power (the direction of the discrepancy depends on the sign and magnitude of the dropped correlations). The PR-B follow-up audit (tracked in `TODO.md`) will either extend full-sub-VCV consumption to all three paths (with SA also requiring upstream surface work on `SunAbrahamResults`) or formally retain the diag fallback with explicit miscalibration framing. See `docs/methodology/papers/roth-2022-review.md` for the full derivation.
+
 **Reference implementation(s):**
 - R: `pretrends` package (Roth's official package)
 
diff --git a/docs/methodology/REPORTING.md b/docs/methodology/REPORTING.md
@@ -328,13 +328,20 @@ a library setting.
   `DiagnosticReport.pretrends_power` block records
   `covariance_source: "diag_fallback_available_full_vcov_unused"` in
   that case, and `BusinessReport` downgrades a `well_powered` tier to
-  `moderately_powered` before rendering prose. This is a known
-  conservative deviation from the documented "use the full pre-period
-  covariance" position — it prevents the diagonal approximation from
-  producing an overly optimistic "well-powered" claim when correlated
-  pre-period errors could tighten the MDV. The right long-term fix is
-  to teach `compute_pretrends_power()` to consume `event_study_vcov`
-  and `event_study_vcov_index`; until that lands this downgrade stays.
+  `moderately_powered` before rendering prose. This is a documented
+  deviation from the paper-derived "use the full pre-period covariance"
+  position. **Not provably conservative**: under Roth (2022)'s NIS
+  framework and the library's Wald form, the MDV/power objects depend
+  on the off-diagonals of Σ_22, and the direction of the discrepancy
+  between full-Σ_22 and diag(ses^2) depends on the sign and magnitude
+  of the dropped correlations — see the `**Note (deviation from paper
+  — diagonal pre-period VCV fallback):**` block under `## PreTrendsPower`
+  in `docs/methodology/REGISTRY.md`. The `well_powered → moderately_powered`
+  downgrade in BusinessReport reduces the chance of an overly optimistic
+  claim in practice, but it is not a proof of conservatism. The right
+  long-term fix is to teach `compute_pretrends_power()` to consume
+  `event_study_vcov` and `event_study_vcov_index`; until that lands the
+  downgrade stays.
 
 - **Note:** Unit-translation policy. BusinessReport does not
   arithmetically translate log-points to percents or level effects to
diff --git a/docs/methodology/papers/roth-2022-review.md b/docs/methodology/papers/roth-2022-review.md
diff --git a/tests/test_pretrends.py b/tests/test_pretrends.py