
Add Staggered Triple Difference estimator#245

Open
igerber wants to merge 2 commits into main from staggered-ddd

Conversation

@igerber
Owner

@igerber igerber commented Mar 29, 2026

Summary

  • Implement StaggeredTripleDifference estimator for staggered adoption DDD designs (Ortiz-Villavicencio & Sant'Anna 2025)
  • Group-time ATT(g,t) via three-DiD decomposition with DR, IPW, and regression adjustment methods
  • GMM-optimal combination across comparison cohorts with closed-form inverse-variance weights
  • Event study, group, and overall aggregation via CallawaySantAnna mixin reuse
  • Multiplier bootstrap with simultaneous confidence bands
  • control_group parameter supporting both "nevertreated" and "notyettreated" modes
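
The closed-form inverse-variance combination mentioned in the GMM bullet can be sketched generically as follows (a minimal sketch assuming uncorrelated per-cohort estimates; `inverse_variance_combine` is an illustrative helper, not the package's API):

```python
import numpy as np

def inverse_variance_combine(atts, variances):
    """Combine per-cohort ATT(g,t) estimates with closed-form
    inverse-variance weights -- the GMM-optimal solution when the
    per-cohort estimates are uncorrelated."""
    atts = np.asarray(atts, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances
    w = w / w.sum()                        # normalized weights
    att = float(w @ atts)                  # combined point estimate
    var = float(1.0 / (1.0 / variances).sum())  # combined variance
    return att, var

# Lower-variance cohorts receive more weight:
att, var = inverse_variance_combine([0.5, 0.7], [0.04, 0.01])
```

With the example inputs the second cohort gets weight 0.8, so the combined estimate sits much closer to 0.7 than to 0.5.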

Methodology references (required if estimator / math changes)

  • Method name(s): StaggeredTripleDifference (Staggered DDD)
  • Paper / source link(s): Ortiz-Villavicencio & Sant'Anna (2025). "Better Understanding Triple Differences Estimators." arXiv:2505.09942
  • Any intentional deviations from the source (and why):
    • Aggregation uses CallawaySantAnna mixin (cohort-size weights) rather than R's agg_ddd() group-probability weights — documented in REGISTRY.md
    • Propensity score trimming uses soft clipping vs R's hard exclusion — documented in REGISTRY.md
    • Survey design support deferred (raises NotImplementedError)
    • Cluster-robust analytical SEs deferred (bootstrap provides unit-level clustering)

Validation

  • Tests added/updated:
    • tests/test_staggered_triple_diff.py — 40 unit tests (init, fit, recovery, methods, event study, GMM, bootstrap, edge cases)
    • tests/test_methodology_staggered_triple_diff.py — 40 R cross-validation tests against pre-computed triplediff golden values
  • R cross-validation results: Group-time ATT and SE match R within 0.001% across 10 scenarios (3 seeds, 3 methods, both control group modes). Aggregation within 25% (documented weighting difference).
  • Paper review: docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md (not committed — gitignored)
  • R benchmark script: benchmarks/R/benchmark_staggered_triplediff.R

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…'Anna 2025)

Implements the staggered DDD estimator with group-time ATT(g,t), GMM-optimal
combination across comparison cohorts, event study aggregation, and multiplier
bootstrap. Core pairwise DiD computation matches R triplediff package exactly
(Riesz/Hajek normalization, PS + OR influence function corrections). Validated
against R: group-time ATT and SE match within 0.001% across 10 scenarios.
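
The Hajek (self-normalized) weighting mentioned above can be illustrated generically (a sketch, not the package's implementation; `d` is a treatment indicator and `ps` the estimated propensity scores, both illustrative names):

```python
import numpy as np

def hajek_weights(d, ps):
    """Self-normalized IPW weights for the comparison group:
    raw odds ps / (1 - ps) on control units, rescaled to sum to one
    so the weighted mean is a proper (Hajek) average."""
    w = (1 - d) * ps / (1 - ps)
    return w / w.sum()

# One treated unit (weight 0) and two controls with odds 1 and 1/3:
w = hajek_weights(np.array([1, 0, 0]), np.array([0.5, 0.5, 0.25]))
```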

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⚠️ Needs changes

This PR adds StaggeredTripleDifference, the staggered DDD estimator from Ortiz-Villavicencio & Sant’Anna (2025). The three-DiD / comparison-cohort / GMM structure is broadly aligned with the source material, but I found four unmitigated P1 issues: one undocumented methodology deviation, two inference/input-contract bugs, and one missing parameter-propagation issue. (arxiv.org)

Executive Summary

  • P1: The default propensity-score clipping changes the estimator when overlap is poor, but the new REGISTRY entry does not record that as a Note / Deviation from R.
  • P1: The implementation does not enforce the balanced-panel / stable-unit contract that the method and companion API expect.
  • P1: Bootstrap group-aggregation results are computed but never propagated back into returned group_effects.
  • P1: control_group is not propagated into the results object, so returned results lose essential estimand metadata.
  • P2: The new R cross-validation tests depend on CSV fixtures that are not committed in the PR, so they will skip in CI instead of validating parity.

Methodology

  • Severity: P1. Impact: diff_diff/staggered_triple_diff.py#L98 and diff_diff/staggered_triple_diff.py#L823 introduce substantive pscore_trim=0.01 clipping in the ATT weights, but the companion triplediff source only warns on poor overlap and then uses the estimated scores directly aside from an upper numerical cap near 1. The REGISTRY mention at docs/methodology/REGISTRY.md#L1269 is a plain bullet, not one of the allowed Note / Deviation from R labels, so under this review policy the deviation is still undocumented. Concrete fix: either remove substantive clipping from the default estimator path, or add a properly labeled REGISTRY deviation note and tests that lock in the intended divergence. (rdrr.io)
  • Severity: P1. Impact: diff_diff/staggered_triple_diff.py#L506 only validates eligibility and missingness, while diff_diff/staggered_triple_diff.py#L557 and diff_diff/staggered_triple_diff.py#L563 silently overwrite duplicate (unit,time) rows and collapse unit-level metadata with groupby(...).first(). That means unbalanced panels, time-varying first_treat, and time-varying covariates can be silently coerced into a different estimand instead of being rejected. Concrete fix: before precomputation, add explicit checks for unique (unit,time), equal period counts per unit, time-invariant first_treat, and time-invariant covariates when covariates is supplied. The companion API treats unbalanced panels as an explicit opt-in, and the paper still lists unbalanced / repeated-cross-section support as future work. (rdrr.io)
  • Severity: P1. Impact: diff_diff/staggered_bootstrap.py#L470 computes bootstrap group_effect_ses / CIs / p-values, but diff_diff/staggered_triple_diff.py#L423 only copies bootstrap inference back into overall ATT and event-study outputs. With n_bootstrap > 0 and aggregate="group" or "all", returned group_effects still show analytical SEs/p-values/CIs, which is the wrong inference path. Concrete fix: add a group-effect bootstrap update block matching the existing pattern in diff_diff/staggered.py#L1643.
  • Severity: P3. Impact: cluster is accepted in diff_diff/staggered_triple_diff.py#L99 but not wired into analytical SEs; this is explicitly documented in docs/methodology/REGISTRY.md#L1356, so it is non-blocking under the review rules. Concrete fix: optional warning when cluster is non-None.
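
The clipping deviation in the first bullet amounts to the difference between these two treatments of extreme scores (an illustrative contrast; the function names and the 0.01 default are assumptions, not the package's API):

```python
import numpy as np

def clip_pscores(ps, trim=0.01):
    """Soft clipping: extreme scores are bounded away from 0 and 1,
    but every unit keeps a (possibly large) weight."""
    return np.clip(ps, trim, 1.0 - trim)

def exclude_pscores(ps, trim=0.01):
    """Hard exclusion (the R-style alternative): units outside the
    overlap region are dropped from the estimation sample."""
    keep = (ps >= trim) & (ps <= 1.0 - trim)
    return ps[keep], keep
```

Under poor overlap the two paths produce different estimation samples and therefore different estimands, which is why the review asks for a labeled REGISTRY deviation.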

Code Quality

  • No material findings beyond the issues above.

Performance

  • No material findings in the reviewed diff.

Maintainability

Tech Debt

  • No existing TODO.md entry mitigates the P1 issues above, so no downgrade applies.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Resolve the pscore_trim deviation by either matching the source / companion implementation in the default path or documenting the divergence with a properly labeled REGISTRY note.
  2. Add hard input checks for balanced panels, unique (unit,time) rows, invariant first_treat, and invariant covariates.
  3. Propagate bootstrap_results.group_effect_* into returned group_effects when bootstrap inference is requested.
  4. Add control_group to StaggeredTripleDiffResults and surface it in summaries / serialized output.
  5. Make the R cross-validation tests runnable in CI by supplying the missing CSV fixtures or generating them in test setup, and add a bootstrap+group-aggregation regression test.

- Add balanced-panel validation: unique (unit,time), equal periods per unit,
  time-invariant first_treat and covariates
- Propagate bootstrap group-effect SEs/CIs/p-values into returned results
- Add control_group to StaggeredTripleDiffResults and surface in summary/to_dict
- Add Deviation from R label for pscore_trim in REGISTRY.md
- Warn when cluster parameter is set (analytical cluster SEs not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f2e25fb48eecd4fea0b3d693553fdad1c712f7d1


Overall Assessment

⛔ Blocker

The re-review shows that several prior issues are fixed, but one unmitigated P0 remains in the new estimator path, along with multiple P1s. The blocking problem is incorrect notyettreated comparison-group construction when anticipation > 0, which can silently contaminate controls and bias ATT(g,t).

Executive Summary

  • The earlier issues around control_group result metadata, group-effect bootstrap propagation, and the labeled pscore_trim registry note appear addressed.
  • P0: anticipation is not propagated into the notyettreated comparison-cohort filter, so cohorts inside the anticipation window can still enter the control pool.
  • P1: bootstrap group-time inference is computed but never written back into results.group_time_effects, so callers still see analytical SEs/CIs/p-values after requesting bootstrap inference.
  • P1: the balanced-panel validation is still incomplete; equal per-unit period counts can pass even when units are missing different periods.
  • P1: the aggregation-weight deviation from agg_ddd() is described, but not under a **Note:** / **Deviation from R:** label, so under this review policy it is still undocumented.
  • P2: the new R cross-validation suite still skips in CI because it requires CSV fixtures that are not committed in this PR.

Methodology

  • Severity: P0. Impact: diff_diff/staggered_triple_diff.py:L282-L295 builds valid_gc from max(g, t) only, while anticipation only affects the base period in diff_diff/staggered_triple_diff.py:L676-L684. The project’s methodology registry documents the not-yet-treated rule as first_treat > max(t, base_period) + anticipation in docs/methodology/REGISTRY.md:L352-L356 and docs/methodology/REGISTRY.md:L414-L417. As written, anticipation=1 still lets a cohort treated at t+1 serve as a control at time t, which changes control composition and can bias estimates. Concrete fix: derive the comparison-group threshold from max(t, base_period_val) + self.anticipation for control_group="notyettreated" (or reject that parameter combination explicitly), and add a regression test where a cohort inside the anticipation window must be excluded.
  • Severity: P1. Impact: diff_diff/staggered_triple_diff.py:L439-L500 replaces overall, event-study, and group-level inference from bootstrap_results, but never updates group_time_effects. That leaves results.group_time_effects and to_dataframe("group_time") in diff_diff/staggered_triple_diff_results.py:L228-L257 reporting analytical SEs/CIs/p-values even when n_bootstrap > 0. The existing staggered DiD implementation already updates group-time inference in diff_diff/staggered.py:L1600-L1615. Concrete fix: add the same group-time bootstrap update block here before the event-study/group updates, and lock it with a regression test that compares group_time_effects[(g, t)]["se"] against bootstrap_results.group_time_ses[(g, t)].
  • Severity: P1. Impact: the new balanced-panel guard in diff_diff/staggered_triple_diff.py:L586-L593 only checks that each unit has the same number of observed periods, not that each unit has the full global time set. A panel where all units have three observations but are missing different periods still passes validation, then gets silently dropped cell-by-cell by the finite-data check in diff_diff/staggered_triple_diff.py:L730-L735, which violates the “requires balanced panel” contract documented in docs/methodology/REGISTRY.md:L1262-L1278. Concrete fix: require each unit’s observed time set to equal the full df[time].unique() set, or equivalently require per-unit unique period counts to equal the global period count.
  • Severity: P1. Impact: the PR explicitly normalizes a different aggregation estimand in tests/test_methodology_staggered_triple_diff.py:L181-L190 and describes the difference in docs/methodology/REGISTRY.md:L1391-L1395, but the StaggeredTripleDifference registry entry does not record that as a labeled **Note:** / **Deviation from R:**. Under the review policy, that means the weighting change is still an undocumented methodology deviation. Concrete fix: add a labeled **Deviation from R:** (or **Note (deviation from R):**) paragraph in the StaggeredTripleDifference section that explicitly states that event-study / simple aggregation reuse Callaway-Sant’Anna cohort-size weights instead of agg_ddd()’s group-probability weights.
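
The P0 eligibility rule can be stated compactly (a sketch of the condition the registry reportedly documents; the function and the never-treated sentinel value are illustrative):

```python
def not_yet_treated_ok(first_treat_c, t, base_period, anticipation=0,
                       never_treated_code=0):
    """A cohort c may serve as a 'notyettreated' control for ATT(g, t)
    only if first_treat_c > max(t, base_period) + anticipation,
    or c is never treated."""
    if first_treat_c == never_treated_code:
        return True
    return first_treat_c > max(t, base_period) + anticipation
```

For example, with `anticipation=1` a cohort first treated at `t + 1` fails the check and must be excluded from the control pool, which is exactly the regression case the review asks for.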

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings in the reviewed diff.

Maintainability

No additional findings beyond the missing bootstrap propagation branch already noted above.

Tech Debt

No findings. TODO.md does not already track the P0/P1 issues above, so no mitigation applies.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Fix notyettreated comparison-group eligibility so anticipation affects control composition, or explicitly reject that parameter combination until it is supported.
  2. Propagate bootstrap group_time_ses / CIs / p-values back into results.group_time_effects.
  3. Tighten the balanced-panel validation to require full common time support, not just equal per-unit counts.
  4. Add a labeled REGISTRY deviation note for the aggregation-weight change relative to agg_ddd().
  5. Make the new R methodology tests runnable in CI by committing or generating the required CSV fixtures.

Verification note: I could not run the test suite in this environment because pytest and numpy are unavailable.
