Commit 4646f4d

Merge pull request #464 from igerber/agent-workflow-discoverability
Surface agent_workflow() + curated dir() for LLM discoverability (#460)
2 parents 956445e + 3968f0c commit 4646f4d

7 files changed

Lines changed: 810 additions & 16 deletions

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
+- **`diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)` — stateless orchestrator for LLM-agent discoverability** (`diff_diff/agent_workflow.py`). Prints (and returns as dict) a copy-pasteable 5-step workflow with the caller's column names templated in: `profile_panel` → `get_llm_guide("autonomous")` → `<Estimator>(...).fit(df, ...)` → `practitioner_next_steps(result)` → `BusinessReport(result).full_report()`. The function calls nothing internally and does not inspect `df`; it is a guided tour, not a router. Surfaces the canonical workflow primitives (`profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`) that cold-start agent dry-passes at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) showed agents practically never reach for on their own. Output structure: `{"profile_call", "guide_call", "fit_candidates", "validation_calls", "reporting_call", "script"}`; `fit_candidates` is a flat list of estimator/diagnostic class names referenced in the workflow patterns (each must remain importable on `diff_diff`, locked by `tests/test_agent_workflow.py::test_fit_candidates_all_importable`). Closes [issue #460](https://github.com/igerber/diff-diff/issues/460).
+- **Top-level `__doc__` rewritten to lead with the agent workflow** (`diff_diff/__init__.py`). `help(diff_diff)` now opens with the `agent_workflow(df, ...)` recommendation as the first non-blank paragraph; `get_llm_guide("full")` and `get_llm_guide("practitioner")` pointers preserved for the existing `tests/test_guides.py::test_module_docstring_mentions_helper` guard.
+- **`dir(diff_diff)` now surfaces agent-facing entrypoints first** via a module-level `__dir__()` override paired with a small `_OrderedName(str)` subclass that subverts CPython's unconditional alphabetic sort (PyList_Sort respects `__lt__` on the elements). Agent-facing names (`agent_workflow`, `profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`, `DiagnosticReport`) appear at the head of the list; the remainder stays alphabetic via the `str.__lt__` fallback. The underlying `__all__` membership is **unchanged** and `from diff_diff import *` semantics are unaffected (driven by `__all__`, not `dir()`). Elements are `isinstance(x, str)` and compatible with `inspect.getmembers`, dict-key lookup, f-strings, and standard `str` methods; tooling that re-sorts via `sorted(dir(diff_diff))` will see priority order (use `sorted(dir(diff_diff), key=str)` to recover plain alphabetic if needed). Internal: `_AGENT_FACING_ORDER` tuple is read by the new `tests/test_agent_discoverability.py` contract test (PR B). Addresses [issue #460](https://github.com/igerber/diff-diff/issues/460) item 3.
- **`MultiPeriodDiD(cluster=..., vcov_type="hc2_bm")` now supported** (`diff_diff/estimators.py:1657`). Pre-PR the combination raised `NotImplementedError` because the cluster-aware CR2 Bell-McCaffrey Satterthwaite DOF for the post-period-average ATT (`avg_att = (1/n_post) Σ_{t ≥ t_treat} β_t`) was not implemented — only the per-coefficient case existed in `_compute_cr2_bm`. New `_compute_cr2_bm_contrast_dof` helper in `diff_diff/linalg.py` generalizes the per-coefficient loop to arbitrary `(k, m)` contrast matrices using the identical Pustejovsky-Tipton 2018 Section 4 algebra; `_compute_cr2_bm` is refactored to call it with `contrasts=eye(k)` so the existing per-coefficient parity to clubSandwich's `coef_test$df_Satt` is preserved (refactor regression at atol=1e-10). `MultiPeriodDiD.fit()` extends its existing avg_att DOF block to branch on `effective_cluster_ids`: one-way `_compute_bm_dof_from_contrasts` when None, cluster-aware `_compute_cr2_bm_contrast_dof` otherwise. Cluster IDs are per-observation length `n` and are NOT subscripted by the rank-deficient column-drop mask. R parity verified at atol=1e-10 against clubSandwich's `Wald_test(constraints=matrix(c, 1), test="HTZ")$df_denom` on the new `mpd_clustered_avg_att_dof` fixture in `benchmarks/data/clubsandwich_cr2_golden.json` (Wald_test's HTZ on a 1-row constraint matrix yields the Satterthwaite t-test DOF). Per-coefficient `period_effects[t].p_value` / `conf_int` and `avg_att` `avg_p_value` / `avg_conf_int` now reflect the correct Satterthwaite DOF rather than the n-k fallback under cluster+hc2_bm. Weighted CR2-BM (`survey_design=` paths) remains a separate gate. New tests: `tests/test_linalg_hc2_bm.py::TestCR2BMContrastDOF` (4 tests: refactor regression, R-parity, shape validation, cluster-count validation); existing `test_multi_period_cluster_plus_hc2_bm_rejected` flipped to behavioral `test_multi_period_cluster_plus_hc2_bm_produces_finite_inference`.
- **`MultiPeriodDiD(absorb=..., vcov_type in {"hc2", "hc2_bm"})` now supported** (`diff_diff/estimators.py:1476`). Mirrors the DiD-absorb auto-route shipped earlier in this release: when `absorb=` is paired with `vcov_type in {"hc2","hc2_bm"}`, `MultiPeriodDiD.fit()` promotes the absorb columns to `fixed_effects=` internally so the existing full-dummy-design code path computes the algebraically correct vcov on the event-study design (`treated + period_X dummies + treated:period_X interactions + factor(unit)`). Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=1:n, type="CR2")` on a 5-cohort × 5-period event-study fixture (new `tests/test_estimators_vcov_type.py::TestMPDAbsorbedFERParity` against `benchmarks/data/clubsandwich_cr2_golden.json` scenario `mpd_absorbed_fe_did`). HC1/CR1 paths on `absorb=` are unchanged (no leverage term). `TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` rejection remains as a follow-up (different fit-path structure — no `fixed_effects=` equivalent inside TWFE). **Behavioral note (full `MultiPeriodDiDResults` surface change under auto-route):** under the auto-route, the entire returned `MultiPeriodDiDResults` reflects the full-dummy fit rather than the within-transformed fit — `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, `result.r_squared` all include the FE-dummy entries / un-demeaned values. `result.period_effects[t].effect` / `.se` / `.p_value` / `.conf_int` and `result.avg_att` / `.avg_se` are invariant to this routing (FWL guarantee). MPD requires a time-invariant ever-treated indicator that lies in the span of the intercept and the post-auto-route unit FE dummies (the exact alias depends on the omitted FE reference category under `pd.get_dummies(drop_first=True)`, not just on "the sum of treated-cohort unit dummies"), so `solve_ols` drops one column from that collinear set under R-style rank-deficiency handling. 
Which specific column is dropped is pivot-order and dummy-coding dependent (in the shipped parity fixture it is a never-treated unit dummy, not the `treated` main effect itself). The per-period interaction coefficients (`treated:period_X`) and `avg_att` are identified and invariant to that choice; parity tests target those rather than the `treated` main effect. **Survey-design scope (replicate weights):** when `survey_design=` uses replicate weights, the auto-route short-circuits the absorb-refit branch at `estimators.py:1693` and routes through the standard `compute_replicate_vcov` path on the fixed full-dummy design — correct because the design does not depend on replicate weights so no per-replicate refit is needed. **Redundant time-FE skip:** when the routed (or directly-supplied) `fixed_effects` list contains the `time` column, MPD silently skips emitting `<time>_<X>` dummies for that entry because the design already absorbs the time dimension via the non-reference period dummies; without the skip, the two blocks would collide on dummy names and the `coefficients` dict would silently collapse duplicates under `var_names`-keyed construction, breaking the coefficients-vs-vcov alignment that downstream consumers rely on. This applies to both the new `absorb=` auto-route and the pre-existing `fixed_effects=[<time_col>]` invocation.
- **`DifferenceInDifferences(absorb=..., vcov_type in {"hc2", "hc2_bm"})` now supported** (`diff_diff/estimators.py:382`). Previously raised `NotImplementedError` because the HC2 leverage correction and CR2 Bell-McCaffrey DOF depend on the FULL FE hat matrix, while within-transformation (FWL) preserves coefficients and residuals but not the hat. Lift via internal auto-route: when `absorb=` is paired with `vcov_type in {"hc2","hc2_bm"}`, the fit promotes the absorb columns to `fixed_effects=` internally so the existing full-dummy-design code path computes the algebraically correct vcov. Empirically matches `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=..., type="CR2")` at ~1e-10 (verified via new `tests/test_estimators_vcov_type.py::TestDiDAbsorbedFERParity` against `benchmarks/data/clubsandwich_cr2_golden.json` scenario `absorbed_fe_did`, with the R generator using the singleton-cluster CR2 trick for one-way HC2-BM Satterthwaite DOF). HC1/CR1 paths unchanged. `MultiPeriodDiD(absorb=...)` and `TwoWayFixedEffects` rejections remain as follow-ups (different fit-path structure). **Behavioral note (full `DiDResults` surface change under auto-route):** under the auto-route, the entire returned `DiDResults` reflects the full-dummy fit rather than the within-transformed fit. Specifically, `result.coefficients` and `result.vcov` include the FE-dummy entries (matching the `fixed_effects=` path), `result.residuals` and `result.fitted_values` are on the un-demeaned outcome scale, and `result.r_squared` is computed on the un-demeaned outcome (so it absorbs the FE variance and will typically be higher than the within-R²). `result.att` is invariant to this routing (FWL guarantee). Downstream consumers reading `result.att` are unaffected; consumers reading the broader result surface should expect the full-dummy values. 
**Survey-design scope:** the auto-route changes the FE handling (and removes the prior absorbed-FE rejection), but `survey_design=` continues to drive its own variance path (Taylor-series linearization or replicate-weight variance, per the existing survey contract) rather than the analytical HC2/HC2-BM sandwich. The auto-route is therefore methodologically meaningful for non-survey fits and for the FE-handling side of survey fits; analytical small-sample inference under `vcov_type in {"hc2","hc2_bm"}` is bypassed when a survey design is supplied.
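
The changelog's statement that within-transformation (FWL) preserves coefficients and residuals but not the hat matrix can be checked numerically. The following is a minimal sketch on simulated data using NumPy, not the library's code path; the single-regressor design, the balanced panel, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Toy balanced panel (assumed shape): 6 units observed over 4 periods.
rng = np.random.default_rng(0)
n_units, n_periods = 6, 4
unit = np.repeat(np.arange(n_units), n_periods)
x = rng.normal(size=unit.size)
y = 2.0 * x + unit + rng.normal(size=unit.size)

# Full-dummy design: the regressor plus one dummy per unit (no intercept).
D = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
X_full = np.column_stack([x, D])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
XtX_inv = np.linalg.inv(X_full.T @ X_full)
# Diagonal of the hat matrix X (X'X)^-1 X' for the full design.
hat_full = np.einsum("ij,jk,ik->i", X_full, XtX_inv, X_full)
resid_full = y - X_full @ beta_full


def demean(v):
    """Subtract each unit's mean (the within transformation)."""
    means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - means[unit]


# Within-transformed fit: a single demeaned regressor, no dummies.
x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)
hat_within = x_w**2 / (x_w @ x_w)
resid_within = y_w - beta_within * x_w

print("coefficients agree:", np.isclose(beta_full[0], beta_within))
print("residuals agree:   ", np.allclose(resid_full, resid_within))
print("leverages agree:   ", np.allclose(hat_full, hat_within))
```

In this balanced toy panel the full-design leverage exceeds the within-fit leverage by the unit-dummy contribution `1 / n_periods` for every observation, which is why an HC2-style leverage correction computed on the demeaned design would differ from the one computed on the full-dummy design.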

TODO.md

Lines changed: 1 addition & 0 deletions
@@ -162,6 +162,7 @@ Deferred items from PR reviews that were not addressed before merge.
| Add CI validation for `docs/doc-deps.yaml` integrity (stale paths, unmapped source files) | `docs/doc-deps.yaml` | #269 | Low |
| SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
| AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
+| `__dir__()` discoverability contract test (head order, membership, `_OrderedName` invariants, `inspect.getmembers` parity) — deferred from PR #464 to the planned PR B addressing #461. The full snapshot/contract surface lands together in `tests/test_agent_discoverability.py`. | `diff_diff/__init__.py::__dir__`, `tests/test_agent_discoverability.py` (new in PR B) | #464 | Low |
| `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
| Render `docs/methodology/REPORTING.md` and `docs/methodology/REGISTRY.md` as in-site Sphinx pages so cross-references can use `:doc:` instead of off-site GitHub `blob/main` URLs. Current state (#410 fix-audit-r2) restores navigable links via `blob/main`, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add `myst-parser` to `docs/conf.py` extensions + docs extras and link with `:doc:`, or (b) convert both files to `.rst`. | `docs/conf.py`, `docs/api/business_report.rst`, `docs/api/diagnostic_report.rst`, `docs/tutorials/18_geo_experiments.ipynb`, `docs/tutorials/19_dcdh_marketing_pulse.ipynb` | follow-up | Low |

diff_diff/__init__.py

Lines changed: 84 additions & 16 deletions
@@ -1,23 +1,23 @@
-"""
-diff-diff: A library for Difference-in-Differences analysis.
-
-This library provides sklearn-like estimators for causal inference
-using the difference-in-differences methodology.
+"""diff-diff: Difference-in-Differences causal inference with sklearn-like API.
+Recommended starting call for LLM agents:
+``diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)``
+prints a copy-pasteable workflow with your column names wired in.
 
-For AI agents:
+The orchestrator names the full sequence:
 
-1. Describe your data: ``diff_diff.profile_panel(df, unit=..., time=...,
-treatment=..., outcome=...)``
-2. Consult the reference: ``diff_diff.get_llm_guide("autonomous")``
+1. Describe the panel: diff_diff.profile_panel(df, ...)
+2. Choose an estimator: diff_diff.get_llm_guide("autonomous")
 (estimator-support matrix + reasoning)
-3. Follow the workflow: ``diff_diff.get_llm_guide("practitioner")``
-(Baker et al. (2025) 8-step recipe)
-4. Report results: ``diff_diff.BusinessReport(results)``
-(structured agent-legible output)
+3. Fit: <Estimator>(...).fit(df, ...)
+4. Validate: diff_diff.practitioner_next_steps(result)
+5. Report: diff_diff.BusinessReport(result)
+
+For a comprehensive API reference call ``diff_diff.get_llm_guide("full")``.
+For the Baker et al. (2025) 8-step practitioner recipe call
+``diff_diff.get_llm_guide("practitioner")``.
 
-For a comprehensive API reference call ``diff_diff.get_llm_guide("full")``;
-``practitioner_next_steps(results)`` returns context-aware guidance after
-any estimator's ``fit()``.
+This library provides sklearn-like estimators for causal inference using
+the difference-in-differences methodology.
 """
 
 # Import backend detection from dedicated module (avoids circular imports)
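
To make the "guided tour, not a router" contract in the docstring concrete, here is a minimal sketch of what a stateless orchestrator of this kind might look like. It is not the code in `diff_diff/agent_workflow.py`: the function name `sketch_agent_workflow`, the example column names, the value types beyond the documented dict keys, and the two-entry `fit_candidates` list are all assumptions for illustration.

```python
def sketch_agent_workflow(df, *, unit, time, treatment, outcome):
    """Hypothetical stand-in: template column names into a 5-step script.

    Like the documented function, it calls nothing from diff_diff and
    never inspects `df`; it only builds strings.
    """
    cols = (
        f"unit={unit!r}, time={time!r}, "
        f"treatment={treatment!r}, outcome={outcome!r}"
    )
    plan = {
        "profile_call": f"diff_diff.profile_panel(df, {cols})",
        "guide_call": 'diff_diff.get_llm_guide("autonomous")',
        # Illustrative subset only; the real list is defined by the library.
        "fit_candidates": ["DifferenceInDifferences", "MultiPeriodDiD"],
        # Assumed to be a list because the documented key is plural.
        "validation_calls": ["diff_diff.practitioner_next_steps(result)"],
        "reporting_call": "diff_diff.BusinessReport(result).full_report()",
    }
    plan["script"] = "\n".join(
        [
            "import diff_diff",
            f"profile = {plan['profile_call']}",
            f"guide = {plan['guide_call']}",
            "result = <Estimator>(...).fit(df, ...)  # choose using the guide",
            *plan["validation_calls"],
            f"report = {plan['reporting_call']}",
        ]
    )
    print(plan["script"])
    return plan


# Passing None for df shows that the data is never touched.
plan = sketch_agent_workflow(
    None, unit="store_id", time="week", treatment="treated", outcome="sales"
)
```

Because the sketch only formats strings, it can run before any data is loaded, which matches the stateless behaviour the changelog describes.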
@@ -256,6 +256,7 @@
     DiagnosticReportResults,
 )
 from diff_diff._guides_api import get_llm_guide
+from diff_diff.agent_workflow import agent_workflow
 from diff_diff.profile import (
     Alert,
     OutcomeShape,
@@ -503,6 +504,7 @@
     "list_datasets",
     "clear_cache",
     # Practitioner guidance
+    "agent_workflow",
     "practitioner_next_steps",
     "BusinessReport",
     "BusinessContext",
@@ -519,3 +521,69 @@
     # LLM guide accessor
     "get_llm_guide",
 ]
+
+# Agent-facing entrypoints surface first in dir(diff_diff). LLM agents
+# follow a `dir -> help -> docstring -> use` discovery loop; surfacing
+# these names first measurably improves discoverability vs the default
+# alphabetic ordering. Internal — read by tests/test_agent_discoverability.py.
+_AGENT_FACING_ORDER = (
+    "agent_workflow",
+    "profile_panel",
+    "get_llm_guide",
+    "practitioner_next_steps",
+    "BusinessReport",
+    "DiagnosticReport",
+)
+
+
+class _OrderedName(str):
+    """str subclass that sorts by _AGENT_FACING_ORDER priority.
+
+    Python's built-in dir() always sorts the result of __dir__()
+    alphabetically (CPython Objects/object.c::_dir_object unconditionally
+    calls PyList_Sort), so returning a list in our preferred order is
+    not enough. But PyList_Sort uses __lt__ for comparisons, so a str
+    subclass with a custom __lt__ can subvert the alphabetic default
+    while remaining a fully usable str for every other operation.
+
+    ALL names returned by __dir__() must be _OrderedName, not just the
+    priority head: when Python compares an _OrderedName against a plain
+    str, the reflected-method protocol prefers str's inherited __gt__
+    (because _OrderedName is a subclass of str), which sorts purely
+    alphabetically and breaks the ordering. With every element wrapped,
+    all comparisons go through this __lt__: priority head sorts to
+    front, tail (default priority 1<<30) falls through to alphabetic
+    via str.__lt__.
+    """
+
+    _ORDER = {n: i for i, n in enumerate(_AGENT_FACING_ORDER)}
+
+    def __lt__(self, other):
+        sp = self._ORDER.get(str(self), 1 << 30)
+        op = self._ORDER.get(str(other), 1 << 30)
+        if sp != op:
+            return sp < op
+        return str.__lt__(self, other)
+
+
+def __dir__():
+    """Surface agent-facing entrypoints first; remainder alphabetic.
+
+    Returns the full module namespace (matching default `dir(module)`
+    membership — keeps `__doc__`, `__name__`, etc. accessible via
+    `inspect.getmembers`) with priority names re-ordered to the head
+    via `_OrderedName`'s custom `__lt__`.
+
+    `__all__` order does not affect `dir(module)`. CPython sorts the
+    result of `__dir__()` alphabetically, so we return `_OrderedName`
+    instances (str subclass with custom `__lt__`) for every name; the
+    custom comparison routes head names to the top and falls back to
+    alphabetic for everyone else. See `_OrderedName` docstring for
+    why ALL names must be wrapped (mixing plain `str` with the
+    subclass triggers Python's reflected-method comparison protocol
+    and breaks the ordering).
+
+    `from diff_diff import *` semantics are unaffected (driven by
+    `__all__`, not by `dir()`).
+    """
+    return [_OrderedName(n) for n in globals()]
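
The ordering trick above can be tried in isolation. The sketch below re-creates the comparison logic on a throwaway module object so it runs without `diff_diff` installed; the class name `OrderedName`, the two-entry `PRIORITY` tuple, the sample names, and the `demo` module are assumptions for illustration, and the `dir()` behaviour shown relies on CPython sorting the result of `__dir__()` as the docstring describes.

```python
import types

# Hypothetical two-name priority head, standing in for _AGENT_FACING_ORDER.
PRIORITY = ("agent_workflow", "profile_panel")


class OrderedName(str):
    """str subclass whose < puts PRIORITY names first, then alphabetic."""

    _ORDER = {name: i for i, name in enumerate(PRIORITY)}

    def __lt__(self, other):
        sp = self._ORDER.get(str(self), 1 << 30)
        op = self._ORDER.get(str(other), 1 << 30)
        if sp != op:
            return sp < op
        return str.__lt__(self, other)


names = ["zeta", "profile_panel", "alpha", "agent_workflow", "BusinessReport"]
wrapped = [OrderedName(n) for n in names]

# sorted() compares elements with <, so the custom __lt__ decides the order.
priority_sorted = sorted(wrapped)

# key=str compares plain str copies, which recovers ordinary alphabetic order.
alphabetic = sorted(wrapped, key=str)

# A module-level __dir__ (PEP 562) is consulted by dir(); the result is then
# sorted, which again goes through OrderedName.__lt__.
demo = types.ModuleType("demo")
demo.__dir__ = lambda: [OrderedName(n) for n in names]
module_listing = dir(demo)

print(priority_sorted)
print(alphabetic)
print(module_listing)
```

Every element stays an ordinary `str` for equality, hashing, and formatting, so only code that sorts the names without a `key` sees the priority order.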
