igerber · igerber · May 19, 2026 · May 19, 2026 · May 19, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- **Agent-discoverability contract test (`tests/test_agent_discoverability.py`).** New static-snapshot test pinning the agent-facing surface introduced by PR #464: `__all__` membership of `agent_workflow` / `profile_panel` / `get_llm_guide` / `practitioner_next_steps` / `BusinessReport`; `dir(diff_diff)` head-first ordering against `_AGENT_FACING_ORDER` (catches drift in the `_OrderedName` `__lt__` ordering trick); `_OrderedName` `isinstance(_, str)` + str-method compatibility; `dir()` full-namespace + `inspect.getmembers` parity; top-level `__doc__` first-paragraph mention of `agent_workflow` + named references to the 5-step workflow primitives; `agent_workflow()` script content references each downstream helper by name; canonical estimator class names (CallawaySantAnna, ContinuousDiD, HeterogeneousAdoptionDiD, etc.) remain importable. No live API calls; runs in the default pytest suite. Closes [issue #461](https://github.com/igerber/diff-diff/issues/461) (snapshot variant — live-agent regression test deferred to a separate follow-up that depends on causal-llm-eval packaging its harness). Also closes the `__dir__()` contract-test row from `TODO.md` that PR #464 deferred here.
 - **`diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)` — stateless orchestrator for LLM-agent discoverability** (`diff_diff/agent_workflow.py`). Prints (and returns as dict) a copy-pasteable 5-step workflow with the caller's column names templated in: `profile_panel` → `get_llm_guide("autonomous")` → `<Estimator>(...).fit(df, ...)` → `practitioner_next_steps(result)` → `BusinessReport(result).full_report()`. The function calls nothing internally and does not inspect `df`; it is a guided tour, not a router. Surfaces the canonical workflow primitives (`profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`) that cold-start agent dry-passes at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) showed agents practically never reach for on their own. Output structure: `{"profile_call", "guide_call", "fit_candidates", "validation_calls", "reporting_call", "script"}`; `fit_candidates` is a flat list of estimator/diagnostic class names referenced in the workflow patterns (each must remain importable on `diff_diff`, locked by `tests/test_agent_workflow.py::test_fit_candidates_all_importable`). Closes [issue #460](https://github.com/igerber/diff-diff/issues/460).
 - **Top-level `__doc__` rewritten to lead with the agent workflow** (`diff_diff/__init__.py`). `help(diff_diff)` now opens with the `agent_workflow(df, ...)` recommendation as the first non-blank paragraph; `get_llm_guide("full")` and `get_llm_guide("practitioner")` pointers preserved for the existing `tests/test_guides.py::test_module_docstring_mentions_helper` guard.
 - **`dir(diff_diff)` now surfaces agent-facing entrypoints first** via a module-level `__dir__()` override paired with a small `_OrderedName(str)` subclass that subverts CPython's unconditional alphabetic sort (PyList_Sort respects `__lt__` on the elements). Agent-facing names (`agent_workflow`, `profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`, `DiagnosticReport`) appear at the head of the list; the remainder stays alphabetic via the `str.__lt__` fallback. The underlying `__all__` membership is **unchanged** and `from diff_diff import *` semantics are unaffected (driven by `__all__`, not `dir()`). Elements are `isinstance(x, str)` and compatible with `inspect.getmembers`, dict-key lookup, f-strings, and standard `str` methods; tooling that re-sorts via `sorted(dir(diff_diff))` will see priority order (use `sorted(dir(diff_diff), key=str)` to recover plain alphabetic if needed). Internal: `_AGENT_FACING_ORDER` tuple is read by the new `tests/test_agent_discoverability.py` contract test (PR B). Addresses [issue #460](https://github.com/igerber/diff-diff/issues/460) item 3.

diff --git a/TODO.md b/TODO.md
@@ -162,7 +162,6 @@ Deferred items from PR reviews that were not addressed before merge.
 | Add CI validation for `docs/doc-deps.yaml` integrity (stale paths, unmapped source files) | `docs/doc-deps.yaml` | #269 | Low |
 | SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
 | AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
-| `__dir__()` discoverability contract test (head order, membership, `_OrderedName` invariants, `inspect.getmembers` parity) — deferred from PR #464 to the planned PR B addressing #461. The full snapshot/contract surface lands together in `tests/test_agent_discoverability.py`. | `diff_diff/__init__.py::__dir__`, `tests/test_agent_discoverability.py` (new in PR B) | #464 | Low |
 | `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
 | Render `docs/methodology/REPORTING.md` and `docs/methodology/REGISTRY.md` as in-site Sphinx pages so cross-references can use `:doc:` instead of off-site GitHub `blob/main` URLs. Current state (#410 fix-audit-r2) restores navigable links via `blob/main`, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add `myst-parser` to `docs/conf.py` extensions + docs extras and link with `:doc:`, or (b) convert both files to `.rst`. | `docs/conf.py`, `docs/api/business_report.rst`, `docs/api/diagnostic_report.rst`, `docs/tutorials/18_geo_experiments.ipynb`, `docs/tutorials/19_dcdh_marketing_pulse.ipynb` | follow-up | Low |
 

diff --git a/tests/test_agent_discoverability.py b/tests/test_agent_discoverability.py
@@ -0,0 +1,290 @@
+"""Contract test for the agent-discoverability surface (issue #461).
+
+This is a static snapshot test of the four contract surfaces that PR
+#464 introduced for LLM-agent discovery:
+
+1. ``__all__`` membership of agent-facing primitives
+2. ``dir(diff_diff)`` head-first ordering (via the ``_OrderedName`` trick)
+3. Top-level ``__doc__`` content (first paragraph names the recommended
+   call; the 5-step workflow primitives all appear)
+4. ``agent_workflow()`` output references the canonical downstream
+   primitives by name
+
+It also locks the ``__dir__()`` invariants (head matches
+``_AGENT_FACING_ORDER``, tail is alphabetic by ``str``, module dunders
+are preserved, ``inspect.getmembers`` parity).
+
+Closes the ``__dir__`` contract-test deferral row from PR #464's
+``TODO.md``.
+
+No live API calls, no subprocess, no live agents — purely string/identity
+assertions runnable in the default ``pytest`` suite.
+"""
+
+from __future__ import annotations
+
+import inspect
+
+import pandas as pd
+import pytest
+
+import diff_diff
+from diff_diff import _AGENT_FACING_ORDER
+
+# ---------------------------------------------------------------------------
+# __all__ membership
+# ---------------------------------------------------------------------------
+
+
+def test_agent_facing_names_in_all():
+    """The named primitives must remain in the public API surface.
+
+    Catches an export pruning that would silently remove an agent-facing
+    name from ``from diff_diff import *``. Required set mirrors
+    ``_AGENT_FACING_ORDER`` so adding a name to the head order also
+    enforces its presence in ``__all__``.
+    """
+    required: set[str] = set(_AGENT_FACING_ORDER)
+    all_names: set[str] = set(diff_diff.__all__)
+    assert required <= all_names, f"missing from __all__: {required - all_names}"
+
+
+def test_estimator_class_names_importable():
+    """Class-name renames silently break agent recognition.
+
+    The canonical staggered estimators + the simple-2x2 case must remain
+    importable under their documented names; the orchestrator's Step 3
+    examples and ``llms-autonomous.txt`` routing matrix reference them
+    by these literal identifiers.
+    """
+    from diff_diff import (  # noqa: F401
+        CallawaySantAnna,
+        ChaisemartinDHaultfoeuille,
+        ContinuousDiD,
+        DifferenceInDifferences,
+        HeterogeneousAdoptionDiD,
+        HonestDiD,
+        ImputationDiD,
+        PreTrendsPower,
+        SunAbraham,
+        TwoWayFixedEffects,
+        WooldridgeDiD,
+    )
+
+
+# ---------------------------------------------------------------------------
+# __dir__() head-first ordering + _OrderedName invariants
+# ---------------------------------------------------------------------------
+
+
+def test_dir_head_matches_agent_facing_order():
+    """``dir(diff_diff)`` must surface ``_AGENT_FACING_ORDER`` at the
+    head, IN THE DECLARED ORDER.
+
+    Anchors to the contract (the override's curated tuple) rather than
+    a fixed slice length: if a future change adds or trims the head
+    tuple, this test follows it. Catches the failure mode where
+    ``__dir__()`` is dropped, mis-ordered, or where the
+    ``_OrderedName`` ``__lt__`` is broken.
+    """
+    names = dir(diff_diff)
+    head_size = len(_AGENT_FACING_ORDER)
+    assert names[:head_size] == list(_AGENT_FACING_ORDER), (
+        f"dir() head does not match _AGENT_FACING_ORDER. "
+        f"Got: {names[:head_size]!r}. "
+        f"Expected: {list(_AGENT_FACING_ORDER)!r}."
+    )
+
+
+def test_dir_tail_alphabetic_by_str():
+    """The non-head portion of ``dir()`` should stay alphabetic when
+    keyed by ``str``.
+
+    The ``_OrderedName`` head members compare with custom ``__lt__``
+    (priority then alphabetic); tail elements are plain strings sorted
+    by CPython's ``PyList_Sort``. ``sorted(tail, key=str)`` is the
+    canonical recovery key in case any downstream tooling re-sorts.
+    """
+    names = dir(diff_diff)
+    tail = names[len(_AGENT_FACING_ORDER) :]
+    assert tail == sorted(tail, key=str)
+
+
+def test_dir_returns_full_module_namespace():
+    """``dir(diff_diff)`` must enumerate the full module namespace.
+
+    Compared against ``vars(diff_diff)`` (the real underlying module
+    namespace, which `__dir__` does NOT derive from automatically) so
+    a regression that reduces `__dir__` to only `__all__` would fail
+    here. Without this check, a tautology was hiding: CPython's
+    `inspect.getmembers()` derives its name list FROM `dir(obj)`, so
+    `dir() == getmembers()` is always true regardless of how narrow
+    `__dir__` becomes.
+
+    The override returns ``[_OrderedName(n) for n in globals()]`` to
+    preserve `vars()`-equivalent coverage; this assertion locks that
+    contract.
+    """
+    dir_names = {str(n) for n in dir(diff_diff)}
+    vars_names = set(vars(diff_diff))
+    assert dir_names == vars_names, (
+        f"dir() is not equal to vars() namespace. "
+        f"In dir() only: {sorted(dir_names - vars_names)[:8]!r}; "
+        f"In vars() only: {sorted(vars_names - dir_names)[:8]!r}."
+    )
+    # Spot-check the dunders that downstream tooling expects.
+    for dunder in ("__doc__", "__name__", "__file__", "__all__"):
+        assert dunder in dir_names, f"{dunder!r} missing from dir() output"
+
+
+def test_getmembers_returns_accessible_values():
+    """``inspect.getmembers(diff_diff)`` is the standard agent-facing
+    introspection call. Its name list is derived from ``dir()`` (see
+    `test_dir_returns_full_module_namespace` for the independent
+    membership check); here we verify every reported name produces an
+    accessible value with no AttributeError, AND that the steering
+    surface (`__doc__`) is reachable through it.
+    """
+    members = dict(inspect.getmembers(diff_diff))
+    # Every name must resolve to an accessible value.
+    for name in members:
+        getattr(diff_diff, name)
+    # And the steering surface must be accessible + content-correct.
+    assert diff_diff.__doc__ is not None
+    assert "agent_workflow" in diff_diff.__doc__.lower()
+    assert members["__doc__"] is diff_diff.__doc__
+
+
+# ---------------------------------------------------------------------------
+# _OrderedName subclass invariants
+# ---------------------------------------------------------------------------
+
+
+def test_ordered_name_isinstance_str():
+    """Every ``dir()`` element must still be ``isinstance(..., str)`` so
+    consumers that type-check don't break.
+    """
+    for name in dir(diff_diff):
+        assert isinstance(
+            name, str
+        ), f"dir() element {name!r} is type {type(name).__name__}, not a str subclass"
+
+
+def test_ordered_name_str_methods_work():
+    """The head ``_OrderedName`` instances must support all the str
+    operations downstream tooling relies on (upper, eq, hash for dict
+    keys, ``in`` membership, f-string interpolation).
+    """
+    head = dir(diff_diff)[: len(_AGENT_FACING_ORDER)]
+    for n in head:
+        assert n.upper() == str(n).upper()
+        assert n == str(n)
+        assert {n: 1}.get(n) == 1
+        assert n in [str(n)]
+        assert f"{n}" == str(n)
+
+
+# ---------------------------------------------------------------------------
+# __doc__ first-paragraph contract
+# ---------------------------------------------------------------------------
+
+
+def test_doc_first_paragraph_names_agent_workflow():
+    """``help(diff_diff)`` opens with ``__doc__``; the first non-blank
+    paragraph must name ``agent_workflow``.
+
+    Catches a docstring rewrite that drops the recommended-call hint
+    from the top-of-help surface.
+    """
+    doc = diff_diff.__doc__
+    assert doc is not None
+    first_block = doc.strip().split("\n\n")[0]
+    assert "agent_workflow" in first_block.lower()
+
+
+def test_doc_names_canonical_workflow_helpers():
+    """The full 5-step workflow's primitive names must remain reachable
+    from ``help(diff_diff)``.
+
+    Catches a docstring trim that removes references to the downstream
+    helpers an agent following the doc would call next.
+    """
+    assert diff_diff.__doc__ is not None
+    doc_lower = diff_diff.__doc__.lower()
+    for name in (
+        "profile_panel",
+        "get_llm_guide",
+        "practitioner_next_steps",
+        "businessreport",
+    ):
+        assert name in doc_lower, f"{name!r} missing from __doc__"
+
+
+# ---------------------------------------------------------------------------
+# agent_workflow() output references the canonical primitives
+# ---------------------------------------------------------------------------
+
+
+def test_agent_workflow_output_names_canonical_helpers():
+    """Calling ``agent_workflow()`` must still produce a script that
+    names the four downstream primitives. Catches the orchestrator
+    content drifting away from the helpers it advertises.
+    """
+    df = pd.DataFrame({"u": [1], "t": [0], "tr": [0], "y": [0.0]})
+    out = diff_diff.agent_workflow(
+        df,
+        unit="u",
+        time="t",
+        treatment="tr",
+        outcome="y",
+        verbose=False,
+    )
+    for name in (
+        "profile_panel",
+        "get_llm_guide",
+        "practitioner_next_steps",
+        "BusinessReport",
+    ):
+        assert name in out["script"], f"{name!r} missing from agent_workflow script"
+
+
+def test_agent_workflow_fit_candidates_resolve_on_diff_diff():
+    """Every estimator advertised in ``agent_workflow().fit_candidates``
+    must be a real attribute on the ``diff_diff`` namespace.
+
+    Mirrors the per-PR test in ``test_agent_workflow.py``; here we
+    re-assert as part of the discoverability contract so a rename
+    that escapes the per-PR suite is still caught at the surface
+    level.
+    """
+    df = pd.DataFrame({"u": [1], "t": [0], "tr": [0], "y": [0.0]})
+    out = diff_diff.agent_workflow(
+        df,
+        unit="u",
+        time="t",
+        treatment="tr",
+        outcome="y",
+        verbose=False,
+    )
+    missing = [n for n in out["fit_candidates"] if not hasattr(diff_diff, n)]
+    assert not missing, f"fit_candidates not on diff_diff namespace: {missing}"
+
+
+# ---------------------------------------------------------------------------
+# Cross-surface sanity (all four agent-facing entrypoints callable)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize("name", sorted(_AGENT_FACING_ORDER))
+def test_agent_facing_entrypoint_callable(name):
+    """Each name in ``_AGENT_FACING_ORDER`` must remain a callable
+    attribute on the top-level package.
+
+    Catches an accidental replacement of one of these names with a
+    module or constant (which would silently break the agent's
+    ``help(name)`` follow-up). Anchored to ``_AGENT_FACING_ORDER`` so
+    adding a new head name automatically extends coverage rather than
+    needing two sources of truth.
+    """
+    obj = getattr(diff_diff, name)
+    assert callable(obj), f"{name!r} is not callable on the diff_diff namespace"