Skip to content

Add agent-discoverability contract test (#461)#467

Open
igerber wants to merge 2 commits into
mainfrom
agent-discoverability-tests
Open

Add agent-discoverability contract test (#461)#467
igerber wants to merge 2 commits into
mainfrom
agent-discoverability-tests

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 19, 2026

Summary

What's locked by the new test file:

  1. __all__ membership of the 5 agent-facing primitives (agent_workflow, profile_panel, get_llm_guide, practitioner_next_steps, BusinessReport)
  2. dir(diff_diff) head matches _AGENT_FACING_ORDER in declared order — anchors to the tuple, not a slice length, so trims/expansions to the head are tracked
  3. dir() tail stays alphabetic when keyed by str (recovery key for tooling that re-sorts)
  4. dir() returns the full module namespace, not just __all__ — preserves __doc__/__name__/__file__ for inspect.getmembers consumers
  5. _OrderedName invariants: isinstance(_, str), str methods (.upper, ==, hash for dict keys, in, f-string)
  6. inspect.getmembers(diff_diff) parity with dir(diff_diff)
  7. Top-level __doc__ first non-blank paragraph names agent_workflow; the full doc names all 4 downstream primitives
  8. agent_workflow() output script references each canonical helper by name; every fit_candidates entry resolves on the diff_diff namespace
  9. Canonical estimator class names (CallawaySantAnna, ChaisemartinDHaultfoeuille, ContinuousDiD, DifferenceInDifferences, HeterogeneousAdoptionDiD, HonestDiD, ImputationDiD, PreTrendsPower, SunAbraham, TwoWayFixedEffects, WooldridgeDiD) remain importable
  10. Each agent-facing entrypoint stays callable

Closes #461 (snapshot variant). The live-agent regression test (spawning a cold-start Claude subprocess against a staggered-DiD task) remains a follow-up that depends on causal-llm-eval packaging its harness module — that path will be tracked as its own follow-up issue if not already.

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no estimator, math, variance, or default inference behavior changes.
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_agent_discoverability.py (new, 17 tests). All pass locally; existing tests/test_agent_workflow.py (23) and tests/test_guides.py (41) remain green for a combined 81/81.
  • Backtest / simulation / notebook evidence (if applicable): N/A — pure in-process inspection.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

New `tests/test_agent_discoverability.py` pins the agent-facing surface
introduced by PR #464 against future regression. Snapshot/static
assertions only — no live API calls, no subprocess, runs in the default
pytest suite.

What's locked:

1. `__all__` membership of agent_workflow / profile_panel /
   get_llm_guide / practitioner_next_steps / BusinessReport
   (catches export pruning).
2. `dir(diff_diff)` head-first ordering matches `_AGENT_FACING_ORDER`
   (catches drift in `_OrderedName.__lt__` or `__dir__()` regression).
3. `dir()` tail stays alphabetic when keyed by `str` (recovery key for
   downstream tooling that re-sorts).
4. `dir()` returns the FULL module namespace, not just `__all__`
   (preserves `__doc__` / `__name__` / `__file__` for
   `inspect.getmembers` consumers).
5. `_OrderedName` invariants: `isinstance(_, str)` holds, str methods
   work (upper, eq, hash, `in`, f-string).
6. Top-level `__doc__` first non-blank paragraph names
   `agent_workflow`; full doc text names the 4 downstream primitives.
7. `agent_workflow()` output script references each canonical helper
   by name; every `fit_candidates` entry resolves on the diff_diff
   namespace.
8. Canonical estimator class names (CallawaySantAnna,
   ChaisemartinDHaultfoeuille, ContinuousDiD, DifferenceInDifferences,
   HeterogeneousAdoptionDiD, HonestDiD, ImputationDiD, PreTrendsPower,
   SunAbraham, TwoWayFixedEffects, WooldridgeDiD) remain importable.
9. Each agent-facing entrypoint stays callable.

17 tests (12 standalone + 5 parametrize cells over the agent-facing
entrypoint names).

Closes #461 (snapshot variant). The live-agent regression test
remains a follow-up that depends on causal-llm-eval packaging its
harness module. Also closes the `__dir__()` contract-test row from
PR #464's TODO.md (deferred there, landed here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. This PR only adds a contract test plus CHANGELOG.md/TODO.md updates; it does not change estimator code, identification logic, weighting, variance/SE computation, or defaults. I cross-checked the affected surface against the Methodology Registry and the relevant package/docstring contracts in diff_diff/__init__.py and diff_diff/agent_workflow.py.

Executive Summary

  • No methodology-governed implementation changed; the estimator/inference contracts in docs/methodology/REGISTRY.md:L40-L178 are not implicated by this diff.
  • P2: the new “full namespace + inspect.getmembers parity” claim is under-tested. The current parity assertion can miss dropped non-dunder names while the PR closes the deferred TODO and advertises the contract as pinned (tests/test_agent_discoverability.py:L118-L144, CHANGELOG.md:L11-L14).
  • P3: DiagnosticReport is part of the surfaced head order in _AGENT_FACING_ORDER, but the new explicit public-surface/callability assertions exclude it (diff_diff/__init__.py:L529-L536, tests/test_agent_discoverability.py:L39-L54, tests/test_agent_discoverability.py:L267-L288).
  • The changelog/TODO edits are otherwise consistent with the new test file.
  • I could not execute pytest in this environment: pytest is unavailable, and importing diff_diff fails here because numpy is not installed.

Methodology

No findings. The diff only touches tests/test_agent_discoverability.py, CHANGELOG.md, and TODO.md; no estimator/math/variance/default-behavior path covered by docs/methodology/REGISTRY.md:L40-L178 was modified.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the test-coverage gaps noted below.

Tech Debt

  • Severity: P2
    Impact: The PR removes the deferred contract-test item and advertises “dir() full-namespace + inspect.getmembers parity” as covered, but test_getmembers_parity_with_default_module_dir() is not an independent membership check. In CPython, inspect.getmembers() derives its name list from dir(), so dir_names == gm_names at tests/test_agent_discoverability.py:L137-L140 can still pass if __dir__() drops non-head, non-dunder names. Because test_dir_returns_full_module_namespace() only spot-checks four dunders at tests/test_agent_discoverability.py:L126-L128, the full-namespace contract described in CHANGELOG.md:L11-L14 is not actually pinned.
    Concrete fix: Add an explicit membership assertion against the real module namespace, e.g. compare set(map(str, dir(diff_diff))) to set(vars(diff_diff)) (or an explicit allowlisted equivalent), and keep the inspect.getmembers() check as a secondary accessibility/value check rather than the primary membership proof.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: _AGENT_FACING_ORDER treats DiagnosticReport as part of the surfaced agent-facing head (diff_diff/__init__.py:L529-L536), and the changelog advertises it that way (CHANGELOG.md:L14), but the new explicit __all__ and callability checks only cover five names and omit DiagnosticReport (tests/test_agent_discoverability.py:L45-L54, tests/test_agent_discoverability.py:L267-L288). A future regression that leaves DiagnosticReport in dir(diff_diff) but de-exports it or makes it non-callable would not fail this new contract suite.
    Concrete fix: Either add DiagnosticReport to the explicit public-surface assertions, or narrow the changelog/contract wording so the test is only claiming to pin the five workflow primitives.

Verification note: static review only; I could not run the new test file in this environment because pytest is missing and diff_diff cannot be imported without numpy.

…overage)

CI codex flagged 2 actionable items on the initial push:

P2 (Tech Debt): `test_getmembers_parity_with_default_module_dir()` was
tautological. CPython's `inspect.getmembers()` derives its name list
from `dir(obj)`, so the `dir_names == gm_names` check was always true
regardless of how narrow `__dir__()` became. The "full namespace"
contract advertised in the CHANGELOG was therefore not actually
pinned.

Fix: split into two tests with independent assertions:
- `test_dir_returns_full_module_namespace` now compares
  `set(map(str, dir(diff_diff)))` against `set(vars(diff_diff))` —
  vars() reads the real underlying module dict and `__dir__` does NOT
  derive from it automatically, so this catches a regression where
  `__dir__` is reduced to `__all__`.
- `test_getmembers_returns_accessible_values` retains the `getattr`
  + `__doc__` accessibility check as a secondary contract (every
  reported name must actually resolve to a value).

P3 (Docs/Tests): `_AGENT_FACING_ORDER` includes `DiagnosticReport`,
but the explicit `__all__` and callability tests covered only 5 of
the 6 names. A regression that left `DiagnosticReport` in `dir()` but
de-exported it would not have failed.

Fix: anchor both assertions to `_AGENT_FACING_ORDER` itself rather
than duplicating a hard-coded list:
- `test_agent_facing_names_in_all` now derives `required` from
  `_AGENT_FACING_ORDER`.
- `test_agent_facing_entrypoint_callable` parametrizes over
  `sorted(_AGENT_FACING_ORDER)` directly.
Future head-name additions automatically extend coverage instead of
needing two sources of truth.

Tests: 17 → 18 (DiagnosticReport added to callable parametrize).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: da184114f986c93cec0874dd5b7b798f65b8302f


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review: both prior findings are addressed. The new test_dir_returns_full_module_namespace() independently compares dir(diff_diff) to vars(diff_diff), removing the earlier tautological inspect.getmembers()-only check (tests/test_agent_discoverability.py:L113-L137).
  • Re-review: DiagnosticReport is now covered by the explicit public-surface assertions because both the __all__ and callability tests are anchored to _AGENT_FACING_ORDER, which includes it (tests/test_agent_discoverability.py:L39-L49, L278-L290; diff_diff/__init__.py:L507-L523, L529-L536).
  • No estimator, identification, weighting, variance/SE, or default-behavior code changed. The diff is limited to a new contract test plus CHANGELOG.md / TODO.md updates (CHANGELOG.md:L11-L14, TODO.md:L159-L166), so the Methodology Registry is not implicated.
  • The new assertions line up with the existing discoverability contract in diff_diff.__dir__() and agent_workflow() (diff_diff/__init__.py:L525-L589; diff_diff/agent_workflow.py:L124-L135, L153-L160, L212-L220).
  • Verification was static only: I could not run pytest or import diff_diff in this environment because numpy, pandas, and pytest are unavailable.

Methodology

  • Severity: None. Impact: No affected method in docs/methodology/REGISTRY.md; this PR does not modify estimator/reporting implementations, only discoverability regression coverage (docs/methodology/REGISTRY.md:L3791-L3800). Concrete fix: None.

Code Quality

  • Severity: None. Impact: The new test is anchored to source-of-truth symbols (_AGENT_FACING_ORDER, vars(diff_diff)) rather than duplicated literals, which reduces drift risk and resolves the earlier review gaps (tests/test_agent_discoverability.py:L39-L49, L80-L137, L278-L290). Concrete fix: None.

Performance

  • Severity: None. Impact: Added checks are lightweight import-time/introspection assertions over tiny in-memory objects; no production runtime path is changed (tests/test_agent_discoverability.py:L228-L290). Concrete fix: None.

Maintainability

  • Severity: None. Impact: Removing the deferred TODO row is justified because the new test now covers the deferred __dir__() contract, including the previously missing full-namespace proof (tests/test_agent_discoverability.py:L113-L155; TODO.md:L159-L166). Concrete fix: None.

Tech Debt

  • Severity: None. Impact: No new untracked debt is introduced by this diff. Concrete fix: None.

Security

  • Severity: None. Impact: The PR adds no subprocess, network, secret-handling, or PR-controlled execution behavior; the test remains purely in-process introspection (tests/test_agent_discoverability.py:L1-L21). Concrete fix: None.

Documentation/Tests

  • Severity: None. Impact: The changelog/TODO edits are consistent with the new test coverage, and the prior re-review findings are closed (CHANGELOG.md:L11-L14; tests/test_agent_discoverability.py:L113-L137, L278-L290). Concrete fix: None.
  • Severity: P3. Impact: Residual risk is limited to unexecuted verification in this environment; I could not run the test suite because numpy, pandas, and pytest are missing locally. Concrete fix: Run tests/test_agent_discoverability.py in the project’s normal CI environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add cold-start agent regression test to CI

1 participant