
test(comply): regression guard for extractFailures attribution parity (#1708) #1715

Merged
bokelley merged 1 commit into main from bokelley/issue-1708-parity-guard
May 12, 2026

Conversation

@bokelley
Contributor

Closes #1708.

Why this PR exists

The original #1708 framing ("cross-evaluator divergence (comply suite vs CLI runner) is a load-bearing higher-order bug") got resolved differently than expected. The 27-point delta turned out to be version-driven, not evaluator-driven: the comply suite was on an older @adcp/sdk that hit two bugs (#1709 and #1713) the CLI runner's newer SDK didn't. The fixes for both (#1712 and #1714) shipped in 7.1.0.

BidMachine retested on 7.1.0 (receipt) and the delta collapsed.

So #1708 doesn't need a "parity smoke test against a live agent" — it needs a regression guard for the aggregation layer so future refactors of extractFailures can't silently reintroduce the misattribution shape.

What's locked

extractFailures in comply.ts is what walks StoryboardResult[] and emits ComplianceFailure[] for the final report. Five invariants are now under test:

  1. Schema-reject attribution preserved (Harness error attribution: Zod validation rejects surface as failures on unrelated downstream assertions #1709) — when a step has both a response_schema failure (prepended by the runner) and an assertion failure (a downstream invariant that ran), extractFailures surfaces validation.check === 'response_schema'. If a future refactor reorders or picks the wrong validation, this fails.

  2. Skip markers filtered out (#1712 invariant, "fix(runner): attribute Zod schema rejects to response_schema (#1709)") — when the runner short-circuits step-scope invariants and emits { check: 'assertion', passed: true, description: '<id>: skipped — ...' } markers, extractFailures doesn't surface them. Only failed validations make the cut.

  3. BidMachine-shape clean pass (no_secret_echo invariant flags spec-legitimate field names on structured-value fields #1713) — a step whose no_secret_echo invariant passed (structured authorization value, after the #1714 narrowing so that no_secret_echo only fails on string-valued suspect-named fields) and whose schema passed produces zero entries in failures. The aggregation must not synthesize spurious failures.

  4. Multi-storyboard attribution — failed A, clean B, failed C produces exactly two failures entries with stable storyboard_id × validation.check tuples.

  5. Skipped steps not counted. Entries with step.skipped: true don't surface as failures even if passed: false in some path.
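As a rough illustration of invariants 1, 2 and 5, here is a minimal sketch of an aggregation that behaves the way the list describes. It is not the code in comply.ts: the type shapes are trimmed-down guesses, extractFailuresSketch is a hypothetical stand-in, and the "one failure per step, attributed to the first failed validation" rule is an assumption inferred from the wording above.

```typescript
// Hypothetical shapes: the real StoryboardResult / ComplianceFailure types
// in @adcp/sdk likely carry more fields than shown here.
interface Validation {
  check: string; // e.g. 'response_schema' or 'assertion'
  passed: boolean;
  description: string;
}
interface StepResult {
  step_id: string;
  passed: boolean;
  skipped?: boolean;
  validations: Validation[];
}
interface StoryboardResult {
  storyboard_id: string;
  steps: StepResult[];
}
interface ComplianceFailure {
  storyboard_id: string;
  step_id: string;
  validation: Validation;
}

// Stand-in for extractFailures (assumed behavior, not the real implementation).
function extractFailuresSketch(results: StoryboardResult[]): ComplianceFailure[] {
  const failures: ComplianceFailure[] = [];
  for (const sb of results) {
    for (const step of sb.steps) {
      // Invariant 5: skipped steps never count; passing steps have nothing to report.
      if (step.skipped || step.passed) continue;
      // Invariants 1 and 2: take the first FAILED validation, so passed: true
      // skip markers are ignored and a prepended response_schema failure wins.
      const first = step.validations.find(v => !v.passed);
      if (first) {
        failures.push({ storyboard_id: sb.storyboard_id, step_id: step.step_id, validation: first });
      }
    }
  }
  return failures;
}

// Synthetic fixture: schema reject first, then a skip marker, then a failed assertion.
const mixed: StoryboardResult = {
  storyboard_id: 'sb_a',
  steps: [
    {
      step_id: 'get_products',
      passed: false,
      validations: [
        { check: 'response_schema', passed: false, description: 'schema reject' },
        { check: 'assertion', passed: true, description: 'context.no_secret_echo: skipped' },
        { check: 'assertion', passed: false, description: 'downstream invariant failed' },
      ],
    },
    { step_id: 'never_ran', passed: false, skipped: true, validations: [] },
  ],
};

const failures = extractFailuresSketch([mixed]);
console.log(failures.map(f => f.validation.check)); // a single entry: response_schema
```

If the stand-in picked the last failed validation instead of the first, the same fixture would report 'assertion', which is the misattribution shape the guard is meant to catch.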

API change

extractFailures is now exported from src/lib/testing/compliance/comply.ts (was file-internal). Visibility-only change; signature and behavior unchanged. The export lets the parity test call it directly with synthetic StoryboardResult fixtures, avoiding the need to spin up a mock HTTP MCP server.
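To show what "call it directly with synthetic fixtures" can look like for invariant 4, the sketch below builds three one-step storyboards with a small helper and compares the resulting tuples. Everything here is illustrative: storyboard and failureTuples are hypothetical helpers, the types are simplified, and standIn only mimics the assumed signature. In the real test the exported extractFailures would be passed in instead.

```typescript
// Simplified, hypothetical shapes (the real types may differ).
type Validation = { check: string; passed: boolean };
type Step = { step_id: string; passed: boolean; skipped?: boolean; validations: Validation[] };
type StoryboardResult = { storyboard_id: string; steps: Step[] };
type ComplianceFailure = { storyboard_id: string; step_id: string; validation: Validation };
type ExtractFn = (results: StoryboardResult[]) => ComplianceFailure[];

// Hypothetical fixture helper: a one-step storyboard built from its validations.
function storyboard(id: string, validations: Validation[]): StoryboardResult {
  const passed = validations.every(v => v.passed);
  return { storyboard_id: id, steps: [{ step_id: `${id}_step`, passed, validations }] };
}

// Reduce failures to comparable (storyboard_id, step_id, check) tuples.
function failureTuples(extract: ExtractFn, results: StoryboardResult[]): string[] {
  return extract(results).map(f => `${f.storyboard_id}/${f.step_id}/${f.validation.check}`);
}

// Stand-in with the assumed signature; swap in the real export when available.
const standIn: ExtractFn = results => {
  const out: ComplianceFailure[] = [];
  for (const sb of results) {
    for (const step of sb.steps) {
      if (step.skipped || step.passed) continue;
      const first = step.validations.find(v => !v.passed);
      if (first) out.push({ storyboard_id: sb.storyboard_id, step_id: step.step_id, validation: first });
    }
  }
  return out;
};

// Failed A, clean B, failed C.
const tuples = failureTuples(standIn, [
  storyboard('a', [{ check: 'response_schema', passed: false }]),
  storyboard('b', [{ check: 'response_schema', passed: true }, { check: 'assertion', passed: true }]),
  storyboard('c', [{ check: 'assertion', passed: false }]),
]);
console.log(tuples); // two entries: a/a_step/response_schema and c/c_step/assertion
```

Passing the function under test as a parameter keeps the fixtures and tuple comparison reusable, with no network or server setup involved.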

Test plan

  • npm run build clean
  • npm run format:check clean
  • node --test test/lib/comply-vs-storyboard-parity.test.js — 7/7 tests pass across 5 describe blocks

Coordinated stance state after this PR

| Repo | Item | State |
| --- | --- | --- |
| adcp-client | #1703 | merged (in 7.1.0) |
| adcp-client | #1705 | merged (in 7.1.0) |
| adcp-client | #1706 | merged (in 7.1.0) |
| adcp-client | #1712 (#1709) | merged (in 7.1.0) |
| adcp-client | #1714 (#1713) | merged (in 7.1.0) |
| adcp-client | #1707 | scope-corrected; parked for adopter demand |
| adcp-client | #1708 (this PR) | closes after merge; regression guard for the above |
| adcp-client | #1711 | closed (BidMachine retest confirmed fix) |

Part of the #1685 coordinated stance ("the SDK is a witness, not a translator").

🤖 Generated with Claude Code

…#1708)

Locks the post-7.1.0 attribution invariants so future refactors of
comply()'s extractFailures can't silently reintroduce the BidMachine
misattribution shape (adcp#4419).

What's locked:

1. A storyboard step carrying both a synthesized response_schema failure
   (prepended by the runner per #1709 / PR #1712) and an assertion entry
   surfaces validation.check === 'response_schema' in
   ComplianceResult.failures — never 'assertion'. The attribution that
   was silently broken pre-7.1.0 (Zod rejects fell through to the next
   invariant, canonically context.no_secret_echo).

2. Skipped-invariant markers (passed: true entries the runner emits when
   short-circuiting invariants downstream of a schema failure per #1712)
   are correctly filtered out — only failed validations surface in
   `failures`. A future change that included passed: true entries would
   crowd out the real failure.

3. A clean BidMachine-shape response (structured authorization field
   passing no_secret_echo per #1713 / PR #1714) produces zero failures
   through the aggregation layer.

4. Multi-storyboard aggregation preserves per-storyboard
   (storyboard_id, step_id, validation.check) tuples.

5/6. Clean pass paths (no failures, skipped steps) produce empty failures.
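For context on items 1 and 2, the runner-side behavior they depend on can be sketched roughly as follows. This is a simplified guess at the shape of the #1712 fix, not the runner's code: validateStep, Invariant, and the description strings are hypothetical, and the sketch skips every downstream invariant on a schema reject, whereas the real runner may still run some of them.

```typescript
// Hypothetical names and shapes throughout.
type Validation = { check: 'response_schema' | 'assertion'; passed: boolean; description: string };
// An invariant returns null on pass, or a failure reason.
type Invariant = { id: string; run: (response: unknown) => string | null };

function validateStep(
  response: unknown,
  schemaError: string | null,
  invariants: Invariant[],
): Validation[] {
  if (schemaError !== null) {
    // Schema reject: the real failure goes first, and downstream invariants are
    // recorded as passed: true skip markers instead of being run.
    const markers = invariants.map((inv): Validation => ({
      check: 'assertion',
      passed: true,
      description: `${inv.id}: skipped (response failed schema validation)`,
    }));
    return [{ check: 'response_schema', passed: false, description: schemaError }, ...markers];
  }
  // Schema passed: run each invariant normally.
  const out: Validation[] = [{ check: 'response_schema', passed: true, description: 'schema ok' }];
  for (const inv of invariants) {
    const reason = inv.run(response);
    out.push({ check: 'assertion', passed: reason === null, description: reason ?? `${inv.id}: ok` });
  }
  return out;
}

// An invariant that would fail if it ran against the malformed response.
const invariants: Invariant[] = [{ id: 'context.no_secret_echo', run: () => 'would have failed' }];
const validations = validateStep({}, 'expected string, received object', invariants);
console.log(validations.map(v => `${v.check}:${v.passed}`)); // response_schema:false, then assertion:true
```

With this ordering, an aggregator that takes the first failed validation reports response_schema, and the passed: true markers never reach `failures`.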

API change (minor): extractFailures (previously file-internal) is now
exported from src/lib/testing/compliance/comply.ts so the regression
test can call it directly with synthetic StoryboardResult fixtures.
Functionally identical; just visibility.

Scope correction relative to the original #1708 framing: the
"cross-evaluator divergence" symptom was version-driven (different
@adcp/sdk versions hitting #1713 and #1709 differently), not a true
parity gap. Both root causes shipped in 7.1.0; this test is the
durable guard for the aggregation-layer invariants those fixes
depend on.
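For readers unfamiliar with the #1713 root cause named above, the narrowing from PR #1714 (only string-valued suspect-named fields fail) comes down to a type check. The sketch below is an assumption-laden illustration: findSecretEchoes and the SUSPECT_NAMES list are hypothetical, and recursing into structured values is a guess about how nested fields are treated.

```typescript
// Hypothetical suspect-name list; the real invariant's list is not shown in this PR.
const SUSPECT_NAMES = new Set(['authorization', 'api_key', 'token', 'secret']);

// Returns the paths of suspect-named fields whose value is a string.
function findSecretEchoes(value: unknown, path: string = '$'): string[] {
  const hits: string[] = [];
  if (Array.isArray(value)) {
    for (let i = 0; i < value.length; i++) {
      hits.push(...findSecretEchoes(value[i], `${path}[${i}]`));
    }
    return hits;
  }
  if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      const childPath = `${path}.${key}`;
      if (SUSPECT_NAMES.has(key.toLowerCase()) && typeof child === 'string') {
        hits.push(childPath); // string-valued suspect field: flag it
      } else {
        // Structured value under a suspect name: the name alone is not a failure.
        hits.push(...findSecretEchoes(child, childPath));
      }
    }
  }
  return hits;
}

// BidMachine-shape response: authorization is a structured value, so nothing is flagged.
const structured = findSecretEchoes({ authorization: { type: 'bearer', scopes: ['read'] } });
// An echoed credential string is still caught.
const echoed = findSecretEchoes({ context: { authorization: 'Bearer abc123' } });
console.log(structured.length, echoed); // 0 hits, then the path $.context.authorization
```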

7/7 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit c566612 into main May 12, 2026
10 checks passed
@bokelley bokelley deleted the bokelley/issue-1708-parity-guard branch May 12, 2026 15:20

Development

Successfully merging this pull request may close these issues.

Evaluator divergence: comply suite and CLI runner produce materially different grades on the same agent
