test(comply): regression guard for extractFailures attribution parity (#1708) #1715
Merged
…#1708)

Locks the post-7.1.0 attribution invariants so future refactors of comply()'s extractFailures can't silently reintroduce the BidMachine misattribution shape (adcp#4419).

What's locked:

1. A storyboard step carrying both a synthesized response_schema failure (prepended by the runner per #1709 / PR #1712) and an assertion entry surfaces validation.check === 'response_schema' in ComplianceResult.failures, never 'assertion'. This is the attribution that was silently broken before 7.1.0, when Zod rejects fell through to the next invariant (canonically context.no_secret_echo).
2. Skipped-invariant markers (passed: true entries the runner emits when short-circuiting invariants downstream of a schema failure per #1712) are filtered out; only failed validations surface in `failures`. A future change that included passed: true entries would crowd out the real failure.
3. A clean BidMachine-shape response (structured authorization field passing no_secret_echo per #1713 / PR #1714) produces zero failures through the aggregation layer.
4. Multi-storyboard aggregation preserves per-storyboard (storyboard_id, step_id, validation.check) tuples.
5/6. Clean pass paths (no failures, skipped steps) produce empty failures.

API change (minor): extractFailures (previously file-internal) is now exported from src/lib/testing/compliance/comply.ts so the regression test can call it directly with synthetic StoryboardResult fixtures. Functionally identical; only visibility changes.

Scope correction relative to the original #1708 framing: the "cross-evaluator divergence" symptom was version-driven (different @adcp/sdk versions hitting #1713 and #1709 differently), not a true parity gap. Fixes for both root causes shipped in 7.1.0; this test is the durable guard for the aggregation-layer invariants those fixes depend on.

7/7 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
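The aggregation invariants listed in this commit can be illustrated with a small, self-contained sketch. `extractFailuresSketch` and the interfaces below are hypothetical stand-ins modeled on the names in this PR, not the SDK's actual `extractFailures`, `StoryboardResult` or `ComplianceFailure` types. The sketch assumes one failure per failed step, attributed to the first failed validation on that step; that is one way to satisfy the locked invariants, not necessarily how the shipped function does it.

```typescript
// Hypothetical shapes loosely modeled on the names used in this PR. The real
// StoryboardResult / ComplianceFailure types in @adcp/sdk may differ.
interface Validation {
  check: string; // e.g. 'response_schema' or 'assertion'
  passed: boolean;
  description?: string;
}

interface StepResult {
  step_id: string;
  passed: boolean;
  skipped?: boolean;
  validations: Validation[];
}

interface StoryboardResult {
  storyboard_id: string;
  steps: StepResult[];
}

interface ComplianceFailure {
  storyboard_id: string;
  step_id: string;
  validation: Validation;
}

// Hypothetical stand-in for extractFailures: one failure per failed step,
// attributed to the first failed validation on that step.
function extractFailuresSketch(results: StoryboardResult[]): ComplianceFailure[] {
  const failures: ComplianceFailure[] = [];
  for (const sb of results) {
    for (const step of sb.steps) {
      // Skipped steps never surface, even if they carry passed: false.
      if (step.skipped) continue;
      // Drop passed: true entries, including the runner's skip markers.
      const failed = step.validations.filter((v) => !v.passed);
      // The runner prepends response_schema, so taking the first failed entry
      // keeps a schema reject from being misattributed to a later assertion.
      const primary = failed[0];
      if (primary) {
        failures.push({ storyboard_id: sb.storyboard_id, step_id: step.step_id, validation: primary });
      }
    }
  }
  return failures;
}

// Synthetic fixtures: failed A, clean B, failed C (plus a skipped step in C).
const results: StoryboardResult[] = [
  {
    storyboard_id: 'A',
    steps: [{
      step_id: 'get_products',
      passed: false,
      validations: [
        { check: 'response_schema', passed: false },
        { check: 'assertion', passed: true, description: 'example_invariant: skipped' },
        { check: 'assertion', passed: false },
      ],
    }],
  },
  {
    storyboard_id: 'B',
    steps: [{ step_id: 'clean', passed: true, validations: [{ check: 'response_schema', passed: true }] }],
  },
  {
    storyboard_id: 'C',
    steps: [
      { step_id: 'assert_fail', passed: false, validations: [{ check: 'assertion', passed: false }] },
      { step_id: 'not_run', passed: false, skipped: true, validations: [{ check: 'assertion', passed: false }] },
    ],
  },
];

const failures = extractFailuresSketch(results);
console.log(failures.map((f) => `${f.storyboard_id}/${f.step_id}/${f.validation.check}`).join(', '));
// prints "A/get_products/response_schema, C/assert_fail/assertion"
```

Feeding synthetic fixtures straight into the function, with no mock server in the loop, is the same reason given for exporting `extractFailures`.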
Closes #1708.
Why this PR exists
The original #1708 framing, "cross-evaluator divergence (comply suite vs CLI runner) is a load-bearing higher-order bug", was resolved differently than expected. The 27-point delta turned out to be version-driven, not evaluator-driven: the comply suite was on an older `@adcp/sdk` that hit two bugs the CLI runner's newer SDK didn't. Fixes for both bugs shipped in `7.1.0`:
- Zod validation rejects surfaced as failures on unrelated downstream assertions (#1709).
- The `no_secret_echo` invariant flagged the spec-legitimate `authorization` field regardless of value (#1713).

BidMachine retested on 7.1.0 (receipt) and the delta collapsed.
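The `no_secret_echo` fix can be sketched as a field-level rule. Per the title of #1714, the invariant now only fails on string-valued suspect-named fields, so a structured `authorization` value passes. The suspect-name list, the flat top-level scan, and the response shapes below are assumptions for illustration; the real invariant also has to decide what counts as an echo, which this sketch does not model.

```typescript
// Hypothetical suspect-name list, for illustration only; the real invariant's
// list is not given in this PR.
const SUSPECT_FIELD_NAMES = new Set(['authorization', 'api_key', 'password', 'secret']);

type FieldRule = (field: string, value: unknown) => boolean;

// Pre-7.1.0 behaviour as described here: a suspect-named field fails regardless of value.
const flagsBeforeFix: FieldRule = (field) => SUSPECT_FIELD_NAMES.has(field);

// Post-#1714 behaviour per that PR's title: only string-valued suspect-named fields fail.
const flagsAfterFix: FieldRule = (field, value) =>
  SUSPECT_FIELD_NAMES.has(field) && typeof value === 'string';

// Returns the top-level field names a rule would flag (no recursion, no echo comparison).
function offendingFields(response: Record<string, unknown>, rule: FieldRule): string[] {
  return Object.entries(response)
    .filter(([field, value]) => rule(field, value))
    .map(([field]) => field);
}

// Hypothetical response shapes: a structured authorization object vs. a raw string.
const structured = { status: 'ok', authorization: { scheme: 'oauth2', scopes: ['read'] } };
const rawString = { status: 'ok', authorization: 'Bearer abc123' };

console.log(offendingFields(structured, flagsBeforeFix).join(',')); // prints "authorization"
console.log(offendingFields(structured, flagsAfterFix).join(','));  // prints an empty line
console.log(offendingFields(rawString, flagsAfterFix).join(','));   // prints "authorization"
```

Keying the narrowing on the value's type rather than on the field name keeps a raw string secret under the same name flagged while letting the structured form through.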
So #1708 doesn't need a "parity smoke test against a live agent"; it needs a regression guard for the aggregation layer so future refactors of `extractFailures` can't silently reintroduce the misattribution shape.

What's locked
`extractFailures` in `comply.ts` is what walks `StoryboardResult[]` and emits `ComplianceFailure[]` for the final report. Five invariants are now under test:

1. Schema-reject attribution preserved (#1709): when a step has both a `response_schema` failure (prepended by the runner) and an `assertion` failure (from a downstream invariant that ran), `extractFailures` surfaces `validation.check === 'response_schema'`. If a future refactor reorders the validations or picks the wrong one, this test fails.
2. Skip markers filtered out (#1712 invariant): when the runner short-circuits step-scope invariants and emits `{ check: 'assertion', passed: true, description: '<id>: skipped — ...' }` markers, `extractFailures` doesn't surface them. Only failed validations make the cut.
3. BidMachine-shape clean pass (#1713): a step whose `no_secret_echo` invariant passed (structured `authorization` value, post-#1714 narrowing) and whose schema check passed produces zero entries in `failures`. The aggregation must not synthesize spurious failures.
4. Multi-storyboard attribution: failed A, clean B, failed C produces exactly two `failures` entries with stable `storyboard_id` × `validation.check` tuples.
5. Skipped steps not counted: `step.skipped: true` entries don't surface as failures, even if they carry `passed: false` on some code path.

API change
`extractFailures` is now `export`-ed from `src/lib/testing/compliance/comply.ts` (it was file-internal). This is a visibility-only change; signature and behavior are unchanged. The export lets the parity test call it directly with synthetic `StoryboardResult` fixtures, avoiding the need to spin up a mock HTTP MCP server.

Test plan
- `npm run build` clean
- `npm run format:check` clean
- `node --test test/lib/comply-vs-storyboard-parity.test.js`: 7/7 tests pass across 5 describe blocks

Coordinated stance state after this PR
Part of the #1685 coordinated stance ("the SDK is a witness, not a translator").
🤖 Generated with Claude Code