
fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys #4364

Merged
bokelley merged 3 commits into main from bokelley/fix-storyboard-status-scenario-keys
May 11, 2026

Conversation

@bokelley
Contributor

Summary

The compliance heartbeat has been writing zero rows to agent_storyboard_status since comply() switched to storyboard-driven testing. Every agent's dashboard reading of storyboards_passing: 0/N was misleading: the runner wasn't failing storyboards, the parser was dropping their results.

Root cause

SDK 6.x emits one TestResult per phase of each storyboard, keyed <storyboard_id>/<phase_id> in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts:54). The old deriveStoryboardStatuses walked the YAML's per-step comply_scenario field (bare names like signals_flow, capability_discovery) and looked them up in a Map keyed by the SDK's scenario strings. Every lookup missed → testedCount === 0 → every storyboard skipped at the continue guard. No rows written. No badges issued.
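
A minimal sketch of the key mismatch described above, reduced to a few lines. The phase ids (`setup`, `delivery`) and the map's value shape are illustrative assumptions, not taken from the SDK; only the `<storyboard_id>/<phase_id>` key format and the bare names come from this PR.

```typescript
// Hypothetical reduction of the bug. The data here is made up for illustration.
// The SDK keys each result as "<storyboard_id>/<phase_id>".
const sdkScenarios = new Map<string, { passed: boolean }>([
  ["media_buy_seller/setup", { passed: true }],    // phase id is an assumption
  ["media_buy_seller/delivery", { passed: true }], // phase id is an assumption
]);

// Bare names, as found in the YAML's per-step comply_scenario field.
const yamlComplyScenarios = ["signals_flow", "capability_discovery"];

// Old approach: look each bare name up in the SDK-keyed map.
const testedCount = yamlComplyScenarios.filter((name) => sdkScenarios.has(name)).length;

console.log(testedCount); // 0: every lookup misses, so the storyboard is skipped
```

Because no bare name can ever equal a slash-delimited key, the miss is unconditional rather than intermittent, which is consistent with zero heartbeat rows in prod.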

Scope

Direct DB query against prod:

agent_storyboard_status total rows: 6 (across 4 agents)
rows written by triggered_by='heartbeat': 0
existing rows: legacy bare-name keys from old manual runs (signals_baseline,
              capability_discovery, behavioral_analysis)

No heartbeat run has ever written to agent_storyboard_status. Affects every agent's badge eligibility, every specialism_status value, every storyboards_passing count.

Surfaced by escalation #329 — Evgeny's agent runs 30/30 scenarios clean but shows degraded because specialism_status.signal-owned = "untested" reads from a never-populated row.

Fix

Read SDK output directly. Group result.tracks[].scenarios[] by <storyboard_id> parsed from the scenario string, roll per-step pass counts up from each phase's steps array, fall back to phase-level counts when steps are absent. storyboardIds override is preserved for explicit-IDs callers (manual evals that need an untested entry when the runner didn't run a requested storyboard).
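
The grouping and roll-up described above can be sketched roughly as follows. This is not the code in compliance-testing.ts: the interfaces, the function name `deriveStatusesSketch`, and the output field `storyboard_id` are assumptions inferred from the PR text, and treating an empty `steps` array like an absent one is a choice made for this sketch.

```typescript
// Assumed shapes, inferred from the PR description rather than the SDK's real types.
interface StepResult { passed: boolean }
interface ScenarioResult { scenario: unknown; passed?: boolean; steps?: StepResult[] }
interface ComplyResult { tracks?: { scenarios?: ScenarioResult[] }[] }

type Status = "passing" | "partial" | "failing" | "untested";
interface StoryboardStatus {
  storyboard_id: string;
  status: Status;
  steps_passed: number;
  steps_total: number;
}

function deriveStatusesSketch(result: ComplyResult, storyboardIds?: string[]): StoryboardStatus[] {
  const counts = new Map<string, { passed: number; total: number }>();

  for (const track of result.tracks ?? []) {
    for (const s of track.scenarios ?? []) {
      if (typeof s.scenario !== "string") continue; // non-string keys are skipped
      const slash = s.scenario.indexOf("/");
      if (slash <= 0) continue; // legacy bare names carry no storyboard id
      const id = s.scenario.slice(0, slash);
      const c = counts.get(id) ?? { passed: 0, total: 0 };
      if (s.steps && s.steps.length > 0) {
        // Per-step data present: roll real step counts up.
        c.passed += s.steps.filter((st) => st.passed).length;
        c.total += s.steps.length;
      } else {
        // Fallback: count the phase itself as one unit.
        c.passed += s.passed ? 1 : 0;
        c.total += 1;
      }
      counts.set(id, c);
    }
  }

  // Explicit ids win; otherwise emit whatever the runner actually ran.
  const toEmit = storyboardIds && storyboardIds.length > 0 ? storyboardIds : Array.from(counts.keys());

  return toEmit.map((id): StoryboardStatus => {
    const c = counts.get(id);
    if (!c || c.total === 0) {
      return { storyboard_id: id, status: "untested", steps_passed: 0, steps_total: 0 };
    }
    const status: Status = c.passed === c.total ? "passing" : c.passed === 0 ? "failing" : "partial";
    return { storyboard_id: id, status, steps_passed: c.passed, steps_total: c.total };
  });
}
```

Note that in this sketch a storyboard's totals can mix real step counts with phase-level fallbacks, so the resulting numbers are not directly comparable across rows.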

The YAML's comply_scenario field is no longer load-bearing for status mapping — the SDK already knows which storyboards it ran. Field is left in place (still useful for human documentation / planning).

Tests

server/tests/unit/derive-storyboard-statuses.test.ts — 9 cases:

  • all-pass → status='passing', step counts roll up
  • mixed pass/fail across phases → 'partial'
  • all-fail → 'failing'
  • phases without steps array → fall back to phase-level counts
  • legacy bare-name scenarios → skipped
  • empty input → []
  • explicit storyboardIds with runner gap → 'untested' entry
  • explicit storyboardIds ignoring extra storyboards in result
  • multi-storyboard heartbeat result

All passing locally; type check clean.

Stack note

Orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which is about which tables carry compliance state (collapsing agent_test_history). This PR is about parsing the SDK output correctly into the existing tables. Different files (compliance-testing.ts vs her compliance-db.ts / member-tools.ts / registry-api.ts); rebases cleanly in either order.

Follow-ups

Once merged + deployed, the next heartbeat tick will start populating agent_storyboard_status properly for all 18 registered agents. A heartbeat sweep cycle drains in ~2h. After that, badge issuance via processAgentBadges should re-fire on the next heartbeat for any agent whose declared specialisms now have passing storyboard rows.

🤖 Generated with Claude Code

bokelley and others added 3 commits May 11, 2026 04:08
…o keys

The compliance heartbeat has been writing zero rows to
agent_storyboard_status since the SDK switched comply() to storyboard-
driven testing. The SDK emits one TestResult per phase of each storyboard,
keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario
(see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation
walked the YAML's per-step `comply_scenario` field (bare names like
`signals_flow`, `capability_discovery`) and looked them up in the SDK's
scenario map. Every lookup missed → testedCount === 0 → every storyboard
skipped at the `continue` guard.

Effect across the registry:
  agent_storyboard_status total rows: 6  (across 4 agents)
  rows written by triggered_by='heartbeat': 0
  rows surviving were legacy bare-name keys from old manual runs

This silently broke the AAO Verified badge pipeline (no storyboard rows
→ deriveVerificationStatus has nothing to verify against) and every
agent's dashboard `storyboards_passing: 0 / N` was misleading: the
runner wasn't failing storyboards, the parser was dropping them.

Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios
clean but showing `degraded` because specialism_status.signal-owned read
'untested' from a never-populated agent_storyboard_status row.

Fix: read SDK output directly. Group scenarios by storyboard id, roll
per-step pass counts up from each phase's `steps` array, fall back to
phase-level counts when steps are absent. The `storyboardIds` override
is preserved for explicit-IDs callers that need an `untested` entry
when the runner didn't run a requested storyboard. The unused YAML
`comply_scenario` field is no longer load-bearing for status mapping
(the SDK already knows which storyboards it ran).

Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback,
legacy bare-name skip, empty input, and explicit-IDs untested gap.

Stack note: this is orthogonal to Emma's #4247 compliance-state
unification stack (#4250, #4263, #4264, #4268, #4274) which collapses
agent_test_history into agent_compliance_runs. Different files; rebases
cleanly in either order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…he fix

Runs comply() against an agent URL and prints what
deriveStoryboardStatuses would produce, without DB writes. Used to
validate the SDK-6.x scenario-key fix against real agents
(adcp-signals-adaptor.evgeny-193.workers.dev/mcp and
wonderstruck.sales-agent.scope3.com/mcp) before merging.

Will stay useful for future SDK upgrades that touch scenario emission
or storyboard-track aggregation — same pattern as the
diagnose-agent-comply-queue script from #4361.

Usage:
  npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-ids check, add 3 edge tests

Addresses code-reviewer feedback on PR #4364:
- JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total
  are not directly comparable across rows (some rows are real step counts,
  some are phase-level fallbacks when the SDK omits per-step data).
- Comment pinning the storyboard-id invariant (flat ids, no `/`) so the
  indexOf split stays correct as new storyboards land.
- Defensive `result.tracks ?? []` so a malformed result doesn't throw.
- Hoist `storyboardIds && length > 0` into a single `hasExplicitIds`
  const used at both the toEmit decision and the no-data fallback.
- Three new test cases:
  * same storyboard split across multiple tracks aggregates correctly
  * result.tracks absent → []
  * non-string scenario values (null, number) → skipped without throwing

12/12 vitest passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley
Contributor Author

Expert review pass — both clear

code-reviewer

No blockers. All actionable feedback addressed in c11abeb:

  • JSDoc clarifies that steps_passed/steps_total are not cross-row comparable (some rows real step counts, some phase-fallback)
  • Storyboard-id invariant (flat ids, no /) pinned in a comment
  • result.tracks ?? [] defensive guard
  • hasExplicitIds hoisted out of repeated check
  • 3 new test cases: same-storyboard-across-tracks aggregation / result.tracks absent / non-string scenario value
  • Changeset empty-frontmatter is intentional (matches repo pattern for non-version-bump changes)

12/12 vitest passing.
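
To illustrate why the flat-id invariant matters for the indexOf split, here is a small sketch. `parseScenarioKey` is a hypothetical helper, not a function from this PR, and rejecting a key with an empty phase id is an assumption of the sketch.

```typescript
// Hypothetical helper: splits "<storyboard_id>/<phase_id>" at the FIRST slash.
function parseScenarioKey(scenario: unknown): { storyboardId: string; phaseId: string } | null {
  if (typeof scenario !== "string") return null; // null, numbers, etc. are skipped
  const slash = scenario.indexOf("/");
  // No slash (legacy bare name), leading slash, or trailing slash: not a usable key.
  if (slash <= 0 || slash === scenario.length - 1) return null;
  return { storyboardId: scenario.slice(0, slash), phaseId: scenario.slice(slash + 1) };
}

console.log(parseScenarioKey("media_buy_seller/setup"));
// { storyboardId: 'media_buy_seller', phaseId: 'setup' }
console.log(parseScenarioKey("capability_discovery")); // null
console.log(parseScenarioKey(42)); // null
```

Splitting at the first slash tolerates a `/` inside a phase id, but a storyboard id containing `/` would be truncated at its first segment, which is why the invariant is worth pinning in a comment.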

adtech-product-expert

Verdict: ship, with four product-side notes (none are blockers):

  1. Vendor notice before deploy. 18 registered agents is hand-emailable. Frame as fidelity improvement, not regression — get ahead of the Evgeny-shape failure mode (vendor publicly claims "30/30 passing", dashboard contradicts them next heartbeat).
  2. 30/30 vs storyboard-percentage dissonance is a real UX gap that this PR doesn't solve. Track summary counts steps executed; storyboard percentage counts steps passing inside a protocol bundle. Long-term: deprecate the track summary on the public dashboard, keep it in operator/debug view. Short-term: tooltip on the storyboard column. Follow-up issue, not a blocker.
  3. Buy-side consumers: shape unchanged, values shift. Scope3 is the only known consumer at scale, and Wonderstruck (their agent) comes out better (10/10 on declared specialism). Low risk.
  4. Badge issuance: "let the heartbeat run" — 2h drain is the staged rollout. Have someone watching first sweep.

Plan

  • Merge once CI green
  • Open follow-up issue for the dashboard reconciliation (track vs storyboard surface)
  • Vendor outreach happens out-of-band before/after deploy per product team

Trust-but-verify quotes:

  • sales_non_guaranteed for wonderstruck: 6 phases × 1 step in YAML = 6 declared; SDK runtime adds setup/seeding → 10 total → 10/10 passing. That's the specialism storyboard. The deeper protocol baseline (media_buy_seller, 89 steps) shows partial 42/89 — that's the real test depth.

@bokelley
Contributor Author

Got it — both reviewers clear, merge plan looks solid. Happy to open the follow-up issue for the track-summary vs. storyboard-percentage dashboard reconciliation if you want to hand that off; just say the word.


Generated by Claude Code

@bokelley bokelley merged commit e2e14a6 into main May 11, 2026
14 checks passed
@bokelley bokelley deleted the bokelley/fix-storyboard-status-scenario-keys branch May 11, 2026 08:55
bokelley added a commit that referenced this pull request May 11, 2026
…4374)

Adds an "X / Y storyboards passing" element between the SDK headline
("2 silent" etc.) and the track pills, with a tooltip explaining the
relationship:

  storyboards = canonical conformance unit
                (each applicable specialism + protocol baseline +
                universal check is one storyboard, pass or fail)
  track pills = SDK's coarse roll-up that can read as "passing" even
                when underlying storyboards are partial — useful for
                quick glance but misleading in isolation

Track pills gain their own tooltip pointing readers at the Verification
panel for per-storyboard detail.

Resolves the Evgeny-shape disconnect from escalation #329: track
summary showed "2 silent / 30 of 30 scenarios passing" while the
agent's signal_owned specialism storyboard was 1/5 steps. With the
data flowing correctly after PR #4364, this surface change closes the
loop on the adtech-product reviewer's "deprecate track summary on the
public dashboard, keep it operator-only" call by making the storyboard
count visually prominent and clarifying that the SDK track pills are
debug context.

Push A item 4 of 4 in the compliance reporting fidelity initiative.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>