feat(dashboard): surface verdict_source + per-run triggered_by badge by EmmaLouise2018 · Pull Request #4263 · adcontextprotocol/adcp

EmmaLouise2018 · 2026-05-08T20:20:55Z

PR 2 of the #4247 unification stack. Stacked on #4250.

Summary

PR #4250 added verdict_source to /api/registry/agents/:url/compliance and triggered_by to each row returned by /api/registry/agents/:url/compliance/history. The dashboard wasn't rendering either. This PR surfaces both so operators can distinguish their own on-demand runs from scheduled heartbeat verdicts at a glance.

Why

Brian's review on #4250 (and the #4247 plan): the public compliance contract is shifting from "last scheduled verdict" to "last verdict from any source." Telling the caller — including the operator viewing their own dashboard — that the current verdict is from your test vs heartbeat is the load-bearing UX clarification. Without it, an operator runs evaluate_agent_quality, sees the dashboard tile flip to passing, and has no signal that the public registry now reflects their on-demand run rather than the cron's verdict.

What changes

Compliance tile: appends (your test) / (heartbeat) / (manual) / (webhook) after Last checked: 3m ago. Empty string when verdict_source is null (never run).
History panel: per-run badge with the same source label. Info-blue for triggered_by = 'owner_test'; neutral for heartbeat / manual / webhook so pre-PR-1 rows render without regression.

No backend changes. Pure UI on fields the API already emits.

Stacked on

fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250 (PR 1 of Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247) — adds verdict_source and triggered_by to the API responses

This PR's diff is read against #4250's branch. Merge order: #4250 → this PR.

Out of scope (later PRs in the #4247 stack)

PR 3 — drop agent_test_history, backfill owner-triggered rows into agent_compliance_runs, S3-export third-party rows. Destructive migration; soaks behind PR 2.
PR 4 — collapse agent_contexts.last_test_* columns into a derived view. Pure schema cleanup.

Test plan

tsc --noEmit -p server/tsconfig.json clean
Build passes
Manual smoke after deploy: register agent, run evaluate_agent_quality, observe dashboard compliance tile shows (your test). Wait for next heartbeat, observe it flips to (heartbeat).
Manual smoke: open History panel, observe owner-triggered runs render with blue Your test badge, heartbeat runs render with neutral badge.
Pre-PR-1 rows (no triggered_by value, or only 'heartbeat') render cleanly without empty/garbage labels.

bokelley · 2026-05-09T00:02:53Z

Code review (expert pass): clean. Will cascade after #4250 strips its committed dist/ files and #4264 fixes the dual-write-window dup-guard.

…o keys (#4364) * fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys The compliance heartbeat has been writing zero rows to agent_storyboard_status since the SDK switched comply() to storyboard- driven testing. The SDK emits one TestResult per phase of each storyboard, keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation walked the YAML's per-step `comply_scenario` field (bare names like `signals_flow`, `capability_discovery`) and looked them up in the SDK's scenario map. Every lookup missed → testedCount === 0 → every storyboard skipped at the `continue` guard. Effect across the registry: agent_storyboard_status total rows: 6 (across 4 agents) rows written by triggered_by='heartbeat': 0 rows surviving were legacy bare-name keys from old manual runs This silently broke the AAO Verified badge pipeline (no storyboard rows → deriveVerificationStatus has nothing to verify against) and every agent's dashboard `storyboards_passing: 0 / N` was misleading: the runner wasn't failing storyboards, the parser was dropping them. Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios clean but showing `degraded` because specialism_status.signal-owned read 'untested' from a never-populated agent_storyboard_status row. Fix: read SDK output directly. Group scenarios by storyboard id, roll per-step pass counts up from each phase's `steps` array, fall back to phase-level counts when steps are absent. The `storyboardIds` override is preserved for explicit-IDs callers that need an `untested` entry when the runner didn't run a requested storyboard. The unused YAML `comply_scenario` field is no longer load-bearing for status mapping (the SDK already knows which storyboards it ran). Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback, legacy bare-name skip, empty input, and explicit-IDs untested gap. Stack note: this is orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which collapses agent_test_history into agent_compliance_runs. Different files; rebases cleanly in either order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(scripts): test-comply-storyboard-statuses — local harness for the fix Runs comply() against an agent URL and prints what deriveStoryboardStatuses would produce, without DB writes. Used to validate the SDK-6.x scenario-key fix against real agents (adcp-signals-adaptor.evgeny-193.workers.dev/mcp and wonderstruck.sales-agent.scope3.com/mcp) before merging. Will stay useful for future SDK upgrades that touch scenario emission or storyboard-track aggregation — same pattern as the diagnose-agent-comply-queue script from #4361. Usage: npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests Addresses code-reviewer feedback on PR #4364: - JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total are not directly comparable across rows (some rows are real step counts, some are phase-level fallbacks when the SDK omits per-step data). - Comment pinning the storyboard-id invariant (flat ids, no `/`) so the indexOf split stays correct as new storyboards land. - Defensive `result.tracks ?? []` so a malformed result doesn't throw. - Hoist `storyboardIds && length > 0` into a single `hasExplicitIds` const used at both the toEmit decision and the no-data fallback. - Three new test cases: * same storyboard split across multiple tracks aggregates correctly * result.tracks absent → [] * non-string scenario values (null, number) → skipped without throwing 12/12 vitest passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Six changes responding to expert review: 1. Renumber migration 472 → 475 (collision with 472_drop_member_profiles_ primary_brand_domain.sql on main since this PR was opened). 2. Drop `verdict_source` from the public /api/registry/agents/:url/compliance response. Heartbeat and owner_test both call comply() against the same registered URL with the same owner-saved credentials; the verdict's truth content is identical regardless of who pulled the trigger. Surfacing the source label to buyers creates a trust distinction the underlying observation doesn't carry. `triggered_by` stays as internal audit on agent_compliance_runs; Emma's #4263 dashboard surface keeps the operator UX cue. 3. Drop the legacy `recordComplianceRun(... 'manual')` write inside evaluate_agent_quality. Pre-PR, that write was invisible publicly. Post-PR (with verdict_source on the response), it would have leaked `verdict_source: 'manual'` for non-owner runs — security-reviewer's B1. Even with verdict_source removed (item 2), the legacy write was gated only on `agent_contexts` row existence — which `save_agent` lets any org create for any URL without an ownership check — so a non-owner could publish a verdict on someone else's agent. The owner-test branch covers the dashboard-freshness use case with a real ownership check; the legacy public-state write has no remaining function. Legacy agent_test_history write retained as session-scoped audit until Emma's PR 3 cleans up. 4. Dashboard "Run this storyboard" endpoint now writes `triggered_by = 'owner_test'` instead of `'manual'` (owner-only path, consistent with the new semantic). Legacy `'manual'` value remains in the enum for historical rows. 5. Rate limit on evaluate_agent_quality via existing Addie tool rate limiter (default 60/10min). comply() itself takes 10-60s per run so the natural ceiling is already ~1-2/min — this adds a hard wall above that. Security-reviewer's B2. 6. Updated changeset content + migration number references; replaces stale "migration 471" prose from earlier renumber. Out of scope (deferred follow-ups): - Badge issuance on owner_test transitions: owner currently waits up to 1h for next heartbeat to re-issue badge. Skipped because the badge fan-out in compliance-heartbeat.ts is ~70 lines and extracting/sharing is its own refactor. - Extracting resolveAgentOwnerOrg to a shared helper: same drift-surface call applies, kept inline to bound this PR's scope. - Active-membership filter: organization_memberships has no status column (row deletion = removal), so the security-review M2 concern doesn't apply at the schema layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@EmmaLouise2018

…ce state (#4250) * fix(addie): owner evaluate_agent_quality writes to canonical compliance state Closes the 12-hour gap between owner-triggered storyboard runs and the public /api/registry/agents/:url/compliance endpoint (issue #4247, PR 1 of 4). When evaluate_agent_quality is triggered by the agent owner, the result is now written to agent_compliance_status + agent_compliance_runs + agent_storyboard_status with triggered_by = 'owner_test'. Non-owner runs continue writing to agent_test_history (deprecated in PR 3). Migration 471 adds 'owner_test' to both triggered_by CHECK constraints. notifyComplianceChange is intentionally suppressed for owner runs to prevent iteration-loop Slack spam. https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb * fix(addie): add verdict_source to compliance response + last-write-wins test Address review feedback from @EmmaLouise2018 on PR #4250: 1. `verdict_source` field on /api/registry/agents/:url/compliance — `AgentComplianceDetailSchema` gains optional `verdict_source`: 'heartbeat' | 'owner_test' | 'manual' | 'webhook' | null — `getComplianceStatus` and `bulkGetComplianceStatus` join `agent_compliance_runs` via LATERAL subquery (dry_run=false, ORDER BY tested_at DESC LIMIT 1) to surface the triggered_by of the most recent run. No migration needed. — Endpoint response emits `verdict_source: status.last_triggered_by`. — `AgentComplianceStatus` interface gets `last_triggered_by` field. 2. Last-write-wins contract test — New `compliance-db-last-write-wins.test.ts` pins the ON CONFLICT DO UPDATE semantics: every recordComplianceRun call overwrites agent_compliance_status regardless of triggered_by source. A future change to first-write-wins or priority ordering would break these tests. https://claude.ai/code/session_01NVVqgeSGevUGXgDbMw1XKZ * chore(dist): remove accidentally committed onboarding-openapi build artifacts Generated JS/TS files don't belong in source control. Also adds .gitignore rules for dist/schemas/*.{js,d.ts,*.map} to prevent recurrence. https://claude.ai/code/session_01WrFMahjaHU7y4JWqPqxTbx * fix(migrate): renumber 471 → 472 (resolve clash with manager_revalidation_queue on main) * fix(addie): address Brian's review — DDL lock guard + cross-org gap note * review iteration: address security + correctness feedback on #4250 Six changes responding to expert review: 1. Renumber migration 472 → 475 (collision with 472_drop_member_profiles_ primary_brand_domain.sql on main since this PR was opened). 2. Drop `verdict_source` from the public /api/registry/agents/:url/compliance response. Heartbeat and owner_test both call comply() against the same registered URL with the same owner-saved credentials; the verdict's truth content is identical regardless of who pulled the trigger. Surfacing the source label to buyers creates a trust distinction the underlying observation doesn't carry. `triggered_by` stays as internal audit on agent_compliance_runs; Emma's #4263 dashboard surface keeps the operator UX cue. 3. Drop the legacy `recordComplianceRun(... 'manual')` write inside evaluate_agent_quality. Pre-PR, that write was invisible publicly. Post-PR (with verdict_source on the response), it would have leaked `verdict_source: 'manual'` for non-owner runs — security-reviewer's B1. Even with verdict_source removed (item 2), the legacy write was gated only on `agent_contexts` row existence — which `save_agent` lets any org create for any URL without an ownership check — so a non-owner could publish a verdict on someone else's agent. The owner-test branch covers the dashboard-freshness use case with a real ownership check; the legacy public-state write has no remaining function. Legacy agent_test_history write retained as session-scoped audit until Emma's PR 3 cleans up. 4. Dashboard "Run this storyboard" endpoint now writes `triggered_by = 'owner_test'` instead of `'manual'` (owner-only path, consistent with the new semantic). Legacy `'manual'` value remains in the enum for historical rows. 5. Rate limit on evaluate_agent_quality via existing Addie tool rate limiter (default 60/10min). comply() itself takes 10-60s per run so the natural ceiling is already ~1-2/min — this adds a hard wall above that. Security-reviewer's B2. 6. Updated changeset content + migration number references; replaces stale "migration 471" prose from earlier renumber. Out of scope (deferred follow-ups): - Badge issuance on owner_test transitions: owner currently waits up to 1h for next heartbeat to re-issue badge. Skipped because the badge fan-out in compliance-heartbeat.ts is ~70 lines and extracting/sharing is its own refactor. - Extracting resolveAgentOwnerOrg to a shared helper: same drift-surface call applies, kept inline to bound this PR's scope. - Active-membership filter: organization_memberships has no status column (row deletion = removal), so the security-review M2 concern doesn't apply at the schema layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Emma Mulitz <emulitz@scope3.com>

PR 2 of the #4247 unification stack. Reads two fields PR #4250 added to the compliance API but the dashboard wasn't yet rendering: - compliance tile: appends "(your test)" / "(heartbeat)" / "(manual)" / "(webhook)" after Last checked, so operators see whether the current verdict came from their own evaluate_agent_quality run or the scheduled heartbeat. - history panel: per-run badge with the same source label, info-blue for owner_test and neutral for the rest. Pre-PR-1 rows render with neutral — no regression. No backend changes; pure UI surfacing of fields already in the API. Stacked on PR #4250.

bokelley · 2026-05-11T10:00:50Z

Review iteration 48f96aa:

Reviewer flagged that gating verdict_source on is_api_access_tier was too narrow — free-tier (Explorer) owners would see null on their own agent's dashboard and miss the "(your test)" cue.

Fix: added is_owner: boolean to the OwnerMembership internal struct in services/membership-tiers.ts. Gate verdict_source on ownerMembership.is_owner (any owner, any tier) instead. Other owner-scoped fields (membership_tier, subscription_status) are unchanged — they still populate per the existing semantics. is_owner stays internal (not in the API response) so the owner-detect-resistant response-shape property is preserved.

Updated membership-tiers.test.ts to include is_owner in the expected shapes; 21/21 passing.

Also trimmed the schema description per the reviewer's nit.

#4378) (#4389) Adds 6 explicit unit tests to `tests/unit/membership-tiers.test.ts` covering the `is_owner` field that gates `verdict_source` exposure on the public `/api/registry/agents/:url/compliance` response. The matrix: - Anonymous (no userId) → is_owner: false - User not a member of any owning org → is_owner: false - Resolved org deleted (orphan profile) → is_owner: false - Owner on free Explorer tier → is_owner: true (broader than is_api_access_tier, which is false for Explorer — pins that free- tier owners still get verdict_source on their own dashboard) - Owner on API-access tier with active sub → is_owner: true - Owner with canceled sub → is_owner: true (ownership ≠ entitlement; the dashboard cue still works, badges don't until reactivation) Pre-PR #4263 review feedback flagged that gating verdict_source on is_api_access_tier (the previous design) was too narrow — free-tier owners would see null on their own agent. These tests anchor the correction (is_owner is the right gate) so a future refactor can't silently regress. Full route-level integration test (anonymous → null, cross-org → null, owner → populated through `request(app).get(...)`) deferred — requires DB-fixture scaffolding not currently in place for this endpoint. Follow-up will set up the harness and write those. Partial completion of #4378. Leaves the issue open for the integration-test follow-up. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR 3 of the #4247 unification stack. Migration 472 backfills every agent_test_history row with a user_id into agent_compliance_runs as triggered_by='owner_test', carrying the source id in observations_json.backfill_source_id for idempotent re-runs. Each agent's latest backfilled row upserts into agent_compliance_status so the dashboard immediately reflects a real verdict for agents tested through Addie pre-PR-#4250. evaluate_agent_quality stops calling recordTest() when the caller owns the agent — that was the dual-write that #4247 is closing. recordTest is retained ONLY for third-party runs so strangers testing someone else's agent still have a session-scoped audit trail. Drop of agent_test_history table is deferred behind the 14-day soak from #4250 + 7-day soak from #4263 + S3 cold-storage export of non-owner rows. Migration 472 documents this in its trailing comment. Stacked on #4263 → #4250.

This was referenced May 8, 2026

fix(addie): backfill owner test history + stop dual-write for owner runs #4264

Open

feat(compliance): derive agent_context.last_test_* from canonical runs #4268

Open

chore(compliance): drop agent_test_history + last_test_* — final cleanup #4274

Open

bokelley mentioned this pull request May 9, 2026

fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250

Merged

EmmaLouise2018 force-pushed the claude/issue-4247-owner-test-canonical-write branch from c71809a to 42e7f37 Compare May 9, 2026 00:09

EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from 47b26d4 to fd7d406 Compare May 9, 2026 01:07

bokelley mentioned this pull request May 11, 2026

fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys #4364

Merged

bokelley force-pushed the claude/issue-4247-owner-test-canonical-write branch from 54a4f16 to f7f933b Compare May 11, 2026 09:10

Base automatically changed from claude/issue-4247-owner-test-canonical-write to main May 11, 2026 09:45

bokelley force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from fd7d406 to a40cddd Compare May 11, 2026 09:54

bokelley force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from a40cddd to 48f96aa Compare May 11, 2026 10:00

bokelley merged commit 3f160f0 into main May 11, 2026
13 checks passed

bokelley deleted the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch May 11, 2026 10:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(dashboard): surface verdict_source + per-run triggered_by badge#4263

feat(dashboard): surface verdict_source + per-run triggered_by badge#4263
bokelley merged 1 commit into
mainfrom
EmmaLouise2018/unification-pr2-dashboard-triggered-by

EmmaLouise2018 commented May 8, 2026

Uh oh!

bokelley commented May 9, 2026

Uh oh!

bokelley commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EmmaLouise2018 commented May 8, 2026

Summary

Why

What changes

Stacked on

Out of scope (later PRs in the #4247 stack)

Test plan

Uh oh!

bokelley commented May 9, 2026

Uh oh!

bokelley commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants