feat(dashboard): surface verdict_source + per-run triggered_by badge#4263
Conversation
c71809a to
42e7f37
Compare
47b26d4 to
fd7d406
Compare
…o keys (#4364) * fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys The compliance heartbeat has been writing zero rows to agent_storyboard_status since the SDK switched comply() to storyboard- driven testing. The SDK emits one TestResult per phase of each storyboard, keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation walked the YAML's per-step `comply_scenario` field (bare names like `signals_flow`, `capability_discovery`) and looked them up in the SDK's scenario map. Every lookup missed → testedCount === 0 → every storyboard skipped at the `continue` guard. Effect across the registry: agent_storyboard_status total rows: 6 (across 4 agents) rows written by triggered_by='heartbeat': 0 rows surviving were legacy bare-name keys from old manual runs This silently broke the AAO Verified badge pipeline (no storyboard rows → deriveVerificationStatus has nothing to verify against) and every agent's dashboard `storyboards_passing: 0 / N` was misleading: the runner wasn't failing storyboards, the parser was dropping them. Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios clean but showing `degraded` because specialism_status.signal-owned read 'untested' from a never-populated agent_storyboard_status row. Fix: read SDK output directly. Group scenarios by storyboard id, roll per-step pass counts up from each phase's `steps` array, fall back to phase-level counts when steps are absent. The `storyboardIds` override is preserved for explicit-IDs callers that need an `untested` entry when the runner didn't run a requested storyboard. The unused YAML `comply_scenario` field is no longer load-bearing for status mapping (the SDK already knows which storyboards it ran). Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback, legacy bare-name skip, empty input, and explicit-IDs untested gap. Stack note: this is orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which collapses agent_test_history into agent_compliance_runs. Different files; rebases cleanly in either order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(scripts): test-comply-storyboard-statuses — local harness for the fix Runs comply() against an agent URL and prints what deriveStoryboardStatuses would produce, without DB writes. Used to validate the SDK-6.x scenario-key fix against real agents (adcp-signals-adaptor.evgeny-193.workers.dev/mcp and wonderstruck.sales-agent.scope3.com/mcp) before merging. Will stay useful for future SDK upgrades that touch scenario emission or storyboard-track aggregation — same pattern as the diagnose-agent-comply-queue script from #4361. Usage: npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests Addresses code-reviewer feedback on PR #4364: - JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total are not directly comparable across rows (some rows are real step counts, some are phase-level fallbacks when the SDK omits per-step data). - Comment pinning the storyboard-id invariant (flat ids, no `/`) so the indexOf split stays correct as new storyboards land. - Defensive `result.tracks ?? []` so a malformed result doesn't throw. - Hoist `storyboardIds && length > 0` into a single `hasExplicitIds` const used at both the toEmit decision and the no-data fallback. - Three new test cases: * same storyboard split across multiple tracks aggregates correctly * result.tracks absent → [] * non-string scenario values (null, number) → skipped without throwing 12/12 vitest passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
54a4f16 to
f7f933b
Compare
Six changes responding to expert review: 1. Renumber migration 472 → 475 (collision with 472_drop_member_profiles_ primary_brand_domain.sql on main since this PR was opened). 2. Drop `verdict_source` from the public /api/registry/agents/:url/compliance response. Heartbeat and owner_test both call comply() against the same registered URL with the same owner-saved credentials; the verdict's truth content is identical regardless of who pulled the trigger. Surfacing the source label to buyers creates a trust distinction the underlying observation doesn't carry. `triggered_by` stays as internal audit on agent_compliance_runs; Emma's #4263 dashboard surface keeps the operator UX cue. 3. Drop the legacy `recordComplianceRun(... 'manual')` write inside evaluate_agent_quality. Pre-PR, that write was invisible publicly. Post-PR (with verdict_source on the response), it would have leaked `verdict_source: 'manual'` for non-owner runs — security-reviewer's B1. Even with verdict_source removed (item 2), the legacy write was gated only on `agent_contexts` row existence — which `save_agent` lets any org create for any URL without an ownership check — so a non-owner could publish a verdict on someone else's agent. The owner-test branch covers the dashboard-freshness use case with a real ownership check; the legacy public-state write has no remaining function. Legacy agent_test_history write retained as session-scoped audit until Emma's PR 3 cleans up. 4. Dashboard "Run this storyboard" endpoint now writes `triggered_by = 'owner_test'` instead of `'manual'` (owner-only path, consistent with the new semantic). Legacy `'manual'` value remains in the enum for historical rows. 5. Rate limit on evaluate_agent_quality via existing Addie tool rate limiter (default 60/10min). comply() itself takes 10-60s per run so the natural ceiling is already ~1-2/min — this adds a hard wall above that. Security-reviewer's B2. 6. Updated changeset content + migration number references; replaces stale "migration 471" prose from earlier renumber. Out of scope (deferred follow-ups): - Badge issuance on owner_test transitions: owner currently waits up to 1h for next heartbeat to re-issue badge. Skipped because the badge fan-out in compliance-heartbeat.ts is ~70 lines and extracting/sharing is its own refactor. - Extracting resolveAgentOwnerOrg to a shared helper: same drift-surface call applies, kept inline to bound this PR's scope. - Active-membership filter: organization_memberships has no status column (row deletion = removal), so the security-review M2 concern doesn't apply at the schema layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce state (#4250) * fix(addie): owner evaluate_agent_quality writes to canonical compliance state Closes the 12-hour gap between owner-triggered storyboard runs and the public /api/registry/agents/:url/compliance endpoint (issue #4247, PR 1 of 4). When evaluate_agent_quality is triggered by the agent owner, the result is now written to agent_compliance_status + agent_compliance_runs + agent_storyboard_status with triggered_by = 'owner_test'. Non-owner runs continue writing to agent_test_history (deprecated in PR 3). Migration 471 adds 'owner_test' to both triggered_by CHECK constraints. notifyComplianceChange is intentionally suppressed for owner runs to prevent iteration-loop Slack spam. https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb * fix(addie): add verdict_source to compliance response + last-write-wins test Address review feedback from @EmmaLouise2018 on PR #4250: 1. `verdict_source` field on /api/registry/agents/:url/compliance — `AgentComplianceDetailSchema` gains optional `verdict_source`: 'heartbeat' | 'owner_test' | 'manual' | 'webhook' | null — `getComplianceStatus` and `bulkGetComplianceStatus` join `agent_compliance_runs` via LATERAL subquery (dry_run=false, ORDER BY tested_at DESC LIMIT 1) to surface the triggered_by of the most recent run. No migration needed. — Endpoint response emits `verdict_source: status.last_triggered_by`. — `AgentComplianceStatus` interface gets `last_triggered_by` field. 2. Last-write-wins contract test — New `compliance-db-last-write-wins.test.ts` pins the ON CONFLICT DO UPDATE semantics: every recordComplianceRun call overwrites agent_compliance_status regardless of triggered_by source. A future change to first-write-wins or priority ordering would break these tests. https://claude.ai/code/session_01NVVqgeSGevUGXgDbMw1XKZ * chore(dist): remove accidentally committed onboarding-openapi build artifacts Generated JS/TS files don't belong in source control. Also adds .gitignore rules for dist/schemas/*.{js,d.ts,*.map} to prevent recurrence. https://claude.ai/code/session_01WrFMahjaHU7y4JWqPqxTbx * fix(migrate): renumber 471 → 472 (resolve clash with manager_revalidation_queue on main) * fix(addie): address Brian's review — DDL lock guard + cross-org gap note * review iteration: address security + correctness feedback on #4250 Six changes responding to expert review: 1. Renumber migration 472 → 475 (collision with 472_drop_member_profiles_ primary_brand_domain.sql on main since this PR was opened). 2. Drop `verdict_source` from the public /api/registry/agents/:url/compliance response. Heartbeat and owner_test both call comply() against the same registered URL with the same owner-saved credentials; the verdict's truth content is identical regardless of who pulled the trigger. Surfacing the source label to buyers creates a trust distinction the underlying observation doesn't carry. `triggered_by` stays as internal audit on agent_compliance_runs; Emma's #4263 dashboard surface keeps the operator UX cue. 3. Drop the legacy `recordComplianceRun(... 'manual')` write inside evaluate_agent_quality. Pre-PR, that write was invisible publicly. Post-PR (with verdict_source on the response), it would have leaked `verdict_source: 'manual'` for non-owner runs — security-reviewer's B1. Even with verdict_source removed (item 2), the legacy write was gated only on `agent_contexts` row existence — which `save_agent` lets any org create for any URL without an ownership check — so a non-owner could publish a verdict on someone else's agent. The owner-test branch covers the dashboard-freshness use case with a real ownership check; the legacy public-state write has no remaining function. Legacy agent_test_history write retained as session-scoped audit until Emma's PR 3 cleans up. 4. Dashboard "Run this storyboard" endpoint now writes `triggered_by = 'owner_test'` instead of `'manual'` (owner-only path, consistent with the new semantic). Legacy `'manual'` value remains in the enum for historical rows. 5. Rate limit on evaluate_agent_quality via existing Addie tool rate limiter (default 60/10min). comply() itself takes 10-60s per run so the natural ceiling is already ~1-2/min — this adds a hard wall above that. Security-reviewer's B2. 6. Updated changeset content + migration number references; replaces stale "migration 471" prose from earlier renumber. Out of scope (deferred follow-ups): - Badge issuance on owner_test transitions: owner currently waits up to 1h for next heartbeat to re-issue badge. Skipped because the badge fan-out in compliance-heartbeat.ts is ~70 lines and extracting/sharing is its own refactor. - Extracting resolveAgentOwnerOrg to a shared helper: same drift-surface call applies, kept inline to bound this PR's scope. - Active-membership filter: organization_memberships has no status column (row deletion = removal), so the security-review M2 concern doesn't apply at the schema layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Emma Mulitz <emulitz@scope3.com>
fd7d406 to
a40cddd
Compare
PR 2 of the #4247 unification stack. Reads two fields PR #4250 added to the compliance API but the dashboard wasn't yet rendering: - compliance tile: appends "(your test)" / "(heartbeat)" / "(manual)" / "(webhook)" after Last checked, so operators see whether the current verdict came from their own evaluate_agent_quality run or the scheduled heartbeat. - history panel: per-run badge with the same source label, info-blue for owner_test and neutral for the rest. Pre-PR-1 rows render with neutral — no regression. No backend changes; pure UI surfacing of fields already in the API. Stacked on PR #4250.
a40cddd to
48f96aa
Compare
|
Review iteration Reviewer flagged that gating Fix: added Updated Also trimmed the schema description per the reviewer's nit. |
#4378) (#4389) Adds 6 explicit unit tests to `tests/unit/membership-tiers.test.ts` covering the `is_owner` field that gates `verdict_source` exposure on the public `/api/registry/agents/:url/compliance` response. The matrix: - Anonymous (no userId) → is_owner: false - User not a member of any owning org → is_owner: false - Resolved org deleted (orphan profile) → is_owner: false - Owner on free Explorer tier → is_owner: true (broader than is_api_access_tier, which is false for Explorer — pins that free- tier owners still get verdict_source on their own dashboard) - Owner on API-access tier with active sub → is_owner: true - Owner with canceled sub → is_owner: true (ownership ≠ entitlement; the dashboard cue still works, badges don't until reactivation) Pre-PR #4263 review feedback flagged that gating verdict_source on is_api_access_tier (the previous design) was too narrow — free-tier owners would see null on their own agent. These tests anchor the correction (is_owner is the right gate) so a future refactor can't silently regress. Full route-level integration test (anonymous → null, cross-org → null, owner → populated through `request(app).get(...)`) deferred — requires DB-fixture scaffolding not currently in place for this endpoint. Follow-up will set up the harness and write those. Partial completion of #4378. Leaves the issue open for the integration-test follow-up. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 3 of the #4247 unification stack. Migration 472 backfills every agent_test_history row with a user_id into agent_compliance_runs as triggered_by='owner_test', carrying the source id in observations_json.backfill_source_id for idempotent re-runs. Each agent's latest backfilled row upserts into agent_compliance_status so the dashboard immediately reflects a real verdict for agents tested through Addie pre-PR-#4250. evaluate_agent_quality stops calling recordTest() when the caller owns the agent — that was the dual-write that #4247 is closing. recordTest is retained ONLY for third-party runs so strangers testing someone else's agent still have a session-scoped audit trail. Drop of agent_test_history table is deferred behind the 14-day soak from #4250 + 7-day soak from #4263 + S3 cold-storage export of non-owner rows. Migration 472 documents this in its trailing comment. Stacked on #4263 → #4250.
PR 2 of the #4247 unification stack. Stacked on #4250.
Summary
PR #4250 added
verdict_sourceto/api/registry/agents/:url/complianceandtriggered_byto each row returned by/api/registry/agents/:url/compliance/history. The dashboard wasn't rendering either. This PR surfaces both so operators can distinguish their own on-demand runs from scheduled heartbeat verdicts at a glance.Why
Brian's review on #4250 (and the #4247 plan): the public compliance contract is shifting from "last scheduled verdict" to "last verdict from any source." Telling the caller — including the operator viewing their own dashboard — that the current verdict is from
your testvsheartbeatis the load-bearing UX clarification. Without it, an operator runsevaluate_agent_quality, sees the dashboard tile flip to passing, and has no signal that the public registry now reflects their on-demand run rather than the cron's verdict.What changes
(your test)/(heartbeat)/(manual)/(webhook)afterLast checked: 3m ago. Empty string whenverdict_sourceis null (never run).triggered_by = 'owner_test'; neutral forheartbeat/manual/webhookso pre-PR-1 rows render without regression.No backend changes. Pure UI on fields the API already emits.
Stacked on
verdict_sourceandtriggered_byto the API responsesThis PR's diff is read against #4250's branch. Merge order: #4250 → this PR.
Out of scope (later PRs in the #4247 stack)
agent_test_history, backfill owner-triggered rows intoagent_compliance_runs, S3-export third-party rows. Destructive migration; soaks behind PR 2.agent_contexts.last_test_*columns into a derived view. Pure schema cleanup.Test plan
tsc --noEmit -p server/tsconfig.jsoncleanevaluate_agent_quality, observe dashboard compliance tile shows(your test). Wait for next heartbeat, observe it flips to(heartbeat).Your testbadge, heartbeat runs render with neutral badge.triggered_byvalue, or only'heartbeat') render cleanly without empty/garbage labels.