Skip to content

feat(dashboard): surface verdict_source + per-run triggered_by badge#4263

Merged
bokelley merged 1 commit into
mainfrom
EmmaLouise2018/unification-pr2-dashboard-triggered-by
May 11, 2026
Merged

feat(dashboard): surface verdict_source + per-run triggered_by badge#4263
bokelley merged 1 commit into
mainfrom
EmmaLouise2018/unification-pr2-dashboard-triggered-by

Conversation

@EmmaLouise2018
Copy link
Copy Markdown
Contributor

PR 2 of the #4247 unification stack. Stacked on #4250.

Summary

PR #4250 added verdict_source to /api/registry/agents/:url/compliance and triggered_by to each row returned by /api/registry/agents/:url/compliance/history. The dashboard wasn't rendering either. This PR surfaces both so operators can distinguish their own on-demand runs from scheduled heartbeat verdicts at a glance.

Why

Brian's review on #4250 (and the #4247 plan): the public compliance contract is shifting from "last scheduled verdict" to "last verdict from any source." Telling the caller — including the operator viewing their own dashboard — that the current verdict is from your test vs heartbeat is the load-bearing UX clarification. Without it, an operator runs evaluate_agent_quality, sees the dashboard tile flip to passing, and has no signal that the public registry now reflects their on-demand run rather than the cron's verdict.

What changes

  • Compliance tile: appends (your test) / (heartbeat) / (manual) / (webhook) after Last checked: 3m ago. Empty string when verdict_source is null (never run).
  • History panel: per-run badge with the same source label. Info-blue for triggered_by = 'owner_test'; neutral for heartbeat / manual / webhook so pre-PR-1 rows render without regression.

No backend changes. Pure UI on fields the API already emits.

Stacked on

This PR's diff is read against #4250's branch. Merge order: #4250 → this PR.

Out of scope (later PRs in the #4247 stack)

  • PR 3 — drop agent_test_history, backfill owner-triggered rows into agent_compliance_runs, S3-export third-party rows. Destructive migration; soaks behind PR 2.
  • PR 4 — collapse agent_contexts.last_test_* columns into a derived view. Pure schema cleanup.

Test plan

  • tsc --noEmit -p server/tsconfig.json clean
  • Build passes
  • Manual smoke after deploy: register agent, run evaluate_agent_quality, observe dashboard compliance tile shows (your test). Wait for next heartbeat, observe it flips to (heartbeat).
  • Manual smoke: open History panel, observe owner-triggered runs render with blue Your test badge, heartbeat runs render with neutral badge.
  • Pre-PR-1 rows (no triggered_by value, or only 'heartbeat') render cleanly without empty/garbage labels.

@bokelley
Copy link
Copy Markdown
Contributor

bokelley commented May 9, 2026

Code review (expert pass): clean. Will cascade after #4250 strips its committed dist/ files and #4264 fixes the dual-write-window dup-guard.

@EmmaLouise2018 EmmaLouise2018 force-pushed the claude/issue-4247-owner-test-canonical-write branch from c71809a to 42e7f37 Compare May 9, 2026 00:09
@EmmaLouise2018 EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from 47b26d4 to fd7d406 Compare May 9, 2026 01:07
bokelley added a commit that referenced this pull request May 11, 2026
…o keys (#4364)

* fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys

The compliance heartbeat has been writing zero rows to
agent_storyboard_status since the SDK switched comply() to storyboard-
driven testing. The SDK emits one TestResult per phase of each storyboard,
keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario
(see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation
walked the YAML's per-step `comply_scenario` field (bare names like
`signals_flow`, `capability_discovery`) and looked them up in the SDK's
scenario map. Every lookup missed → testedCount === 0 → every storyboard
skipped at the `continue` guard.

Effect across the registry:
  agent_storyboard_status total rows: 6  (across 4 agents)
  rows written by triggered_by='heartbeat': 0
  rows surviving were legacy bare-name keys from old manual runs

This silently broke the AAO Verified badge pipeline (no storyboard rows
→ deriveVerificationStatus has nothing to verify against) and every
agent's dashboard `storyboards_passing: 0 / N` was misleading: the
runner wasn't failing storyboards, the parser was dropping them.

Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios
clean but showing `degraded` because specialism_status.signal-owned read
'untested' from a never-populated agent_storyboard_status row.

Fix: read SDK output directly. Group scenarios by storyboard id, roll
per-step pass counts up from each phase's `steps` array, fall back to
phase-level counts when steps are absent. The `storyboardIds` override
is preserved for explicit-IDs callers that need an `untested` entry
when the runner didn't run a requested storyboard. The unused YAML
`comply_scenario` field is no longer load-bearing for status mapping
(the SDK already knows which storyboards it ran).

Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback,
legacy bare-name skip, empty input, and explicit-IDs untested gap.

Stack note: this is orthogonal to Emma's #4247 compliance-state
unification stack (#4250, #4263, #4264, #4268, #4274) which collapses
agent_test_history into agent_compliance_runs. Different files; rebases
cleanly in either order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(scripts): test-comply-storyboard-statuses — local harness for the fix

Runs comply() against an agent URL and prints what
deriveStoryboardStatuses would produce, without DB writes. Used to
validate the SDK-6.x scenario-key fix against real agents
(adcp-signals-adaptor.evgeny-193.workers.dev/mcp and
wonderstruck.sales-agent.scope3.com/mcp) before merging.

Will stay useful for future SDK upgrades that touch scenario emission
or storyboard-track aggregation — same pattern as the
diagnose-agent-comply-queue script from #4361.

Usage:
  npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests

Addresses code-reviewer feedback on PR #4364:
- JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total
  are not directly comparable across rows (some rows are real step counts,
  some are phase-level fallbacks when the SDK omits per-step data).
- Comment pinning the storyboard-id invariant (flat ids, no `/`) so the
  indexOf split stays correct as new storyboards land.
- Defensive `result.tracks ?? []` so a malformed result doesn't throw.
- Hoist `storyboardIds && length > 0` into a single `hasExplicitIds`
  const used at both the toEmit decision and the no-data fallback.
- Three new test cases:
  * same storyboard split across multiple tracks aggregates correctly
  * result.tracks absent → []
  * non-string scenario values (null, number) → skipped without throwing

12/12 vitest passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley force-pushed the claude/issue-4247-owner-test-canonical-write branch from 54a4f16 to f7f933b Compare May 11, 2026 09:10
bokelley added a commit that referenced this pull request May 11, 2026
Six changes responding to expert review:

1. Renumber migration 472 → 475 (collision with 472_drop_member_profiles_
   primary_brand_domain.sql on main since this PR was opened).

2. Drop `verdict_source` from the public /api/registry/agents/:url/compliance
   response. Heartbeat and owner_test both call comply() against the same
   registered URL with the same owner-saved credentials; the verdict's
   truth content is identical regardless of who pulled the trigger.
   Surfacing the source label to buyers creates a trust distinction the
   underlying observation doesn't carry. `triggered_by` stays as internal
   audit on agent_compliance_runs; Emma's #4263 dashboard surface keeps
   the operator UX cue.

3. Drop the legacy `recordComplianceRun(... 'manual')` write inside
   evaluate_agent_quality. Pre-PR, that write was invisible publicly.
   Post-PR (with verdict_source on the response), it would have leaked
   `verdict_source: 'manual'` for non-owner runs — security-reviewer's B1.
   Even with verdict_source removed (item 2), the legacy write was
   gated only on `agent_contexts` row existence — which `save_agent`
   lets any org create for any URL without an ownership check — so a
   non-owner could publish a verdict on someone else's agent.
   The owner-test branch covers the dashboard-freshness use case with a
   real ownership check; the legacy public-state write has no remaining
   function. Legacy agent_test_history write retained as session-scoped
   audit until Emma's PR 3 cleans up.

4. Dashboard "Run this storyboard" endpoint now writes `triggered_by =
   'owner_test'` instead of `'manual'` (owner-only path, consistent with
   the new semantic). Legacy `'manual'` value remains in the enum for
   historical rows.

5. Rate limit on evaluate_agent_quality via existing Addie tool rate
   limiter (default 60/10min). comply() itself takes 10-60s per run so
   the natural ceiling is already ~1-2/min — this adds a hard wall above
   that. Security-reviewer's B2.

6. Updated changeset content + migration number references; replaces
   stale "migration 471" prose from earlier renumber.

Out of scope (deferred follow-ups):
- Badge issuance on owner_test transitions: owner currently waits up to 1h
  for next heartbeat to re-issue badge. Skipped because the badge fan-out
  in compliance-heartbeat.ts is ~70 lines and extracting/sharing is its
  own refactor.
- Extracting resolveAgentOwnerOrg to a shared helper: same drift-surface
  call applies, kept inline to bound this PR's scope.
- Active-membership filter: organization_memberships has no status
  column (row deletion = removal), so the security-review M2 concern
  doesn't apply at the schema layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 11, 2026
…ce state (#4250)

* fix(addie): owner evaluate_agent_quality writes to canonical compliance state

Closes the 12-hour gap between owner-triggered storyboard runs and the public
/api/registry/agents/:url/compliance endpoint (issue #4247, PR 1 of 4).

When evaluate_agent_quality is triggered by the agent owner, the result is now
written to agent_compliance_status + agent_compliance_runs + agent_storyboard_status
with triggered_by = 'owner_test'. Non-owner runs continue writing to agent_test_history
(deprecated in PR 3). Migration 471 adds 'owner_test' to both triggered_by CHECK
constraints. notifyComplianceChange is intentionally suppressed for owner runs to
prevent iteration-loop Slack spam.

https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb

* fix(addie): add verdict_source to compliance response + last-write-wins test

Address review feedback from @EmmaLouise2018 on PR #4250:

1. `verdict_source` field on /api/registry/agents/:url/compliance
   — `AgentComplianceDetailSchema` gains optional `verdict_source`:
     'heartbeat' | 'owner_test' | 'manual' | 'webhook' | null
   — `getComplianceStatus` and `bulkGetComplianceStatus` join
     `agent_compliance_runs` via LATERAL subquery (dry_run=false,
     ORDER BY tested_at DESC LIMIT 1) to surface the triggered_by
     of the most recent run.  No migration needed.
   — Endpoint response emits `verdict_source: status.last_triggered_by`.
   — `AgentComplianceStatus` interface gets `last_triggered_by` field.

2. Last-write-wins contract test
   — New `compliance-db-last-write-wins.test.ts` pins the ON CONFLICT
     DO UPDATE semantics: every recordComplianceRun call overwrites
     agent_compliance_status regardless of triggered_by source.  A
     future change to first-write-wins or priority ordering would
     break these tests.

https://claude.ai/code/session_01NVVqgeSGevUGXgDbMw1XKZ

* chore(dist): remove accidentally committed onboarding-openapi build artifacts

Generated JS/TS files don't belong in source control. Also adds
.gitignore rules for dist/schemas/*.{js,d.ts,*.map} to prevent recurrence.

https://claude.ai/code/session_01WrFMahjaHU7y4JWqPqxTbx

* fix(migrate): renumber 471 → 472 (resolve clash with manager_revalidation_queue on main)

* fix(addie): address Brian's review — DDL lock guard + cross-org gap note

* review iteration: address security + correctness feedback on #4250

Six changes responding to expert review:

1. Renumber migration 472 → 475 (collision with 472_drop_member_profiles_
   primary_brand_domain.sql on main since this PR was opened).

2. Drop `verdict_source` from the public /api/registry/agents/:url/compliance
   response. Heartbeat and owner_test both call comply() against the same
   registered URL with the same owner-saved credentials; the verdict's
   truth content is identical regardless of who pulled the trigger.
   Surfacing the source label to buyers creates a trust distinction the
   underlying observation doesn't carry. `triggered_by` stays as internal
   audit on agent_compliance_runs; Emma's #4263 dashboard surface keeps
   the operator UX cue.

3. Drop the legacy `recordComplianceRun(... 'manual')` write inside
   evaluate_agent_quality. Pre-PR, that write was invisible publicly.
   Post-PR (with verdict_source on the response), it would have leaked
   `verdict_source: 'manual'` for non-owner runs — security-reviewer's B1.
   Even with verdict_source removed (item 2), the legacy write was
   gated only on `agent_contexts` row existence — which `save_agent`
   lets any org create for any URL without an ownership check — so a
   non-owner could publish a verdict on someone else's agent.
   The owner-test branch covers the dashboard-freshness use case with a
   real ownership check; the legacy public-state write has no remaining
   function. Legacy agent_test_history write retained as session-scoped
   audit until Emma's PR 3 cleans up.

4. Dashboard "Run this storyboard" endpoint now writes `triggered_by =
   'owner_test'` instead of `'manual'` (owner-only path, consistent with
   the new semantic). Legacy `'manual'` value remains in the enum for
   historical rows.

5. Rate limit on evaluate_agent_quality via existing Addie tool rate
   limiter (default 60/10min). comply() itself takes 10-60s per run so
   the natural ceiling is already ~1-2/min — this adds a hard wall above
   that. Security-reviewer's B2.

6. Updated changeset content + migration number references; replaces
   stale "migration 471" prose from earlier renumber.

Out of scope (deferred follow-ups):
- Badge issuance on owner_test transitions: owner currently waits up to 1h
  for next heartbeat to re-issue badge. Skipped because the badge fan-out
  in compliance-heartbeat.ts is ~70 lines and extracting/sharing is its
  own refactor.
- Extracting resolveAgentOwnerOrg to a shared helper: same drift-surface
  call applies, kept inline to bound this PR's scope.
- Active-membership filter: organization_memberships has no status
  column (row deletion = removal), so the security-review M2 concern
  doesn't apply at the schema layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Emma Mulitz <emulitz@scope3.com>
Base automatically changed from claude/issue-4247-owner-test-canonical-write to main May 11, 2026 09:45
@bokelley bokelley force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from fd7d406 to a40cddd Compare May 11, 2026 09:54
PR 2 of the #4247 unification stack. Reads two fields PR #4250 added
to the compliance API but the dashboard wasn't yet rendering:

- compliance tile: appends "(your test)" / "(heartbeat)" / "(manual)"
  / "(webhook)" after Last checked, so operators see whether the
  current verdict came from their own evaluate_agent_quality run or
  the scheduled heartbeat.
- history panel: per-run badge with the same source label, info-blue
  for owner_test and neutral for the rest. Pre-PR-1 rows render with
  neutral — no regression.

No backend changes; pure UI surfacing of fields already in the API.
Stacked on PR #4250.
@bokelley bokelley force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from a40cddd to 48f96aa Compare May 11, 2026 10:00
@bokelley
Copy link
Copy Markdown
Contributor

Review iteration 48f96aa:

Reviewer flagged that gating verdict_source on is_api_access_tier was too narrow — free-tier (Explorer) owners would see null on their own agent's dashboard and miss the "(your test)" cue.

Fix: added is_owner: boolean to the OwnerMembership internal struct in services/membership-tiers.ts. Gate verdict_source on ownerMembership.is_owner (any owner, any tier) instead. Other owner-scoped fields (membership_tier, subscription_status) are unchanged — they still populate per the existing semantics. is_owner stays internal (not in the API response) so the owner-detect-resistant response-shape property is preserved.

Updated membership-tiers.test.ts to include is_owner in the expected shapes; 21/21 passing.

Also trimmed the schema description per the reviewer's nit.

@bokelley bokelley merged commit 3f160f0 into main May 11, 2026
13 checks passed
@bokelley bokelley deleted the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch May 11, 2026 10:04
bokelley added a commit that referenced this pull request May 11, 2026
#4378) (#4389)

Adds 6 explicit unit tests to `tests/unit/membership-tiers.test.ts`
covering the `is_owner` field that gates `verdict_source` exposure on
the public `/api/registry/agents/:url/compliance` response. The matrix:

- Anonymous (no userId) → is_owner: false
- User not a member of any owning org → is_owner: false
- Resolved org deleted (orphan profile) → is_owner: false
- Owner on free Explorer tier → is_owner: true (broader than
  is_api_access_tier, which is false for Explorer — pins that free-
  tier owners still get verdict_source on their own dashboard)
- Owner on API-access tier with active sub → is_owner: true
- Owner with canceled sub → is_owner: true (ownership ≠ entitlement;
  the dashboard cue still works, badges don't until reactivation)

Pre-PR #4263 review feedback flagged that gating verdict_source on
is_api_access_tier (the previous design) was too narrow — free-tier
owners would see null on their own agent. These tests anchor the
correction (is_owner is the right gate) so a future refactor can't
silently regress.

Full route-level integration test (anonymous → null, cross-org →
null, owner → populated through `request(app).get(...)`) deferred —
requires DB-fixture scaffolding not currently in place for this
endpoint. Follow-up will set up the harness and write those.

Partial completion of #4378. Leaves the issue open for the
integration-test follow-up.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EmmaLouise2018 added a commit that referenced this pull request May 11, 2026
PR 3 of the #4247 unification stack.

Migration 472 backfills every agent_test_history row with a user_id
into agent_compliance_runs as triggered_by='owner_test', carrying the
source id in observations_json.backfill_source_id for idempotent re-runs.
Each agent's latest backfilled row upserts into agent_compliance_status
so the dashboard immediately reflects a real verdict for agents tested
through Addie pre-PR-#4250.

evaluate_agent_quality stops calling recordTest() when the caller owns
the agent — that was the dual-write that #4247 is closing. recordTest
is retained ONLY for third-party runs so strangers testing someone
else's agent still have a session-scoped audit trail.

Drop of agent_test_history table is deferred behind the 14-day soak
from #4250 + 7-day soak from #4263 + S3 cold-storage export of
non-owner rows. Migration 472 documents this in its trailing comment.

Stacked on #4263#4250.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants