feat(compliance): derive agent_context.last_test_* from canonical runs by EmmaLouise2018 · Pull Request #4268 · adcontextprotocol/adcp

EmmaLouise2018 · 2026-05-08T20:30:46Z

PR 4 of the #4247 unification stack. Stacked on #4264 → #4263 → #4250.

Summary

Replaces direct reads of `agent_contexts.last_test_*` with a view that derives them from `agent_compliance_runs` — the canonical source PR #4250 unified onto. Adds `triggered_org_id` so the view's per-org scope is accurate.

Why

After PR 1-3, owner test runs land in `agent_compliance_runs` with `triggered_by = 'owner_test'`. But that table only carries `agent_url`, no org dimension. Today's `agent_contexts.last_test_*` columns ARE org-scoped (`agent_contexts` PK is `(organization_id, agent_url)`). To collapse one into the other without losing semantics, runs need a triggering-org dimension.

`triggered_org_id` (nullable; populated only for `triggered_by = 'owner_test'`) closes the gap. Heartbeat / manual / webhook writes have no org, so NULL is correct.

What changes

Migration 473. Adds `agent_compliance_runs.triggered_org_id TEXT`. Partial index on `(triggered_org_id, agent_url, tested_at DESC) WHERE triggered_org_id IS NOT NULL` supports the view's per-org `DISTINCT ON` as a single index scan.
View `agent_context_with_latest_test`. `agent_contexts.` joined via `LEFT JOIN LATERAL` to the latest non-dry-run `agent_compliance_runs` row scoped by `(triggered_org_id, agent_url)`. Surfaces derived fields as `canonical_last_test_`.
`AgentContextDatabase` readers (`getByOrganization`, `getById`, `getByOrgAndUrl`) now SELECT from the view, aliasing `canonical_last_test_` → `last_test_` so callers see no shape change.
Owner-test write in `evaluate_agent_quality` populates `triggered_org_id` from the caller's `organizationId`.

Backward compat

The legacy `agent_contexts.last_test_*` columns stay. Third-party (non-owner) `recordTest()` writes still update them — that's the session-scoped audit trail PR 3 retained for non-owner runs. The columns become dead-letter once:

`agent_test_history` is dropped (gated on the soak windows in Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247)
`recordTest()` retires (follow-up "final cleanup" PR)

This PR is the prep work that makes that final cleanup a no-op for readers.

Stacked on

fix(addie): backfill owner test history + stop dual-write for owner runs #4264 (PR 3) — backfill owner test history + stop dual-write for owner runs
feat(dashboard): surface verdict_source + per-run triggered_by badge #4263 (PR 2) — dashboard surfaces verdict_source + triggered_by badge
fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250 (PR 1) — owner evaluate_agent_quality writes to canonical compliance state

Merge order: #4250 → #4263 → #4264 → this PR.

Test plan

`tsc --noEmit -p server/tsconfig.json` clean
Migration 473 applies cleanly on staging (`ALTER TABLE ADD COLUMN IF NOT EXISTS` is non-destructive; `CREATE OR REPLACE VIEW` is idempotent)
Owner test run populates `triggered_org_id` correctly; heartbeat run leaves it NULL
`getByOrganization` returns `last_test_*` derived from the view (verify via test that compares pre-migration column values to post-migration view-derived values for an owner-triggered run)
Cross-org isolation: two orgs with the same agent URL see distinct `last_test_*` derivations from the view

…ce state Closes the 12-hour gap between owner-triggered storyboard runs and the public /api/registry/agents/:url/compliance endpoint (issue #4247, PR 1 of 4). When evaluate_agent_quality is triggered by the agent owner, the result is now written to agent_compliance_status + agent_compliance_runs + agent_storyboard_status with triggered_by = 'owner_test'. Non-owner runs continue writing to agent_test_history (deprecated in PR 3). Migration 471 adds 'owner_test' to both triggered_by CHECK constraints. notifyComplianceChange is intentionally suppressed for owner runs to prevent iteration-loop Slack spam. https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb

Build-generated output produced by npm run build; matches the tracking pattern of member-agents-openapi.js and registry.js already in dist/schemas/. https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb

@EmmaLouise2018

…ns test Address review feedback from @EmmaLouise2018 on PR #4250: 1. `verdict_source` field on /api/registry/agents/:url/compliance — `AgentComplianceDetailSchema` gains optional `verdict_source`: 'heartbeat' | 'owner_test' | 'manual' | 'webhook' | null — `getComplianceStatus` and `bulkGetComplianceStatus` join `agent_compliance_runs` via LATERAL subquery (dry_run=false, ORDER BY tested_at DESC LIMIT 1) to surface the triggered_by of the most recent run. No migration needed. — Endpoint response emits `verdict_source: status.last_triggered_by`. — `AgentComplianceStatus` interface gets `last_triggered_by` field. 2. Last-write-wins contract test — New `compliance-db-last-write-wins.test.ts` pins the ON CONFLICT DO UPDATE semantics: every recordComplianceRun call overwrites agent_compliance_status regardless of triggered_by source. A future change to first-write-wins or priority ordering would break these tests. https://claude.ai/code/session_01NVVqgeSGevUGXgDbMw1XKZ

PR 2 of the #4247 unification stack. Reads two fields PR #4250 added to the compliance API but the dashboard wasn't yet rendering: - compliance tile: appends "(your test)" / "(heartbeat)" / "(manual)" / "(webhook)" after Last checked, so operators see whether the current verdict came from their own evaluate_agent_quality run or the scheduled heartbeat. - history panel: per-run badge with the same source label, info-blue for owner_test and neutral for the rest. Pre-PR-1 rows render with neutral — no regression. No backend changes; pure UI surfacing of fields already in the API. Stacked on PR #4250.

PR 3 of the #4247 unification stack. Migration 472 backfills every agent_test_history row with a user_id into agent_compliance_runs as triggered_by='owner_test', carrying the source id in observations_json.backfill_source_id for idempotent re-runs. Each agent's latest backfilled row upserts into agent_compliance_status so the dashboard immediately reflects a real verdict for agents tested through Addie pre-PR-#4250. evaluate_agent_quality stops calling recordTest() when the caller owns the agent — that was the dual-write that #4247 is closing. recordTest is retained ONLY for third-party runs so strangers testing someone else's agent still have a session-scoped audit trail. Drop of agent_test_history table is deferred behind the 14-day soak from #4250 + 7-day soak from #4263 + S3 cold-storage export of non-owner rows. Migration 472 documents this in its trailing comment. Stacked on #4263 → #4250.

PR 4 of the #4247 unification stack. Adds triggered_org_id to agent_compliance_runs so per-org scoping of the new agent_context_with_latest_test view is accurate. Without it, two orgs that own the same agent URL would conflate test history. Owner-test write path in evaluate_agent_quality populates it from the caller's organizationId; heartbeat/manual/webhook leave it NULL. agent_context_with_latest_test view: agent_contexts.* joined LATERAL to the latest non-dry-run agent_compliance_runs row scoped by (triggered_org_id, agent_url), plus COUNT for total_tests_run. agent-context-db.ts readers (getByOrganization, getById, getByOrgAndUrl) SELECT from the view and alias canonical_last_test_* → last_test_* so callers see no shape change. Legacy columns stay for backward compat — third-party recordTest() writes still hit them (session-scoped audit retained per PR 3). The columns + recordTest retire in the follow-up "final cleanup" PR that drops agent_test_history. Stacked on #4264 → #4263 → #4250.

bokelley · 2026-05-09T00:02:54Z

Code review (expert pass): solid, minor nits — but blocked by #4264 in the chain.

Nits (non-blocking):

agent_compliance_runs.triggered_org_id is TEXT, not FK to organizations.id/workos_organization_id. No referential integrity; an org delete leaves dangling rows. WorkOS IDs are foreign-system so a hard FK isn't appropriate, but a comment or partial CHECK on shape would help.
last_test_scenario derivation tracks_json -> 0 ->> 'track' is a semantic shift from the literal 'quality_evaluation' to a track name. Grep confirms no callers branch on the string, so safe — but call out in the changeset for downstream consumers.

View is plain CREATE OR REPLACE VIEW (not SECURITY DEFINER), inherits caller's RLS, per-org scoping correct. No security concern.

…cenario semantic shift in changeset

bokelley · 2026-05-09T00:05:50Z

Both nits addressed in 1103d56:

triggered_org_id comment — added COMMENT ON COLUMN in migration 473 explaining the TEXT/no-FK decision (WorkOS IDs are foreign-system keys; DB-layer referential integrity isn't appropriate here).
Changeset — added a **Semantic shift (last_test_scenario).** paragraph calling out the tracks_json[0].track vs. legacy literal-string difference for downstream consumers.

Blocked-by #4264 noted — this PR stays parked until the stack merges in order.

Generated by Claude Code

…o keys (#4364) * fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys The compliance heartbeat has been writing zero rows to agent_storyboard_status since the SDK switched comply() to storyboard- driven testing. The SDK emits one TestResult per phase of each storyboard, keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation walked the YAML's per-step `comply_scenario` field (bare names like `signals_flow`, `capability_discovery`) and looked them up in the SDK's scenario map. Every lookup missed → testedCount === 0 → every storyboard skipped at the `continue` guard. Effect across the registry: agent_storyboard_status total rows: 6 (across 4 agents) rows written by triggered_by='heartbeat': 0 rows surviving were legacy bare-name keys from old manual runs This silently broke the AAO Verified badge pipeline (no storyboard rows → deriveVerificationStatus has nothing to verify against) and every agent's dashboard `storyboards_passing: 0 / N` was misleading: the runner wasn't failing storyboards, the parser was dropping them. Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios clean but showing `degraded` because specialism_status.signal-owned read 'untested' from a never-populated agent_storyboard_status row. Fix: read SDK output directly. Group scenarios by storyboard id, roll per-step pass counts up from each phase's `steps` array, fall back to phase-level counts when steps are absent. The `storyboardIds` override is preserved for explicit-IDs callers that need an `untested` entry when the runner didn't run a requested storyboard. The unused YAML `comply_scenario` field is no longer load-bearing for status mapping (the SDK already knows which storyboards it ran). Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback, legacy bare-name skip, empty input, and explicit-IDs untested gap. Stack note: this is orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which collapses agent_test_history into agent_compliance_runs. Different files; rebases cleanly in either order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(scripts): test-comply-storyboard-statuses — local harness for the fix Runs comply() against an agent URL and prints what deriveStoryboardStatuses would produce, without DB writes. Used to validate the SDK-6.x scenario-key fix against real agents (adcp-signals-adaptor.evgeny-193.workers.dev/mcp and wonderstruck.sales-agent.scope3.com/mcp) before merging. Will stay useful for future SDK upgrades that touch scenario emission or storyboard-track aggregation — same pattern as the diagnose-agent-comply-queue script from #4361. Usage: npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests Addresses code-reviewer feedback on PR #4364: - JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total are not directly comparable across rows (some rows are real step counts, some are phase-level fallbacks when the SDK omits per-step data). - Comment pinning the storyboard-id invariant (flat ids, no `/`) so the indexOf split stays correct as new storyboards land. - Defensive `result.tracks ?? []` so a malformed result doesn't throw. - Hoist `storyboardIds && length > 0` into a single `hasExplicitIds` const used at both the toEmit decision and the no-data fallback. - Three new test cases: * same storyboard split across multiple tracks aggregates correctly * result.tracks absent → [] * non-string scenario values (null, number) → skipped without throwing 12/12 vitest passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude and others added 6 commits May 8, 2026 19:00

chore(dist): add generated onboarding-openapi schema files

7059147

Build-generated output produced by npm run build; matches the tracking pattern of member-agents-openapi.js and registry.js already in dist/schemas/. https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb

EmmaLouise2018 mentioned this pull request May 8, 2026

chore(compliance): drop agent_test_history + last_test_* — final cleanup #4274

Open

12 tasks

bokelley mentioned this pull request May 9, 2026

fix(addie): backfill owner test history + stop dual-write for owner runs #4264

Open

6 tasks

nit: comment on triggered_org_id TEXT rationale; call out last_test_s…

1103d56

…cenario semantic shift in changeset

bokelley mentioned this pull request May 11, 2026

fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys #4364

Merged

bokelley mentioned this pull request May 11, 2026

fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250

Merged

EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr3-backfill-and-drop-test-history branch from 0f0104c to 5618e44 Compare May 11, 2026 13:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(compliance): derive agent_context.last_test_* from canonical runs#4268

feat(compliance): derive agent_context.last_test_* from canonical runs#4268
EmmaLouise2018 wants to merge 7 commits into
EmmaLouise2018/unification-pr3-backfill-and-drop-test-historyfrom
EmmaLouise2018/unification-pr4-collapse-last-test-columns

EmmaLouise2018 commented May 8, 2026

Uh oh!

bokelley commented May 9, 2026

Uh oh!

bokelley commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

EmmaLouise2018 commented May 8, 2026

Summary

Why

What changes

Backward compat

Stacked on

Test plan

Uh oh!

bokelley commented May 9, 2026

Uh oh!

bokelley commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants