fix(addie): backfill owner test history + stop dual-write for owner runs by EmmaLouise2018 · Pull Request #4264 · adcontextprotocol/adcp

EmmaLouise2018 · 2026-05-08T20:24:58Z

PR 3 of the #4247 unification stack. Stacked on #4263 → #4250.

Summary

Two coupled changes that close the dual-write bug #4247 is targeting:

Backfill historical owner-triggered tests into canonical state. Migration 472_backfill_agent_test_history_to_compliance_runs.sql copies every `agent_test_history` row with `user_id IS NOT NULL` into `agent_compliance_runs` as `triggered_by = 'owner_test'`, carrying the source row id in `observations_json.backfill_source_id` for idempotent re-runs (`WHERE NOT EXISTS` guard). Each agent's most-recent backfilled row also upserts into `agent_compliance_status` — last-write-wins on `(agent_url)` per the contract pinned in #4250's race test — so the dashboard's compliance tile immediately reflects a real verdict for any agent that was tested through Addie pre-PR-#4250 and never ran the heartbeat.

Stop the dual write for owner runs. `evaluate_agent_quality` no longer calls `agentContextDb.recordTest()` when the caller owns the agent — that's exactly the dual-write that #4247 is closing. The legacy `recordTest` call is retained ONLY for third-party runs so a stranger who tests someone else's agent still has a session-scoped audit trail in their own `agent_test_history`. Owner-triggered runs persist exclusively to canonical state going forward.

Why

PR #4250 made owner runs write to the canonical tables but kept the legacy `recordTest` write for backward compatibility. Two writers per owner run → drift surface. This PR removes the legacy write for owner runs and backfills the rows that were already written through the legacy path so we don't lose history.

Brian's bar: don't lose data silently. Migration 472's idempotent guard + `backfill_source_id` provenance means a re-run is a no-op and any future audit can trace back to the original `agent_test_history` row.

Out of scope (deferred)

Drop `agent_test_history`. Gated on:
- 14-day soak from fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250 deploy
- 7-day soak from feat(dashboard): surface verdict_source + per-run triggered_by badge #4263 deploy
- S3 cold-storage export of `user_id IS NULL` rows (the third-party / scheduled / unattributed cohort) so audit trail isn't lost
- Row-count delta verification on staging
  Migration 472 documents this in its trailing comment.
Collapse `agent_contexts.last_test_*` into a derived view — PR 4 of the Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247 stack.

Stacked on

feat(dashboard): surface verdict_source + per-run triggered_by badge #4263 (PR 2 of Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247) — dashboard surfaces verdict_source + triggered_by badge
fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250 (PR 1 of Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247) — owner evaluate_agent_quality writes to canonical compliance state

Merge order: #4250 → #4263 → this PR.

Test plan

`tsc --noEmit -p server/tsconfig.json` clean
Migration is idempotent — re-running on a database with a partial backfill is a no-op via the `WHERE NOT EXISTS` guard on `backfill_source_id`
Staging dry-run: count owner-triggered `agent_test_history` rows pre-migration vs `agent_compliance_runs` rows added — should match exactly
Smoke after deploy: run `evaluate_agent_quality` as owner, confirm `agent_compliance_runs` gets a `triggered_by='owner_test'` row AND `agent_test_history` does NOT get a new row
Smoke after deploy: third-party runs still write to `agent_test_history` (session audit trail preserved)
`/dashboard/agents` history panel shows historical "Your test" badges from the backfill (proves the migration landed and the rendered `triggered_by` is correct)

bokelley · 2026-05-09T00:02:34Z

Code review (expert pass): block — three issues that need addressing before this lands.

1. Backfill is not chunked or transactional safety-netted.
server/src/db/migrations/472_backfill_agent_test_history_to_compliance_runs.sql:31-58 is a single bulk INSERT … SELECT. On any prod-sized agent_test_history, this holds locks the whole time and is a single point of failure. Per repo convention (see feedback_prod_runnable_scripts_path.md precedent), prod backfill belongs in server/src/scripts/ with chunked windows + dry-run, not a release-command migration. At minimum, chunk by started_at window inside a DO $$ block.

2. Status-priority contract drift between PR and #4250's test.
Migration 472:80-122 implements first-newer-wins via last_checked_at < EXCLUDED.last_checked_at guards on the upsert. PR #4250's compliance-db-last-write-wins.test.ts asserts plain last-write-wins. The PR body says "heartbeat always wins on freshness" — pick one. The test in #4250 will pass at #4250 time and start failing the contract this migration enforces. Reconcile before merge.

3. Partial dual-write window is unguarded.
Between #4250 deploy (canonical writes start) and this PR's deploy (backfill runs), owner runs write to BOTH canonical and agent_test_history. The migration's WHERE NOT EXISTS … backfill_source_id guard correctly skips already-backfilled rows — but owner runs in that window have no backfill_source_id linkage, so they'll be inserted twice into agent_compliance_runs (once by #4250's live write path, once by the backfill).

Add a guard: AND ath.started_at < (SELECT MIN(tested_at) FROM agent_compliance_runs WHERE triggered_by='owner_test') or equivalent — only backfill rows from before the canonical-write cutover.

#4263 is clean and #4268 is solid; this PR is the chain blocker.

…o keys (#4364) * fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys The compliance heartbeat has been writing zero rows to agent_storyboard_status since the SDK switched comply() to storyboard- driven testing. The SDK emits one TestResult per phase of each storyboard, keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation walked the YAML's per-step `comply_scenario` field (bare names like `signals_flow`, `capability_discovery`) and looked them up in the SDK's scenario map. Every lookup missed → testedCount === 0 → every storyboard skipped at the `continue` guard. Effect across the registry: agent_storyboard_status total rows: 6 (across 4 agents) rows written by triggered_by='heartbeat': 0 rows surviving were legacy bare-name keys from old manual runs This silently broke the AAO Verified badge pipeline (no storyboard rows → deriveVerificationStatus has nothing to verify against) and every agent's dashboard `storyboards_passing: 0 / N` was misleading: the runner wasn't failing storyboards, the parser was dropping them. Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios clean but showing `degraded` because specialism_status.signal-owned read 'untested' from a never-populated agent_storyboard_status row. Fix: read SDK output directly. Group scenarios by storyboard id, roll per-step pass counts up from each phase's `steps` array, fall back to phase-level counts when steps are absent. The `storyboardIds` override is preserved for explicit-IDs callers that need an `untested` entry when the runner didn't run a requested storyboard. The unused YAML `comply_scenario` field is no longer load-bearing for status mapping (the SDK already knows which storyboards it ran). Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback, legacy bare-name skip, empty input, and explicit-IDs untested gap. Stack note: this is orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which collapses agent_test_history into agent_compliance_runs. Different files; rebases cleanly in either order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(scripts): test-comply-storyboard-statuses — local harness for the fix Runs comply() against an agent URL and prints what deriveStoryboardStatuses would produce, without DB writes. Used to validate the SDK-6.x scenario-key fix against real agents (adcp-signals-adaptor.evgeny-193.workers.dev/mcp and wonderstruck.sales-agent.scope3.com/mcp) before merging. Will stay useful for future SDK upgrades that touch scenario emission or storyboard-track aggregation — same pattern as the diagnose-agent-comply-queue script from #4361. Usage: npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests Addresses code-reviewer feedback on PR #4364: - JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total are not directly comparable across rows (some rows are real step counts, some are phase-level fallbacks when the SDK omits per-step data). - Comment pinning the storyboard-id invariant (flat ids, no `/`) so the indexOf split stays correct as new storyboards land. - Defensive `result.tracks ?? []` so a malformed result doesn't throw. - Hoist `storyboardIds && length > 0` into a single `hasExplicitIds` const used at both the toEmit decision and the no-data fallback. - Three new test cases: * same storyboard split across multiple tracks aggregates correctly * result.tracks absent → [] * non-string scenario values (null, number) → skipped without throwing 12/12 vitest passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR 3 of the #4247 unification stack. Migration 472 backfills every agent_test_history row with a user_id into agent_compliance_runs as triggered_by='owner_test', carrying the source id in observations_json.backfill_source_id for idempotent re-runs. Each agent's latest backfilled row upserts into agent_compliance_status so the dashboard immediately reflects a real verdict for agents tested through Addie pre-PR-#4250. evaluate_agent_quality stops calling recordTest() when the caller owns the agent — that was the dual-write that #4247 is closing. recordTest is retained ONLY for third-party runs so strangers testing someone else's agent still have a session-scoped audit trail. Drop of agent_test_history table is deferred behind the 14-day soak from #4250 + 7-day soak from #4263 + S3 cold-storage export of non-owner rows. Migration 472 documents this in its trailing comment. Stacked on #4263 → #4250.

…drop status upsert Addresses code review on PR #4264. 1. Move backfill out of DDL migration into server/src/scripts/. A bulk INSERT in a release-command migration holds locks for the duration on any prod-sized agent_test_history. The script chunks by primary-key id ascending (default 1000 rows / chunk, 100ms sleep) so heartbeat and runtime writes never wait on a long-running lock. Tunable via --chunk-size / --sleep-ms. --dry-run counts eligible rows without writing. 2. Add cutover guard so the script doesn't double-write owner runs that landed via PR #4250's runtime path during the PR-#4250-deployed-but-script-not-yet-run window. The guard skips any agent_test_history row with started_at >= MIN(tested_at) where triggered_by='owner_test' in agent_compliance_runs. Fresh deploys with no canonical owner_test rows treat cutover as +infinity and backfill all eligible rows. 3. Drop the agent_compliance_status upsert. Backfilling historical history rows should not retroactively change "current" status; the runtime canonical write path maintains that table. Resolves the contract-drift concern between PR #4250's last-write-wins test and the previous first-newer-wins guard in the migration — the script only writes to the history table, leaving the status table's runtime contract intact. Idempotency unchanged: observations_json.backfill_source_id + WHERE NOT EXISTS guard means re-runs are no-ops on already-backfilled rows.

This was referenced May 8, 2026

feat(compliance): derive agent_context.last_test_* from canonical runs #4268

Open

chore(compliance): drop agent_test_history + last_test_* — final cleanup #4274

Open

bokelley mentioned this pull request May 9, 2026

fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250

Merged

bokelley mentioned this pull request May 9, 2026

feat(dashboard): surface verdict_source + per-run triggered_by badge #4263

Merged

5 tasks

EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from 47b26d4 to fd7d406 Compare May 9, 2026 01:07

This was referenced May 9, 2026

ci(release): forward-merge 3.0.x → main silently drops all 3.0.x changes since allowlist landed #4306

Closed

fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys #4364

Merged

bokelley force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch 2 times, most recently from a40cddd to 48f96aa Compare May 11, 2026 10:00

Base automatically changed from EmmaLouise2018/unification-pr2-dashboard-triggered-by to main May 11, 2026 10:04

EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr3-backfill-and-drop-test-history branch from 0f0104c to 5618e44 Compare May 11, 2026 13:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(addie): backfill owner test history + stop dual-write for owner runs#4264

fix(addie): backfill owner test history + stop dual-write for owner runs#4264
EmmaLouise2018 wants to merge 2 commits into
mainfrom
EmmaLouise2018/unification-pr3-backfill-and-drop-test-history

EmmaLouise2018 commented May 8, 2026

Uh oh!

bokelley commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EmmaLouise2018 commented May 8, 2026

Summary

Why

Out of scope (deferred)

Stacked on

Test plan

Uh oh!

bokelley commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants