chore(compliance): drop agent_test_history + last_test_* — final cleanup by EmmaLouise2018 · Pull Request #4274 · adcontextprotocol/adcp

EmmaLouise2018 · 2026-05-08T23:07:07Z

PR 5 (final cleanup) of the #4247 unification stack. Stacked on #4268 → #4264 → #4263 → #4250.

Summary

Drops the dual-write infrastructure now that PR #4250 writes canonical, PR #4264 backfilled history, and PR #4268 derives `last_test_*` via the `agent_context_with_latest_test` view. Closes #4247.

Pre-merge gate (load-bearing — destructive migration)

DO NOT MERGE until each is satisfied:

1. PR #4250 (fix(addie): owner evaluate_agent_quality writes to canonical compliance state) has been live in prod for ≥ 14 days with zero canonical-write incidents (no malformed `agent_compliance_status` row, no flap reports)
2. PR #4263 (feat(dashboard): surface verdict_source + per-run triggered_by badge) has been live in prod for ≥ 7 days, with the dashboard rendering identical verdicts via the view-derived path on a hand-audited sample
3. Migration 472 from PR #4264 (fix(addie): backfill owner test history + stop dual-write for owner runs) has run; row-count delta on staging is ±0 (every owner-triggered `agent_test_history` row backfilled into `agent_compliance_runs`)
4. S3 cold-storage export of third-party (`user_id IS NULL`) `agent_test_history` rows is complete, and export evidence is committed to the ops runbook. The reversibility path is the export, not pg_dump
5. The view + reader migration from PR #4268 (feat(compliance): derive agent_context.last_test_* from canonical runs) is confirmed working in prod

The migration is destructive and irreversible. The S3 export from gate (4) is the recovery path.

What changes

Migration 474:
- Redefines `agent_context_summary` view without references to the dropped table/columns (history counts now derive from `agent_compliance_runs`)
- Drops `agent_contexts.last_test_*` columns
- Drops `agent_test_history` table
- Refreshes `agent_context_with_latest_test` so the `ac.*` projection no longer carries the removed columns
`agent-context-db.ts`:
- Removes `recordTest`, `getTestHistory`, `getLatestTestForUser`
- Removes `AgentTestHistory` and `RecordTestInput` interfaces
- `update()` no longer SETs `last_test_*`; refetches via `getById()` after the UPDATE so derived view fields stay populated for callers
`evaluate_agent_quality`: drops the third-party `recordTest()` call. Non-owner runs are session-scoped — they return results in the response and do not persist
`run_storyboard`: drops its `recordTest()` call. Single-storyboard runs are session-scoped (canonical writes for single storyboards would over-state coverage)
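
The `update()` change listed above can be pictured as a write to the base table followed by a read through the view. The following is a minimal sketch, not the code in `agent-context-db.ts`: the `Queryable` interface, the `agent_name` column, and the `id` key are assumptions made for illustration; only the table and view names come from this PR.

```typescript
// Hypothetical minimal query interface; the real module uses its own pool.
interface Queryable {
  query(sql: string, params: unknown[]): Promise<{ rows: Record<string, unknown>[] }>;
}

// Read through the view so the derived last_test_* fields are included.
async function getById(db: Queryable, id: string): Promise<Record<string, unknown> | null> {
  const r = await db.query(
    'SELECT * FROM agent_context_with_latest_test WHERE id = $1',
    [id],
  );
  return r.rows[0] ?? null;
}

// Update a stored column only (no last_test_* in the SET list), then refetch.
// `agent_name` is an illustrative column, not confirmed by the PR.
async function update(
  db: Queryable,
  id: string,
  agentName: string,
): Promise<Record<string, unknown> | null> {
  await db.query('UPDATE agent_contexts SET agent_name = $2 WHERE id = $1', [id, agentName]);
  return getById(db, id);
}
```

The cost is a second round-trip per update; the benefit is that callers keep receiving the same `AgentContext` shape without the base table storing any test state.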

Behavior change (named per Brian's bar)

Third-party `evaluate_agent_quality` runs against someone else's agent no longer leave any persistent state. This matches the "owner-only canonical writes" policy from issue #4247, "Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests)"
`run_storyboard` runs (any caller) no longer leave persistent state. The dashboard's "tested at" timestamps for an org reflect only `evaluate_agent_quality` runs (full comply suite). Single-storyboard runs are exploratory tooling
The `AgentContext` shape on read remains unchanged — `last_test_*` fields still appear in the response, sourced from the view
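
The behavior changes above amount to a small persistence rule. A minimal sketch, assuming ownership is decided by comparing the caller's org with the agent's owning org (the PR does not show how ownership is actually determined); `RunRequest` and `shouldPersistRun` are hypothetical names.

```typescript
type Tool = 'evaluate_agent_quality' | 'run_storyboard';

// Hypothetical request shape; field names are illustrative.
interface RunRequest {
  tool: Tool;
  callerOrgId: string;
  agentOwnerOrgId: string;
}

// Returns true only when a run should write to agent_compliance_runs.
function shouldPersistRun(req: RunRequest): boolean {
  // Single-storyboard runs are session-scoped for every caller.
  if (req.tool === 'run_storyboard') return false;
  // Full-suite runs persist only when the caller owns the agent.
  return req.callerOrgId === req.agentOwnerOrgId;
}
```

For example, `shouldPersistRun({ tool: 'evaluate_agent_quality', callerOrgId: 'org_b', agentOwnerOrgId: 'org_a' })` returns `false`: the third-party caller still gets results in the response, but nothing is stored.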

S3 export script (operator runs before migration)

fly ssh console -a adcp-docs
cd /app && cat > /app/export-third-party-history.mjs <<'EOF'
// One-off export of third-party (user_id IS NULL) test history as JSONL.
import pg from 'pg';
import { writeFile } from 'node:fs/promises';
const c = new pg.Client({ connectionString: process.env.DATABASE_URL });
await c.connect();
const r = await c.query(`
  SELECT id, agent_context_id, scenario, overall_passed,
         steps_passed, steps_failed, total_duration_ms,
         summary, dry_run, brief, triggered_by, user_id,
         steps_json, agent_profile_json, started_at, completed_at
    FROM agent_test_history
   WHERE user_id IS NULL
`);
// One JSON object per line.
const jsonl = r.rows.map(row => JSON.stringify(row)).join('\n');
await writeFile('/app/agent_test_history_third_party.jsonl', jsonl);
console.log(`Exported ${r.rowCount} third-party rows`);
await c.end();
EOF
node /app/export-third-party-history.mjs
# upload /app/agent_test_history_third_party.jsonl to S3 cold storage
# commit upload evidence (SHA256 + S3 path) to ops runbook
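
The last step asks for a SHA256 and an S3 path as upload evidence. One way to produce that line is sketched below using only Node's built-in `node:crypto` and `node:fs/promises`. The `evidenceLine` helper and its output format are assumptions, not part of the runbook, and the S3 URI is whatever the operator actually uploaded to.

```typescript
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

// Hex-encoded SHA-256 of a buffer or string.
function sha256Hex(data: Buffer | string): string {
  return createHash('sha256').update(data).digest('hex');
}

// Hypothetical helper: builds a "<digest>  <s3 uri>" line for the runbook.
// The two-space separator mirrors the layout printed by `sha256sum`.
async function evidenceLine(localPath: string, s3Uri: string): Promise<string> {
  const digest = sha256Hex(await readFile(localPath));
  return `${digest}  ${s3Uri}`;
}
```

Hashing the local file before upload and comparing against a hash of the downloaded object afterwards would confirm that the recovery copy is byte-identical to what was exported.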

Stacked on

feat(compliance): derive agent_context.last_test_* from canonical runs #4268 (PR 4) — derive `last_test_*` from canonical runs via view
fix(addie): backfill owner test history + stop dual-write for owner runs #4264 (PR 3) — backfill owner test history + stop dual-write
feat(dashboard): surface verdict_source + per-run triggered_by badge #4263 (PR 2) — dashboard surfaces verdict_source + triggered_by badge
fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250 (PR 1) — owner `evaluate_agent_quality` writes canonical state

Merge order: #4250 → #4263 → #4264 → #4268 → this PR.

Closes #4247.

Test plan

`tsc --noEmit -p server/tsconfig.json` clean
Migration 474 applies cleanly on staging (after migration 473 from PR feat(compliance): derive agent_context.last_test_* from canonical runs #4268)
Post-migration: `SELECT * FROM information_schema.tables WHERE table_name = 'agent_test_history'` returns 0 rows
Post-migration: `agent_context_summary` and `agent_context_with_latest_test` views still serve queries cleanly
Post-migration: `SELECT last_test_scenario, last_test_passed, last_test_summary, last_tested_at, total_tests_run FROM agent_context_with_latest_test LIMIT 5` returns view-derived values for owner-triggered rows
Smoke after deploy: third-party `evaluate_agent_quality` returns results but does NOT create rows in any table
Smoke after deploy: owner `evaluate_agent_quality` creates a row in `agent_compliance_runs` with `triggered_by='owner_test'` AND `triggered_org_id` set
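
The post-migration checks in the test plan could be scripted roughly as follows. This is a sketch under stated assumptions: `Queryable` is a hypothetical stand-in for whatever pool or client the operator uses, and `verifyMigration474` is not a function in the repo. Only the table and view names are taken from this PR.

```typescript
// Hypothetical minimal query interface (a pg Pool or Client would fit it).
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Returns a list of problems; an empty list means the checks passed.
async function verifyMigration474(db: Queryable): Promise<string[]> {
  const problems: string[] = [];

  // The dropped table must no longer be listed.
  const t = await db.query(
    'SELECT 1 FROM information_schema.tables WHERE table_name = $1',
    ['agent_test_history'],
  );
  if (t.rows.length !== 0) problems.push('agent_test_history still exists');

  // Both views must still answer a trivial query.
  // View names come from a fixed list, so interpolation is safe here.
  for (const view of ['agent_context_summary', 'agent_context_with_latest_test']) {
    try {
      await db.query(`SELECT 1 FROM ${view} LIMIT 1`);
    } catch (err) {
      problems.push(`${view} failed: ${String(err)}`);
    }
  }
  return problems;
}
```

The two smoke tests at the end of the plan (third-party runs create no rows, owner runs create one) depend on live tool calls and are not covered by this sketch.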

PR 5 of the #4247 unification stack. Removes the dual-write infrastructure now that PR #4250 writes canonically, PR #4264 backfilled history, and PR #4268 derives last_test_* via the agent_context_with_latest_test view.

Migration 474 (gate-protected — destructive):
- Redefines agent_context_summary without agent_test_history refs
- Drops agent_contexts.last_test_* columns
- Drops agent_test_history table
- Refreshes agent_context_with_latest_test

Code:
- agent-context-db.ts drops recordTest, getTestHistory, getLatestTestForUser, AgentTestHistory, RecordTestInput
- update() refetches via getById() so derived view fields stay populated
- evaluate_agent_quality drops the third-party recordTest call
- run_storyboard drops its recordTest call

Behavior change: third-party evaluate_agent_quality and any run_storyboard call no longer persist registry state. Matches the owner-only canonical writes policy from #4247.

Stacked on #4268 → #4264 → #4263 → #4250.

bokelley · 2026-05-09T00:02:44Z

Code review (expert pass): block — final cleanup needs three fixes.

1. Verify server/src/db/org-merge-db.ts is clean.
org-merge-db.ts:637 has a comment referencing agent_test_history. The merge flow may rely on ON DELETE CASCADE from agent_contexts. With the table dropped, runtime impact needs explicit confirmation — read the surrounding block before merge and either remove the dead comment or adjust the merge logic.

2. S3 export script is in the PR body, not in server/src/scripts/.
Per repo convention (see feedback_prod_runnable_scripts_path.md), prod-runnable scripts live under server/src/scripts/ and must call initializeDatabase() before getPool(). The body's heredoc bypasses both. Move it to a tracked file before merge.

3. Wrap migration 474 in an explicit transaction.
The destructive ordering (drop view → drop columns → drop table → recreate view) inside one migration file means a partial failure leaves the DB un-bootable. Wrap in BEGIN; … COMMIT;, or split into two files (drop, then recreate-view). Postgres DDL is mostly transactional but mixing column drops with view recreates needs the explicit boundary for safe rollback.
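
The boundary requested in item 3 behaves as sketched below. This is a TypeScript illustration only: migration 474 is a SQL file, and the `SqlClient` interface and `runInTransaction` helper are hypothetical. Whether the repo's migration runner already wraps each file in a transaction is not stated in this thread.

```typescript
// Hypothetical minimal client; a single pg connection would fit this shape.
interface SqlClient {
  query(sql: string): Promise<unknown>;
}

// Runs all statements atomically: either every one applies or none do.
async function runInTransaction(client: SqlClient, statements: string[]): Promise<void> {
  await client.query('BEGIN');
  try {
    for (const statement of statements) {
      await client.query(statement);
    }
    await client.query('COMMIT');
  } catch (err) {
    // Undo any DDL already applied in this transaction, then surface the error.
    await client.query('ROLLBACK');
    throw err;
  }
}
```

If the column drop or table drop fails partway, the earlier view drop is rolled back with it, so the database is never left without the view that readers depend on.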

Nit (non-blocking):
update() does UPDATE-then-getById (two round-trips). UPDATE … RETURNING with an explicit column list (excluding the dropped ones) would be one query. That is a premature optimization for now; fine to defer.

bokelley · 2026-05-09T00:08:20Z

Blocker fixes prepared — needs author to apply

All three blockers from @bokelley's review have been worked out. I can't push directly to EmmaLouise2018/* branches, so the diffs are below for @EmmaLouise2018 to apply. The nit (UPDATE … RETURNING) is deferred per reviewer guidance.

Fix 1 — org-merge-db.ts: remove dead agent_test_history reference

Two spots. The cascade concern is moot (no child tables remain); logic is unchanged.

-    // UNIQUE(organization_id, agent_url): keep primary's row on
-    // conflict so its agent_test_history (ON DELETE CASCADE) survives;
-    // secondary's history is removed when its row is deleted.
+    // UNIQUE(organization_id, agent_url): on conflict keep primary's
+    // row; secondary's conflicting rows are deleted below (no child
+    // tables remain after migration 474 dropped agent_test_history).

-        `${agentContextsDeleted.rows.length} duplicate agent_contexts from secondary org were deleted (primary already had the same agent_url) — their test history was removed`
+        `${agentContextsDeleted.rows.length} duplicate agent_contexts from secondary org were deleted (primary already had the same agent_url)`

Fix 2 — migration 474: explicit transaction

 -- the S3 export from gate (4), not pg_dump.
 
+BEGIN;
+
 -- ── Phase 1: drop the dependent view …

 ) AS run_counts ON TRUE;
+
+COMMIT;

Fix 3 — S3 export script: server/src/scripts/export-third-party-test-history.ts (new file)

import { writeFile } from 'node:fs/promises';
import { initializeDatabase, getPool, closeDatabase } from '../db/client.js';
import { getDatabaseConfig } from '../config.js';

async function main(): Promise<void> {
  const dbConfig = getDatabaseConfig();
  if (!dbConfig) { console.error('DATABASE_URL is required'); process.exit(1); }
  initializeDatabase(dbConfig);
  const pool = getPool();

  const result = await pool.query(`
    SELECT id, agent_context_id, scenario, overall_passed,
           steps_passed, steps_failed, total_duration_ms,
           summary, dry_run, brief, triggered_by, user_id,
           steps_json, agent_profile_json, started_at, completed_at
      FROM agent_test_history
     WHERE user_id IS NULL
  `);

  const jsonl = result.rows.map((row) => JSON.stringify(row)).join('\n');
  const outPath = '/app/agent_test_history_third_party.jsonl';
  await writeFile(outPath, jsonl, 'utf-8');
  console.log(`Exported ${result.rowCount} third-party rows to ${outPath}`);
  console.log('Next: upload to S3 cold storage and commit SHA256 + S3 path to ops runbook.');
  await closeDatabase();
}

main().catch((err) => { console.error(err); process.exit(1); });

The full doc-comment header and usage instructions are in the prepared commit. The initializeDatabase() → getPool() ordering follows repair-dangling-primary-orgs.ts and the other scripts in that directory.

Session: https://claude.ai/code/session_0195XGWfSj96CJCJcxhxUX6m

Generated by Claude Code

…o keys (#4364)

* fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys

The compliance heartbeat has been writing zero rows to agent_storyboard_status since the SDK switched comply() to storyboard-driven testing. The SDK emits one TestResult per phase of each storyboard, keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts).

The old implementation walked the YAML's per-step `comply_scenario` field (bare names like `signals_flow`, `capability_discovery`) and looked them up in the SDK's scenario map. Every lookup missed → testedCount === 0 → every storyboard skipped at the `continue` guard.

Effect across the registry:
- agent_storyboard_status total rows: 6 (across 4 agents)
- rows written by triggered_by='heartbeat': 0
- rows surviving were legacy bare-name keys from old manual runs

This silently broke the AAO Verified badge pipeline (no storyboard rows → deriveVerificationStatus has nothing to verify against) and every agent's dashboard `storyboards_passing: 0 / N` was misleading: the runner wasn't failing storyboards, the parser was dropping them.

Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios clean but showing `degraded` because specialism_status.signal-owned read 'untested' from a never-populated agent_storyboard_status row.

Fix: read SDK output directly. Group scenarios by storyboard id, roll per-step pass counts up from each phase's `steps` array, fall back to phase-level counts when steps are absent. The `storyboardIds` override is preserved for explicit-IDs callers that need an `untested` entry when the runner didn't run a requested storyboard. The unused YAML `comply_scenario` field is no longer load-bearing for status mapping (the SDK already knows which storyboards it ran).

Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback, legacy bare-name skip, empty input, and explicit-IDs untested gap.

Stack note: this is orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which collapses agent_test_history into agent_compliance_runs. Different files; rebases cleanly in either order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(scripts): test-comply-storyboard-statuses — local harness for the fix

Runs comply() against an agent URL and prints what deriveStoryboardStatuses would produce, without DB writes. Used to validate the SDK-6.x scenario-key fix against real agents (adcp-signals-adaptor.evgeny-193.workers.dev/mcp and wonderstruck.sales-agent.scope3.com/mcp) before merging.

Will stay useful for future SDK upgrades that touch scenario emission or storyboard-track aggregation, following the same pattern as the diagnose-agent-comply-queue script from #4361.

Usage: npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests

Addresses code-reviewer feedback on PR #4364:
- JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total are not directly comparable across rows (some rows are real step counts, some are phase-level fallbacks when the SDK omits per-step data).
- Comment pinning the storyboard-id invariant (flat ids, no `/`) so the indexOf split stays correct as new storyboards land.
- Defensive `result.tracks ?? []` so a malformed result doesn't throw.
- Hoist `storyboardIds && length > 0` into a single `hasExplicitIds` const used at both the toEmit decision and the no-data fallback.
- Three new test cases:
  * same storyboard split across multiple tracks aggregates correctly
  * result.tracks absent → []
  * non-string scenario values (null, number) → skipped without throwing

12/12 vitest passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
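
The grouping described in the #4364 commit message (split each scenario key at the first `/`, roll step counts up per storyboard, fall back to phase-level counts) can be sketched as below. This is not the SDK's or the repo's implementation: the `passed` and `steps` field names, the result types, and the `passing` / `partial` / `failing` labels are assumptions, and the `storyboardIds` override for `untested` entries is left out.

```typescript
// Assumed shapes; the real SDK types may differ.
interface StepResult { passed: boolean }
interface ScenarioResult { scenario: unknown; passed?: boolean; steps?: StepResult[] }
interface Track { scenarios: ScenarioResult[] }
interface ComplyResult { tracks?: Track[] }

type Status = 'passing' | 'partial' | 'failing';
interface StoryboardStatus {
  storyboardId: string;
  stepsPassed: number;
  stepsTotal: number;
  status: Status;
}

function deriveStatuses(result: ComplyResult): StoryboardStatus[] {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const track of result.tracks ?? []) {
    for (const s of track.scenarios) {
      const key = s.scenario;
      if (typeof key !== 'string') continue; // null, number, etc.
      const slash = key.indexOf('/');
      if (slash <= 0) continue; // legacy bare-name key, no storyboard id
      const id = key.slice(0, slash);
      const agg = totals.get(id) ?? { passed: 0, total: 0 };
      if (s.steps && s.steps.length > 0) {
        // Real per-step counts.
        agg.total += s.steps.length;
        agg.passed += s.steps.filter((step) => step.passed).length;
      } else {
        // Phase-level fallback: the phase counts as one unit.
        agg.total += 1;
        agg.passed += s.passed ? 1 : 0;
      }
      totals.set(id, agg);
    }
  }
  return Array.from(totals.entries()).map(([storyboardId, { passed, total }]) => ({
    storyboardId,
    stepsPassed: passed,
    stepsTotal: total,
    status: passed === total ? 'passing' : passed === 0 ? 'failing' : 'partial',
  }));
}
```

Because some rows mix real step counts with phase-level fallbacks, `stepsPassed` / `stepsTotal` in this sketch are, as the commit message notes for the real function, not directly comparable across storyboards.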

bokelley mentioned this pull request May 11, 2026

fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys #4364

Merged

bokelley mentioned this pull request May 11, 2026

fix(addie): owner evaluate_agent_quality writes to canonical compliance state #4250

Merged

EmmaLouise2018 wants to merge 1 commit into EmmaLouise2018/unification-pr4-collapse-last-test-columns from EmmaLouise2018/unification-pr5-final-cleanup-drop-test-history


2 participants
