Skip to content

fix(addie): backfill owner test history + stop dual-write for owner runs#4264

Open
EmmaLouise2018 wants to merge 2 commits into
mainfrom
EmmaLouise2018/unification-pr3-backfill-and-drop-test-history
Open

fix(addie): backfill owner test history + stop dual-write for owner runs#4264
EmmaLouise2018 wants to merge 2 commits into
mainfrom
EmmaLouise2018/unification-pr3-backfill-and-drop-test-history

Conversation

@EmmaLouise2018
Copy link
Copy Markdown
Contributor

PR 3 of the #4247 unification stack. Stacked on #4263#4250.

Summary

Two coupled changes that close the dual-write bug #4247 is targeting:

Backfill historical owner-triggered tests into canonical state. Migration 472_backfill_agent_test_history_to_compliance_runs.sql copies every `agent_test_history` row with `user_id IS NOT NULL` into `agent_compliance_runs` as `triggered_by = 'owner_test'`, carrying the source row id in `observations_json.backfill_source_id` for idempotent re-runs (`WHERE NOT EXISTS` guard). Each agent's most-recent backfilled row also upserts into `agent_compliance_status` — last-write-wins on `(agent_url)` per the contract pinned in #4250's race test — so the dashboard's compliance tile immediately reflects a real verdict for any agent that was tested through Addie pre-PR-#4250 and never ran the heartbeat.

Stop the dual write for owner runs. `evaluate_agent_quality` no longer calls `agentContextDb.recordTest()` when the caller owns the agent — that's exactly the dual-write that #4247 is closing. The legacy `recordTest` call is retained ONLY for third-party runs so a stranger who tests someone else's agent still has a session-scoped audit trail in their own `agent_test_history`. Owner-triggered runs persist exclusively to canonical state going forward.

Why

PR #4250 made owner runs write to the canonical tables but kept the legacy `recordTest` write for backward compatibility. Two writers per owner run → drift surface. This PR removes the legacy write for owner runs and backfills the rows that were already written through the legacy path so we don't lose history.

Brian's bar: don't lose data silently. Migration 472's idempotent guard + `backfill_source_id` provenance means a re-run is a no-op and any future audit can trace back to the original `agent_test_history` row.

Out of scope (deferred)

Stacked on

Merge order: #4250#4263 → this PR.

Test plan

  • `tsc --noEmit -p server/tsconfig.json` clean
  • Migration is idempotent — re-running on a database with a partial backfill is a no-op via the `WHERE NOT EXISTS` guard on `backfill_source_id`
  • Staging dry-run: count owner-triggered `agent_test_history` rows pre-migration vs `agent_compliance_runs` rows added — should match exactly
  • Smoke after deploy: run `evaluate_agent_quality` as owner, confirm `agent_compliance_runs` gets a `triggered_by='owner_test'` row AND `agent_test_history` does NOT get a new row
  • Smoke after deploy: third-party runs still write to `agent_test_history` (session audit trail preserved)
  • `/dashboard/agents` history panel shows historical "Your test" badges from the backfill (proves the migration landed and the rendered `triggered_by` is correct)

@bokelley
Copy link
Copy Markdown
Contributor

bokelley commented May 9, 2026

Code review (expert pass): block — three issues that need addressing before this lands.

1. Backfill is not chunked or transactional safety-netted.
server/src/db/migrations/472_backfill_agent_test_history_to_compliance_runs.sql:31-58 is a single bulk INSERT … SELECT. On any prod-sized agent_test_history, this holds locks the whole time and is a single point of failure. Per repo convention (see feedback_prod_runnable_scripts_path.md precedent), prod backfill belongs in server/src/scripts/ with chunked windows + dry-run, not a release-command migration. At minimum, chunk by started_at window inside a DO $$ block.

2. Status-priority contract drift between PR and #4250's test.
Migration 472:80-122 implements first-newer-wins via last_checked_at < EXCLUDED.last_checked_at guards on the upsert. PR #4250's compliance-db-last-write-wins.test.ts asserts plain last-write-wins. The PR body says "heartbeat always wins on freshness" — pick one. The test in #4250 will pass at #4250 time and start failing the contract this migration enforces. Reconcile before merge.

3. Partial dual-write window is unguarded.
Between #4250 deploy (canonical writes start) and this PR's deploy (backfill runs), owner runs write to BOTH canonical and agent_test_history. The migration's WHERE NOT EXISTS … backfill_source_id guard correctly skips already-backfilled rows — but owner runs in that window have no backfill_source_id linkage, so they'll be inserted twice into agent_compliance_runs (once by #4250's live write path, once by the backfill).

Add a guard: AND ath.started_at < (SELECT MIN(tested_at) FROM agent_compliance_runs WHERE triggered_by='owner_test') or equivalent — only backfill rows from before the canonical-write cutover.

#4263 is clean and #4268 is solid; this PR is the chain blocker.

@EmmaLouise2018 EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from 47b26d4 to fd7d406 Compare May 9, 2026 01:07
bokelley added a commit that referenced this pull request May 11, 2026
…o keys (#4364)

* fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys

The compliance heartbeat has been writing zero rows to
agent_storyboard_status since the SDK switched comply() to storyboard-
driven testing. The SDK emits one TestResult per phase of each storyboard,
keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario
(see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation
walked the YAML's per-step `comply_scenario` field (bare names like
`signals_flow`, `capability_discovery`) and looked them up in the SDK's
scenario map. Every lookup missed → testedCount === 0 → every storyboard
skipped at the `continue` guard.

Effect across the registry:
  agent_storyboard_status total rows: 6  (across 4 agents)
  rows written by triggered_by='heartbeat': 0
  rows surviving were legacy bare-name keys from old manual runs

This silently broke the AAO Verified badge pipeline (no storyboard rows
→ deriveVerificationStatus has nothing to verify against) and every
agent's dashboard `storyboards_passing: 0 / N` was misleading: the
runner wasn't failing storyboards, the parser was dropping them.

Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios
clean but showing `degraded` because specialism_status.signal-owned read
'untested' from a never-populated agent_storyboard_status row.

Fix: read SDK output directly. Group scenarios by storyboard id, roll
per-step pass counts up from each phase's `steps` array, fall back to
phase-level counts when steps are absent. The `storyboardIds` override
is preserved for explicit-IDs callers that need an `untested` entry
when the runner didn't run a requested storyboard. The unused YAML
`comply_scenario` field is no longer load-bearing for status mapping
(the SDK already knows which storyboards it ran).

Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback,
legacy bare-name skip, empty input, and explicit-IDs untested gap.

Stack note: this is orthogonal to Emma's #4247 compliance-state
unification stack (#4250, #4263, #4264, #4268, #4274) which collapses
agent_test_history into agent_compliance_runs. Different files; rebases
cleanly in either order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(scripts): test-comply-storyboard-statuses — local harness for the fix

Runs comply() against an agent URL and prints what
deriveStoryboardStatuses would produce, without DB writes. Used to
validate the SDK-6.x scenario-key fix against real agents
(adcp-signals-adaptor.evgeny-193.workers.dev/mcp and
wonderstruck.sales-agent.scope3.com/mcp) before merging.

Will stay useful for future SDK upgrades that touch scenario emission
or storyboard-track aggregation — same pattern as the
diagnose-agent-comply-queue script from #4361.

Usage:
  npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests

Addresses code-reviewer feedback on PR #4364:
- JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total
  are not directly comparable across rows (some rows are real step counts,
  some are phase-level fallbacks when the SDK omits per-step data).
- Comment pinning the storyboard-id invariant (flat ids, no `/`) so the
  indexOf split stays correct as new storyboards land.
- Defensive `result.tracks ?? []` so a malformed result doesn't throw.
- Hoist `storyboardIds && length > 0` into a single `hasExplicitIds`
  const used at both the toEmit decision and the no-data fallback.
- Three new test cases:
  * same storyboard split across multiple tracks aggregates correctly
  * result.tracks absent → []
  * non-string scenario values (null, number) → skipped without throwing

12/12 vitest passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch 2 times, most recently from a40cddd to 48f96aa Compare May 11, 2026 10:00
Base automatically changed from EmmaLouise2018/unification-pr2-dashboard-triggered-by to main May 11, 2026 10:04
PR 3 of the #4247 unification stack.

Migration 472 backfills every agent_test_history row with a user_id
into agent_compliance_runs as triggered_by='owner_test', carrying the
source id in observations_json.backfill_source_id for idempotent re-runs.
Each agent's latest backfilled row upserts into agent_compliance_status
so the dashboard immediately reflects a real verdict for agents tested
through Addie pre-PR-#4250.

evaluate_agent_quality stops calling recordTest() when the caller owns
the agent — that was the dual-write that #4247 is closing. recordTest
is retained ONLY for third-party runs so strangers testing someone
else's agent still have a session-scoped audit trail.

Drop of agent_test_history table is deferred behind the 14-day soak
from #4250 + 7-day soak from #4263 + S3 cold-storage export of
non-owner rows. Migration 472 documents this in its trailing comment.

Stacked on #4263#4250.
@EmmaLouise2018 EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr3-backfill-and-drop-test-history branch from 0f0104c to 5618e44 Compare May 11, 2026 13:55
…drop status upsert

Addresses code review on PR #4264.

1. Move backfill out of DDL migration into server/src/scripts/. A bulk
   INSERT in a release-command migration holds locks for the duration on
   any prod-sized agent_test_history. The script chunks by primary-key
   id ascending (default 1000 rows / chunk, 100ms sleep) so heartbeat and
   runtime writes never wait on a long-running lock. Tunable via
   --chunk-size / --sleep-ms. --dry-run counts eligible rows without
   writing.

2. Add cutover guard so the script doesn't double-write owner runs that
   landed via PR #4250's runtime path during the
   PR-#4250-deployed-but-script-not-yet-run window. The guard skips any
   agent_test_history row with started_at >= MIN(tested_at) where
   triggered_by='owner_test' in agent_compliance_runs. Fresh deploys
   with no canonical owner_test rows treat cutover as +infinity and
   backfill all eligible rows.

3. Drop the agent_compliance_status upsert. Backfilling historical
   history rows should not retroactively change "current" status; the
   runtime canonical write path maintains that table. Resolves the
   contract-drift concern between PR #4250's last-write-wins test and
   the previous first-newer-wins guard in the migration — the script
   only writes to the history table, leaving the status table's runtime
   contract intact.

Idempotency unchanged: observations_json.backfill_source_id + WHERE NOT
EXISTS guard means re-runs are no-ops on already-backfilled rows.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants