fix(addie): backfill owner test history + stop dual-write for owner runs#4264
fix(addie): backfill owner test history + stop dual-write for owner runs#4264EmmaLouise2018 wants to merge 2 commits into
Conversation
|
Code review (expert pass): block — three issues that need addressing before this lands. 1. Backfill is not chunked or transactional safety-netted. 2. Status-priority contract drift between PR and #4250's test. 3. Partial dual-write window is unguarded. Add a guard: #4263 is clean and #4268 is solid; this PR is the chain blocker. |
47b26d4 to
fd7d406
Compare
…o keys (#4364) * fix(compliance): rewrite deriveStoryboardStatuses for SDK 6.x scenario keys The compliance heartbeat has been writing zero rows to agent_storyboard_status since the SDK switched comply() to storyboard- driven testing. The SDK emits one TestResult per phase of each storyboard, keyed `<storyboard_id>/<phase_id>` in result.tracks[].scenarios[].scenario (see @adcp/sdk compliance/storyboard-tracks.ts). The old implementation walked the YAML's per-step `comply_scenario` field (bare names like `signals_flow`, `capability_discovery`) and looked them up in the SDK's scenario map. Every lookup missed → testedCount === 0 → every storyboard skipped at the `continue` guard. Effect across the registry: agent_storyboard_status total rows: 6 (across 4 agents) rows written by triggered_by='heartbeat': 0 rows surviving were legacy bare-name keys from old manual runs This silently broke the AAO Verified badge pipeline (no storyboard rows → deriveVerificationStatus has nothing to verify against) and every agent's dashboard `storyboards_passing: 0 / N` was misleading: the runner wasn't failing storyboards, the parser was dropping them. Surfaced by escalation #329: Evgeny's agent was running 30/30 scenarios clean but showing `degraded` because specialism_status.signal-owned read 'untested' from a never-populated agent_storyboard_status row. Fix: read SDK output directly. Group scenarios by storyboard id, roll per-step pass counts up from each phase's `steps` array, fall back to phase-level counts when steps are absent. The `storyboardIds` override is preserved for explicit-IDs callers that need an `untested` entry when the runner didn't run a requested storyboard. The unused YAML `comply_scenario` field is no longer load-bearing for status mapping (the SDK already knows which storyboards it ran). Tests: 9 cases covering all-pass, partial, all-fail, phase-only fallback, legacy bare-name skip, empty input, and explicit-IDs untested gap. Stack note: this is orthogonal to Emma's #4247 compliance-state unification stack (#4250, #4263, #4264, #4268, #4274) which collapses agent_test_history into agent_compliance_runs. Different files; rebases cleanly in either order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(scripts): test-comply-storyboard-statuses — local harness for the fix Runs comply() against an agent URL and prints what deriveStoryboardStatuses would produce, without DB writes. Used to validate the SDK-6.x scenario-key fix against real agents (adcp-signals-adaptor.evgeny-193.workers.dev/mcp and wonderstruck.sales-agent.scope3.com/mcp) before merging. Will stay useful for future SDK upgrades that touch scenario emission or storyboard-track aggregation — same pattern as the diagnose-agent-comply-queue script from #4361. Usage: npx tsx server/src/scripts/test-comply-storyboard-statuses.ts <agent-url> [<agent-url> ...] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(compliance): code review nits — clarify steps doc, hoist explicit-ids check, add 3 edge tests Addresses code-reviewer feedback on PR #4364: - JSDoc on deriveStoryboardStatuses now calls out that steps_passed/total are not directly comparable across rows (some rows are real step counts, some are phase-level fallbacks when the SDK omits per-step data). - Comment pinning the storyboard-id invariant (flat ids, no `/`) so the indexOf split stays correct as new storyboards land. - Defensive `result.tracks ?? []` so a malformed result doesn't throw. - Hoist `storyboardIds && length > 0` into a single `hasExplicitIds` const used at both the toEmit decision and the no-data fallback. - Three new test cases: * same storyboard split across multiple tracks aggregates correctly * result.tracks absent → [] * non-string scenario values (null, number) → skipped without throwing 12/12 vitest passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a40cddd to
48f96aa
Compare
PR 3 of the #4247 unification stack. Migration 472 backfills every agent_test_history row with a user_id into agent_compliance_runs as triggered_by='owner_test', carrying the source id in observations_json.backfill_source_id for idempotent re-runs. Each agent's latest backfilled row upserts into agent_compliance_status so the dashboard immediately reflects a real verdict for agents tested through Addie pre-PR-#4250. evaluate_agent_quality stops calling recordTest() when the caller owns the agent — that was the dual-write that #4247 is closing. recordTest is retained ONLY for third-party runs so strangers testing someone else's agent still have a session-scoped audit trail. Drop of agent_test_history table is deferred behind the 14-day soak from #4250 + 7-day soak from #4263 + S3 cold-storage export of non-owner rows. Migration 472 documents this in its trailing comment. Stacked on #4263 → #4250.
0f0104c to
5618e44
Compare
…drop status upsert Addresses code review on PR #4264. 1. Move backfill out of DDL migration into server/src/scripts/. A bulk INSERT in a release-command migration holds locks for the duration on any prod-sized agent_test_history. The script chunks by primary-key id ascending (default 1000 rows / chunk, 100ms sleep) so heartbeat and runtime writes never wait on a long-running lock. Tunable via --chunk-size / --sleep-ms. --dry-run counts eligible rows without writing. 2. Add cutover guard so the script doesn't double-write owner runs that landed via PR #4250's runtime path during the PR-#4250-deployed-but-script-not-yet-run window. The guard skips any agent_test_history row with started_at >= MIN(tested_at) where triggered_by='owner_test' in agent_compliance_runs. Fresh deploys with no canonical owner_test rows treat cutover as +infinity and backfill all eligible rows. 3. Drop the agent_compliance_status upsert. Backfilling historical history rows should not retroactively change "current" status; the runtime canonical write path maintains that table. Resolves the contract-drift concern between PR #4250's last-write-wins test and the previous first-newer-wins guard in the migration — the script only writes to the history table, leaving the status table's runtime contract intact. Idempotency unchanged: observations_json.backfill_source_id + WHERE NOT EXISTS guard means re-runs are no-ops on already-backfilled rows.
PR 3 of the #4247 unification stack. Stacked on #4263 → #4250.
Summary
Two coupled changes that close the dual-write bug #4247 is targeting:
Backfill historical owner-triggered tests into canonical state. Migration
472_backfill_agent_test_history_to_compliance_runs.sqlcopies every `agent_test_history` row with `user_id IS NOT NULL` into `agent_compliance_runs` as `triggered_by = 'owner_test'`, carrying the source row id in `observations_json.backfill_source_id` for idempotent re-runs (`WHERE NOT EXISTS` guard). Each agent's most-recent backfilled row also upserts into `agent_compliance_status` — last-write-wins on `(agent_url)` per the contract pinned in #4250's race test — so the dashboard's compliance tile immediately reflects a real verdict for any agent that was tested through Addie pre-PR-#4250 and never ran the heartbeat.Stop the dual write for owner runs. `evaluate_agent_quality` no longer calls `agentContextDb.recordTest()` when the caller owns the agent — that's exactly the dual-write that #4247 is closing. The legacy `recordTest` call is retained ONLY for third-party runs so a stranger who tests someone else's agent still has a session-scoped audit trail in their own `agent_test_history`. Owner-triggered runs persist exclusively to canonical state going forward.
Why
PR #4250 made owner runs write to the canonical tables but kept the legacy `recordTest` write for backward compatibility. Two writers per owner run → drift surface. This PR removes the legacy write for owner runs and backfills the rows that were already written through the legacy path so we don't lose history.
Brian's bar: don't lose data silently. Migration 472's idempotent guard + `backfill_source_id` provenance means a re-run is a no-op and any future audit can trace back to the original `agent_test_history` row.
Out of scope (deferred)
Migration 472 documents this in its trailing comment.
Stacked on
Merge order: #4250 → #4263 → this PR.
Test plan