feat(eval-harness): add scenario-driven eval harness package #33
Open
ankitdas-volgapartners wants to merge 4 commits into
Introduces packages/eval-harness with src, bin, registry, schemas (zod), drivers (live, fake, scripted, attach, chaos), checks (including LLM judge with OPENAI_BASE_URL warning), reporters (console, jsonl, markdown), and the managed-live runner with fixed chaos timer lifecycle. Wires the workspace into the root build chain and adds *.eval-results/ to the root gitignore.
Ships JSON scenarios across single-turn, multi-turn, durable trajectory, safety, meta (prompt-variant), agent-behavior, and live e2e categories. Provides bundled run plans (smoke, critical-path, all, nightly, durable-cross-model, attach-live, plus live-* compatibility aliases) that the run-eval CLI resolves via --run=<name>.
Covers schema parsing, check evaluation, registry wiring, reporters (including redaction), CLI argument handling, the managed live runner, the live and pilotswarm drivers, the LLM judge provider, custom plugin extensions, agent inventory, and the bundled scenario corpus. Adds run_eval_harness_tests to scripts/run-tests.sh with a SKIP_EVAL_HARNESS_TESTS=1 escape hatch.
…docs Adds packages/eval-harness/README.md plus the QUICKSTART, SCHEMA, PLUGINS, DOWNSTREAM-GUIDE, and TROUBLESHOOTING docs. Records the eval harness in docs/proposals-impl/eval-harness.md and indexes it in docs/proposals-impl/README.md. Ships the downstream builder-agent skill at templates/builder-agents/skills/eval-harness with a working scenario, run config, and plugin example, and links it from the builder-agents README.
Summary
Adds packages/eval-harness/, a scenario-driven eval harness for PilotSwarm agents, tool use, durable waits, session recovery, safety prompts, prompt variants, and model/run regressions. Ships managed live execution with CMS/tool evidence capture, schema-validated configs (zod), bundled run plans, file reporters, and a downstream builder-agent skill template.
The package is purely additive: nothing in packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, or packages/ui-* is touched. The harness creates its own PilotSwarmClient / PilotSwarmWorker pool internally and runs against ephemeral schemas in the same Postgres a worker uses.
Motivation
PilotSwarm has rich durable execution semantics (waits, dehydration, hydration, worker restart, sub-agents, CMS evidence) but no first-class eval harness to gate behavior across model/prompt/tool changes. Regressions in turn execution, tool sequencing, or durable trajectory state surface as flaky integration tests or get caught only in downstream apps. Downstream apps face the same gap: every team builds its own eval scaffolding.
This PR brings:
- … during-wait and after-tool-call-N injection), and tears down ephemeral schemas, all from a single CLI invocation.
- … run config -> manifest -> scenario config) so behavior overrides cannot creep into the wrong layer. Manifests can only select scenarios and add tags. CLI flags override run config only and never rewrite scenario config.
- … reason → evidence → issues → verdict → confidence. No numeric score, no pass threshold.
- … pilotswarm-eval-harness (MIT) with a run-eval bin, a builder-agent skill template, and a downstream guide so app teams drop in their own scenarios in minutes.
What's added
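The layering rule in the "This PR brings" list (run config -> manifest -> scenario config) can be illustrated with a small sketch. This is not the harness's code: the RunConfig, Manifest, and ScenarioConfig shapes and both function names are hypothetical stand-ins for whatever src/schema/ actually defines. It only shows the two constraints the description states: CLI flags touch run config only, and a manifest can select scenarios and add tags but nothing else.

```typescript
// Hypothetical shapes for illustration only; the real schemas live in src/schema/.
interface RunConfig {
  driver: string;
  model?: string;
  timeoutMs?: number;
}

interface Manifest {
  scenarios: string[]; // ids of the scenarios to select
  tags?: string[];     // extra tags to attach to every selected scenario
}

interface ScenarioConfig {
  id: string;
  prompt: string;
  tags: string[];
}

// CLI flags override run config only. Undefined flags never erase base values.
function resolveRunConfig(base: RunConfig, flags: Partial<RunConfig>): RunConfig {
  const merged: RunConfig = { ...base };
  if (flags.driver !== undefined) merged.driver = flags.driver;
  if (flags.model !== undefined) merged.model = flags.model;
  if (flags.timeoutMs !== undefined) merged.timeoutMs = flags.timeoutMs;
  return merged;
}

// A manifest may select scenarios and add tags. Every other scenario field
// is copied through unchanged, so behavior cannot be rewritten at this layer.
function applyManifest(all: ScenarioConfig[], manifest: Manifest): ScenarioConfig[] {
  const selected = new Set(manifest.scenarios);
  const extraTags = manifest.tags ?? [];
  return all
    .filter((scenario) => selected.has(scenario.id))
    .map((scenario) => ({
      ...scenario,
      tags: Array.from(new Set([...scenario.tags, ...extraTags])),
    }));
}
```

Keeping the two functions separate mirrors the intent of the rule: there is no code path through which a flag or a manifest entry could reach a scenario's prompt or checks.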
Drivers
- live: DATABASE_URL, GITHUB_TOKEN, Postgres
- fake: (--fake or --driver=fake)
- scripted
- attach (alias pilotswarm)
- chaos: live
Bundled run plans
| Run plan | Driver |
| --- | --- |
| smoke | live |
| critical-path | live |
| all | live |
| nightly | live |
| durable-cross-model | live |
| attach-live | attach |

live-smoke, live-critical-path, live-all, live-e2e are backward-compat aliases.
Surface highlights
- … src/index.ts): discoverScenarios, runManifest, runScenario, evaluateCheck, evaluateChecks, register{ScenarioKind,CheckType,Tool,Driver,Reporter}, all schemas + types. Internal helpers stay internal so v0.1.0 is the stable surface.
- … REPORT.md, summary.json, run-config.json (redacted effective config), machine/results.jsonl, plus per-scenario README.md / result.json / timeline.md / transcript.md / cms-events.json / tool-calls.json / agent-sessions.json.
- … redactForArtifact strips API keys, tokens, cookies, and DB credentials (store, databaseurl, connectionstring, dsn, pgpassword, etc.).
- … OPENAI_API_KEY is sent.
What's not changing
- packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, packages/ui-core/, packages/ui-react/: untouched.
- packages/sdk/test/local/ and packages/mcp-server/ are unchanged.
- pilotswarm-sdk peer dependency satisfied by current SDK version (^0.1.29).

| File | Change |
| --- | --- |
| package.json build chain | --workspace=pilotswarm-eval-harness |
| scripts/run-tests.sh | run_eval_harness_tests before SDK suites (escape hatch: SKIP_EVAL_HARNESS_TESTS=1) |
| .gitignore | *.eval-results/ |

How to try it
From the repo
From a downstream app
packages/eval-harness/docs/DOWNSTREAM-GUIDE.md walks through the full downstream setup; templates/builder-agents/skills/eval-harness/SKILL.md is the matching builder-agent skill.
Testing
- npm run build --workspace=pilotswarm-eval-harness: clean (NodeNext, strict).
- npm test --workspace=pilotswarm-eval-harness: 111/111 passing across 19 vitest files (~1.7s).

Live runs (--run=smoke, --run=critical-path, --run=all) need Postgres + GITHUB_TOKEN and were not exercised in this session; recommend a reviewer or CI cover those before merge.
Reviewer guidance
Suggested review order:
1. src/schema/{config,manifest,scenario}.ts: the contract surface; everything downstream of discovery validates against these.
2. src/engine/managed-live-runner.ts: worker pool ownership, isolation modes, chaos controller (timer lifecycle on the error path is the highest-risk surface).
3. src/engine/run-manifest.ts + src/engine/discover.ts: orchestration entry and manifest cycle / glob handling.
4. src/checks/llm-judge.ts: provider routing, evidence-first prompt, OPENAI_BASE_URL handling.
5. src/reporters/output.ts: redaction patterns (API keys, tokens, DB credentials) and result bundle layout.
6. src/index.ts: public API surface; confirm nothing internal leaks.
7. bin/run-eval.{sh,ts}: CLI flags, exit codes, plugin loader.
8. templates/builder-agents/skills/eval-harness/: downstream skill template; confirm it matches the canonical CLI flags.
Out of scope / follow-ups
- … worker-restart (worker-crash, child-crash, tool-timeout, dehydrate-now) are schema-reserved but rejected at runtime in v1.
- … docs/proposals-impl/eval-harness.md but not enabled by default.
- … live-smoke, live-critical-path, live-all, live-e2e) once downstream pins migrate.
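
For reviewers looking at the chaos controller's timer lifecycle on the error path (flagged in the suggested review order), the property to check can be sketched with try/finally. This is a generic pattern, not the code in src/engine/managed-live-runner.ts: the withChaosTimer helper and its parameters are hypothetical, and the actual implementation may be structured differently. The point is that a scheduled fault injection is cancelled whether the scenario resolves or throws, so no timer fires into a torn-down worker pool or keeps the process alive.

```typescript
// Hypothetical helper: schedules `inject` after `delayMs` while `task` runs,
// and always cancels the pending timer once `task` settles.
async function withChaosTimer<T>(
  delayMs: number,
  inject: () => void,
  task: () => Promise<T>,
): Promise<T> {
  const timer = setTimeout(inject, delayMs);
  try {
    return await task();
  } finally {
    // Runs on both the success and the error path.
    clearTimeout(timer);
  }
}
```

A test for this property would run a task that rejects before delayMs elapses, wait past the delay, and assert that inject was never called.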