feat(eval-harness): add scenario-driven eval harness package by ankitdas-volgapartners · Pull Request #33 · affandar/PilotSwarm

ankitdas-volgapartners · 2026-05-20T05:57:36Z

Summary

Adds packages/eval-harness/ — a scenario-driven eval harness for PilotSwarm agents, tool use, durable waits, session recovery, safety prompts, prompt variants, and model/run regressions. Ships managed live execution with CMS/tool evidence capture, schema-validated configs (zod), bundled run plans, file reporters, and a downstream builder-agent skill template.

The package is purely additive: nothing in packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, or packages/ui-* is touched. The harness creates its own PilotSwarmClient/PilotSwarmWorker pool internally and runs against ephemeral schemas in the same Postgres instance a worker uses.

Motivation

PilotSwarm has rich durable execution semantics (waits, dehydration, hydration, worker restart, sub-agents, CMS evidence) but no first-class eval harness to gate behavior across model/prompt/tool changes. Regressions in turn execution, tool sequencing, or durable trajectory state surface as flaky integration tests or get caught only in downstream apps. Downstream apps face the same gap: every team builds its own eval scaffolding.

This PR brings:

- A managed live runner that owns the worker pool, creates sessions, records CMS/tool evidence, applies chaos (worker restart, during-wait and after-tool-call-N injection), and tears down ephemeral schemas, all from a single CLI invocation.
- A rigid 3-layer config hierarchy (run config → manifest → scenario config) so behavior overrides cannot creep into the wrong layer. Manifests can only select scenarios and add tags. CLI flags override run config only and never rewrite scenario config.
- Evidence-first LLMJudge (OpenAI + Copilot providers) returning reason → evidence → issues → verdict → confidence. No numeric score, no pass threshold.
- A downstream story: the package is npm-publishable as pilotswarm-eval-harness (MIT) with a run-eval bin, a builder-agent skill template, and a downstream guide so app teams drop in their own scenarios in minutes.
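The layering rule above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `RunConfig`, `ManifestEntry`, and `ScenarioConfig` shapes and the `materialize` function are hypothetical names, and the real schemas in src/schema/ may differ.

```typescript
// Hypothetical shapes for illustration; the real zod schemas may differ.
interface RunConfig {
  driver: string;
  concurrency: number;
}

interface ManifestEntry {
  scenarioId: string; // a manifest may select a scenario...
  tags?: string[]; // ...and add tags, nothing else
}

interface ScenarioConfig {
  id: string;
  prompt: string;
  tags: string[];
}

interface EffectiveScenario {
  run: RunConfig;
  scenario: ScenarioConfig;
  tags: string[];
}

function materialize(
  runConfig: RunConfig,
  cliOverrides: Partial<RunConfig>,
  manifest: ManifestEntry[],
  scenarios: Record<string, ScenarioConfig>,
): EffectiveScenario[] {
  // CLI flags are merged into the run config only.
  const run: RunConfig = { ...runConfig, ...cliOverrides };
  return manifest.map((entry) => {
    const scenario = scenarios[entry.scenarioId];
    if (!scenario) throw new Error(`unknown scenario: ${entry.scenarioId}`);
    // Scenario config passes through untouched; manifest tags are additive.
    const merged = scenario.tags.concat(entry.tags ?? []);
    const tags = merged.filter((tag, i) => merged.indexOf(tag) === i);
    return { run, scenario, tags };
  });
}
```

Because `cliOverrides` is typed as `Partial<RunConfig>`, a flag has no way to reach into a scenario's fields, which is the property the hierarchy is meant to guarantee.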

What's added

packages/eval-harness/
├── bin/
│   ├── run-eval.sh                         CLI entry (rebuilds dist on stale source)
│   └── run-eval.ts                         arg parsing, plugin loader, exit codes
├── src/
│   ├── index.ts                            public API surface
│   ├── registry.ts                         scenario kinds / checks / tools / drivers / reporters
│   ├── schema/
│   │   ├── config.ts                       RunConfig (zod, strict-ish)
│   │   ├── manifest.ts                     JSONL manifest directives + cycle detection
│   │   ├── scenario.ts                     discriminated-union scenario schemas + semantic validation
│   │   └── check-types.ts                  built-in check schemas
│   ├── drivers/
│   │   ├── live.ts                         single-scenario live driver (legacy)
│   │   ├── fake.ts                         deterministic preflight driver
│   │   ├── scripted.ts                     plugin-driven observations
│   │   ├── pilotswarm.ts                   attach driver (alias kept for old configs)
│   │   └── chaos.ts                        reserved diagnostic driver
│   ├── engine/
│   │   ├── managed-live-runner.ts          worker pool, isolation modes, chaos controller
│   │   ├── run-manifest.ts                 orchestration entry; effective config materialization
│   │   ├── discover.ts                     scenario + manifest discovery (glob, include/exclude)
│   │   ├── meta-scenarios.ts               prompt-variant + ablation expansion
│   │   ├── post-run.ts                     trajectory summary
│   │   ├── check-runner.ts                 check evaluation pipeline
│   │   └── …agent-inventory, cost-budget, isolation, effective-config, prompt-loading
│   ├── checks/
│   │   ├── index.ts                        built-in deterministic checks
│   │   └── llm-judge.ts                    evidence-first judge (OpenAI + Copilot)
│   ├── reporters/
│   │   ├── console.ts, jsonl.ts, markdown.ts, output.ts
│   └── tools/defaults.ts                   bundled scenario tools
├── scenarios/                              44 bundled JSON scenarios
├── runs/                                   bundled run plans (smoke, critical-path, all, nightly, …)
├── docs/                                   QUICKSTART, SCHEMA, PLUGINS, DOWNSTREAM-GUIDE, TROUBLESHOOTING
├── test/                                   19 vitest files (111 tests)
├── package.json, tsconfig.json, vitest.config.ts, README.md
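The tree lists "cycle detection" for manifest directives. One common way to detect a cycle among manifests that include other manifests is a depth-first walk that tracks the current include chain. The sketch below assumes the includes have already been read into a plain map; `IncludeGraph` and `findIncludeCycle` are hypothetical names, and the PR does not say which algorithm manifest.ts uses.

```typescript
// Hypothetical input: manifest path -> manifests it includes.
type IncludeGraph = Record<string, string[]>;

// Returns the include chain that forms a cycle, or null if there is none.
function findIncludeCycle(graph: IncludeGraph, root: string): string[] | null {
  const done = new Set<string>(); // fully explored, known cycle-free
  const stack: string[] = []; // current include chain

  function visit(node: string): string[] | null {
    const at = stack.indexOf(node);
    if (at !== -1) return stack.slice(at).concat(node); // back edge: cycle found
    if (done.has(node)) return null;
    stack.push(node);
    for (const child of graph[node] ?? []) {
      const cycle = visit(child);
      if (cycle) return cycle;
    }
    stack.pop();
    done.add(node);
    return null;
  }

  return visit(root);
}
```

Returning the offending chain rather than a boolean lets the error message name every manifest involved.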

Drivers

| Driver | Use it for | Notes |
| --- | --- | --- |
| `live` | Managed PilotSwarm E2E execution | Owns the worker pool; requires `DATABASE_URL`, `GITHUB_TOKEN`, Postgres |
| `fake` | Explicit local validation (`--fake` or `--driver=fake`) | Deterministic, no infra or credentials |
| `scripted` | Plugin-defined deterministic observations | App-specific test fixtures |
| `attach` (alias `pilotswarm`) | Client-driven diagnostic vs. already-running worker | Forensics; not the canonical E2E path |
| `chaos` | Reserved diagnostic | Production chaos goes through managed `live` |
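As a rough sketch of how the driver names in the table could be resolved: the `pilotswarm` alias maps to `attach`, and `--fake` is equivalent to `--driver=fake`. The function name `resolveDriver` and the precedence of `--fake` over an explicit `--driver` value are assumptions made for illustration, not behavior confirmed by this PR.

```typescript
type DriverName = "live" | "fake" | "scripted" | "attach" | "chaos";

const KNOWN_DRIVERS: string[] = ["live", "fake", "scripted", "attach", "chaos"];

// Alias kept for old configs, per the table above.
const DRIVER_ALIASES: Record<string, DriverName> = { pilotswarm: "attach" };

function resolveDriver(
  driverFlag: string | undefined, // value of --driver, if given
  fakeFlag: boolean, // true when --fake was passed
  configured: string, // driver named in the run config
): DriverName {
  if (fakeFlag) return "fake"; // assumption: --fake wins over --driver
  const raw = driverFlag ?? configured; // a CLI flag overrides the run config
  const resolved = DRIVER_ALIASES[raw] ?? raw;
  if (KNOWN_DRIVERS.indexOf(resolved) === -1) {
    throw new Error(`unknown driver: ${raw}`);
  }
  return resolved as DriverName;
}
```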

Bundled run plans

| Run | Purpose | Driver |
| --- | --- | --- |
| `smoke` | Live smoke over representative scenarios | `live` |
| `critical-path` | Live durable + safety + multi-turn + agent-behavior gate | `live` |
| `all` | Every checked-in scenario through managed live | `live` |
| `nightly` | Larger diagnostic run with meta scenarios | `live` |
| `durable-cross-model` | Durable model-classification plan | `live` |
| `attach-live` | Diagnostic vs. an already-running worker | `attach` |

live-smoke, live-critical-path, live-all, live-e2e are backward-compat aliases.

Surface highlights

- Public API (src/index.ts): discoverScenarios, runManifest, runScenario, evaluateCheck, evaluateChecks, register{ScenarioKind,CheckType,Tool,Driver,Reporter}, all schemas + types. Internal helpers stay internal so v0.1.0 is the stable surface.
- Run output bundle: REPORT.md, summary.json, run-config.json (redacted effective config), machine/results.jsonl, plus per-scenario README.md / result.json / timeline.md / transcript.md / cms-events.json / tool-calls.json / agent-sessions.json.
- Redaction: redactForArtifact strips API keys, tokens, cookies, and DB credentials (store, databaseurl, connectionstring, dsn, pgpassword, etc.).
- OPENAI_BASE_URL safety: non-default values emit a one-time warning before OPENAI_API_KEY is sent.
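To illustrate the redaction idea, here is a minimal key-based sketch. It is not the code behind redactForArtifact: the function name `redactSketch`, the `[REDACTED]` placeholder, and the matching rule (lowercase the key, drop non-alphanumerics, then substring-match) are all assumptions, and the fragment list is only a subset of the key names mentioned above.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Subset of the sensitive key names from the PR text; the list is illustrative.
const SENSITIVE_KEY_PARTS = [
  "apikey", "token", "cookie",
  "databaseurl", "connectionstring", "dsn", "pgpassword",
];

function isSensitiveKey(key: string): boolean {
  // Assumed normalization, so DATABASE_URL and databaseUrl both match "databaseurl".
  const normalized = key.toLowerCase().replace(/[^a-z0-9]/g, "");
  return SENSITIVE_KEY_PARTS.some((part) => normalized.indexOf(part) !== -1);
}

// Returns a deep copy with the values of sensitive keys replaced.
function redactSketch(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => redactSketch(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      out[key] = isSensitiveKey(key) ? "[REDACTED]" : redactSketch(value[key]);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Matching on keys alone will miss secrets embedded in free-form strings, so a production redactor would likely also need value-pattern checks.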

What's not changing

- packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, packages/ui-core/, and packages/ui-react/ are untouched.
- No SDK API changes, no schema changes, no CMS schema changes.
- Existing tests under packages/sdk/test/local/ and packages/mcp-server/ are unchanged.
- The pilotswarm-sdk peer dependency is satisfied by the current SDK version (^0.1.29).
- The only repo-level deltas are three additive changes; the test-script hook can be skipped via an environment variable:

| Surface | Change | Opt-out |
| --- | --- | --- |
| Root `package.json` build chain | Appends `--workspace=pilotswarm-eval-harness` | n/a (additive) |
| `scripts/run-tests.sh` | Adds `run_eval_harness_tests` before SDK suites | `SKIP_EVAL_HARNESS_TESTS=1` |
| Root `.gitignore` | Adds `*.eval-results/` | n/a |

How to try it

From the repo

npm install
npm run build --workspace=pilotswarm-eval-harness

# Fake preflight (no Postgres, no GITHUB_TOKEN)
packages/eval-harness/bin/run-eval.sh --run=smoke --fake

# Live smoke (needs DATABASE_URL + GITHUB_TOKEN + Postgres)
set -a; source .env; set +a
packages/eval-harness/bin/run-eval.sh --run=smoke

# Full sweep
packages/eval-harness/bin/run-eval.sh --run=all

From a downstream app

npm install pilotswarm-eval-harness

# Preflight a downstream run config
npm exec run-eval -- --config=eval/runs/smoke/config.json --fake --require=eval/eval-plugins.js

# Live downstream run
npm exec run-eval -- --config=eval/runs/smoke/config.json --require=eval/eval-plugins.js

packages/eval-harness/docs/DOWNSTREAM-GUIDE.md walks through the full downstream setup; templates/builder-agents/skills/eval-harness/SKILL.md is the matching builder-agent skill.

Testing

- npm run build --workspace=pilotswarm-eval-harness builds cleanly (NodeNext, strict).
- npm test --workspace=pilotswarm-eval-harness: 111/111 passing across 19 vitest files (~1.7s).
- Fake preflight runs across all bundled scenarios without infra or credentials.

Live runs (--run=smoke, --run=critical-path, --run=all) need Postgres + GITHUB_TOKEN and were not exercised in this session — recommend a reviewer or CI cover those before merge.

Reviewer guidance

Suggested review order:

1. src/schema/{config,manifest,scenario}.ts: the contract surface; everything downstream of discovery validates against these.
2. src/engine/managed-live-runner.ts: worker pool ownership, isolation modes, chaos controller (timer lifecycle on the error path is the highest-risk surface).
3. src/engine/run-manifest.ts + src/engine/discover.ts: orchestration entry and manifest cycle / glob handling.
4. src/checks/llm-judge.ts: provider routing, evidence-first prompt, OPENAI_BASE_URL handling.
5. src/reporters/output.ts: redaction patterns (API keys, tokens, DB credentials) and result bundle layout.
6. src/index.ts: public API surface; confirm nothing internal leaks.
7. bin/run-eval.{sh,ts}: CLI flags, exit codes, plugin loader.
8. templates/builder-agents/skills/eval-harness/: downstream skill template; confirm it matches the canonical CLI flags.
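When reviewing the judge, it may help to picture the evidence-first output (reason, evidence, issues, verdict, confidence, with no numeric score) as a validated record. The sketch below is an assumption-heavy illustration: the field names come from the PR text, but the value types (`"pass"`/`"fail"`, a categorical confidence, string arrays), the non-empty evidence rule, and the `parseJudgeVerdict` name are hypothetical.

```typescript
// Hypothetical shape; only the field names are taken from the PR description.
interface JudgeVerdict {
  reason: string;
  evidence: string[];
  issues: string[];
  verdict: "pass" | "fail";
  confidence: "low" | "medium" | "high";
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

function isOneOf<T extends string>(value: unknown, allowed: readonly T[]): value is T {
  return typeof value === "string" && (allowed as readonly string[]).indexOf(value) !== -1;
}

function parseJudgeVerdict(raw: string): JudgeVerdict {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("judge output must be a JSON object");
  }
  const { reason, evidence, issues, verdict, confidence } = parsed as Record<string, unknown>;
  if (typeof reason !== "string") throw new Error("reason must be a string");
  // Assumed rule: a verdict without cited evidence is rejected.
  if (!isStringArray(evidence) || evidence.length === 0) {
    throw new Error("at least one evidence item is required");
  }
  if (!isStringArray(issues)) throw new Error("issues must be a string array");
  if (!isOneOf(verdict, ["pass", "fail"] as const)) throw new Error("invalid verdict");
  if (!isOneOf(confidence, ["low", "medium", "high"] as const)) {
    throw new Error("invalid confidence");
  }
  return { reason, evidence, issues, verdict, confidence };
}
```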

Out of scope / follow-ups

- Chaos types beyond worker-restart (worker-crash, child-crash, tool-timeout, dehydrate-now) are schema-reserved but rejected at runtime in v1.
- Live model-provider sweeps and durability gating beyond the bundled smoke/critical-path plans are framed in docs/proposals-impl/eval-harness.md but not enabled by default.
- Trimming bundled run-plan compatibility aliases (live-smoke, live-critical-path, live-all, live-e2e) once downstream pins migrate.
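The "schema-reserved but rejected at runtime" split for chaos types can be pictured as two separate gates. This is a sketch under assumptions: the type names are taken from the list above, while `parseChaosType`, `assertRunnable`, and the error messages are hypothetical.

```typescript
// Every name the schema accepts (from the PR text).
const RESERVED_CHAOS_TYPES = [
  "worker-restart", "worker-crash", "child-crash", "tool-timeout", "dehydrate-now",
] as const;
type ChaosType = (typeof RESERVED_CHAOS_TYPES)[number];

// The only type the v1 runtime executes, per the PR text.
const RUNTIME_SUPPORTED: readonly ChaosType[] = ["worker-restart"];

// Gate 1 (schema): unknown names fail validation outright.
function parseChaosType(value: string): ChaosType {
  if ((RESERVED_CHAOS_TYPES as readonly string[]).indexOf(value) === -1) {
    throw new Error(`unknown chaos type: ${value}`);
  }
  return value as ChaosType;
}

// Gate 2 (runtime): reserved names parse, but are refused before execution.
function assertRunnable(type: ChaosType): void {
  if (RUNTIME_SUPPORTED.indexOf(type) === -1) {
    throw new Error(`chaos type "${type}" is reserved and not supported in v1`);
  }
}
```

Keeping the two gates separate means a scenario file that names a reserved type stays schema-valid and only fails when someone actually tries to run it.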

Introduces packages/eval-harness with src, bin, registry, schemas (zod), drivers (live, fake, scripted, attach, chaos), checks (including LLM judge with OPENAI_BASE_URL warning), reporters (console, jsonl, markdown), and the managed-live runner with fixed chaos timer lifecycle. Wires the workspace into the root build chain and adds *.eval-results/ to the root gitignore.

Ships JSON scenarios across single-turn, multi-turn, durable trajectory, safety, meta (prompt-variant), agent-behavior, and live e2e categories. Provides bundled run plans (smoke, critical-path, all, nightly, durable-cross-model, attach-live, plus live-* compatibility aliases) that the run-eval CLI resolves via --run=<name>.

Covers schema parsing, check evaluation, registry wiring, reporters (including redaction), CLI argument handling, the managed live runner, the live and pilotswarm drivers, the LLM judge provider, custom plugin extensions, agent inventory, and the bundled scenario corpus. Adds run_eval_harness_tests to scripts/run-tests.sh with a SKIP_EVAL_HARNESS_TESTS=1 escape hatch.

…docs Adds packages/eval-harness/README.md plus the QUICKSTART, SCHEMA, PLUGINS, DOWNSTREAM-GUIDE, and TROUBLESHOOTING docs. Records the eval harness in docs/proposals-impl/eval-harness.md and indexes it in docs/proposals-impl/README.md. Ships the downstream builder-agent skill at templates/builder-agents/skills/eval-harness with a working scenario, run config, and plugin example, and links it from the builder-agents README.

ankitdas-volgapartners added 4 commits May 20, 2026 11:19

ankitdas-volgapartners wants to merge 4 commits into affandar:main from volga-partners:feat/eval-harness
