feat(eval-harness): add scenario-driven eval harness package #33

Open
ankitdas-volgapartners wants to merge 4 commits into affandar:main from volga-partners:feat/eval-harness

ankitdas-volgapartners commented May 20, 2026

Summary

Adds packages/eval-harness/ — a scenario-driven eval harness for PilotSwarm agents, tool use, durable waits, session recovery, safety prompts, prompt variants, and model/run regressions. Ships managed live execution with CMS/tool evidence capture, schema-validated configs (zod), bundled run plans, file reporters, and a downstream builder-agent skill template.

The package is purely additive: nothing in packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, or packages/ui-* is touched. The harness creates its own PilotSwarmClient/PilotSwarmWorker pool internally and runs against ephemeral schemas in the same Postgres instance a worker uses.

Motivation

PilotSwarm has rich durable execution semantics (waits, dehydration, hydration, worker restart, sub-agents, CMS evidence) but no first-class eval harness to gate behavior across model/prompt/tool changes. Regressions in turn execution, tool sequencing, or durable trajectory state surface as flaky integration tests or get caught only in downstream apps. Downstream apps face the same gap: every team builds its own eval scaffolding.

This PR brings:

  1. A managed live runner that owns the worker pool, creates sessions, records CMS/tool evidence, applies chaos (worker restart, during-wait and after-tool-call-N injection), and tears down ephemeral schemas — all from a single CLI invocation.
  2. A rigid 3-layer config hierarchy (run config -> manifest -> scenario config) so behavior overrides cannot creep into the wrong layer. Manifests can only select scenarios and add tags. CLI flags override run config only — never rewrite scenario config.
  3. Evidence-first LLMJudge (OpenAI + Copilot providers) returning reason → evidence → issues → verdict → confidence. No numeric score, no pass threshold.
  4. A downstream story: the package is npm-publishable as pilotswarm-eval-harness (MIT) with a run-eval bin, a builder-agent skill template, and a downstream guide so app teams drop in their own scenarios in minutes.
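The layering rule in item 2 can be illustrated with a minimal sketch. The type shapes and the function names applyCliFlags and selectFromManifest are hypothetical stand-ins for illustration; the package's real schemas live in src/schema/ and may differ.

```typescript
// Hypothetical shapes for illustration only; not the package's zod schemas.
type RunConfig = { driver: string; outputDir: string };
type ScenarioConfig = { id: string; prompt: string; tags: string[] };

// CLI flags may override run-config fields and nothing else.
function applyCliFlags(run: RunConfig, flags: Partial<RunConfig>): RunConfig {
  return {
    driver: flags.driver ?? run.driver,
    outputDir: flags.outputDir ?? run.outputDir,
  };
}

// A manifest directive may only select a scenario and add tags.
const ALLOWED_MANIFEST_KEYS = ["scenario", "tags"];

function selectFromManifest(
  scenarios: Map<string, ScenarioConfig>,
  directives: Record<string, unknown>[],
): ScenarioConfig[] {
  return directives.map((directive) => {
    for (const key of Object.keys(directive)) {
      if (ALLOWED_MANIFEST_KEYS.indexOf(key) === -1) {
        throw new Error(`manifest may not set "${key}"`);
      }
    }
    const id = directive.scenario;
    if (typeof id !== "string") {
      throw new Error("manifest directive needs a scenario id");
    }
    const base = scenarios.get(id);
    if (!base) {
      throw new Error(`unknown scenario "${id}"`);
    }
    const extra = Array.isArray(directive.tags)
      ? directive.tags.filter((t): t is string => typeof t === "string")
      : [];
    // Tags are appended and de-duplicated; every other scenario field is untouched.
    const tags = base.tags.concat(extra.filter((t) => base.tags.indexOf(t) === -1));
    return { ...base, tags };
  });
}
```

In this sketch a directive such as { scenario: "s1", prompt: "..." } is rejected outright, which is one way to keep behavior overrides from leaking into the manifest layer.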

What's added

packages/eval-harness/
├── bin/
│   ├── run-eval.sh                         CLI entry (rebuilds dist on stale source)
│   └── run-eval.ts                         arg parsing, plugin loader, exit codes
├── src/
│   ├── index.ts                            public API surface
│   ├── registry.ts                         scenario kinds / checks / tools / drivers / reporters
│   ├── schema/
│   │   ├── config.ts                       RunConfig (zod, strict-ish)
│   │   ├── manifest.ts                     JSONL manifest directives + cycle detection
│   │   ├── scenario.ts                     discriminated-union scenario schemas + semantic validation
│   │   └── check-types.ts                  built-in check schemas
│   ├── drivers/
│   │   ├── live.ts                         single-scenario live driver (legacy)
│   │   ├── fake.ts                         deterministic preflight driver
│   │   ├── scripted.ts                     plugin-driven observations
│   │   ├── pilotswarm.ts                   attach driver (alias kept for old configs)
│   │   └── chaos.ts                        reserved diagnostic driver
│   ├── engine/
│   │   ├── managed-live-runner.ts          worker pool, isolation modes, chaos controller
│   │   ├── run-manifest.ts                 orchestration entry; effective config materialization
│   │   ├── discover.ts                     scenario + manifest discovery (glob, include/exclude)
│   │   ├── meta-scenarios.ts               prompt-variant + ablation expansion
│   │   ├── post-run.ts                     trajectory summary
│   │   ├── check-runner.ts                 check evaluation pipeline
│   │   └── …agent-inventory, cost-budget, isolation, effective-config, prompt-loading
│   ├── checks/
│   │   ├── index.ts                        built-in deterministic checks
│   │   └── llm-judge.ts                    evidence-first judge (OpenAI + Copilot)
│   ├── reporters/
│   │   ├── console.ts, jsonl.ts, markdown.ts, output.ts
│   └── tools/defaults.ts                   bundled scenario tools
├── scenarios/                              44 bundled JSON scenarios
├── runs/                                   bundled run plans (smoke, critical-path, all, nightly, …)
├── docs/                                   QUICKSTART, SCHEMA, PLUGINS, DOWNSTREAM-GUIDE, TROUBLESHOOTING
├── test/                                   19 vitest files (111 tests)
├── package.json, tsconfig.json, vitest.config.ts, README.md
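src/schema/manifest.ts is listed above as handling manifest cycle detection. A minimal sketch of how such a check could work, assuming manifests can reference other manifests by name; the ManifestGraph shape and findManifestCycle are hypothetical and not taken from the package:

```typescript
// Hypothetical: manifest name -> names of the manifests it references.
type ManifestGraph = Record<string, string[]>;

// Depth-first search that returns the first cycle reachable from `start`
// as a path (for example ["a", "b", "a"]), or null when there is none.
function findManifestCycle(graph: ManifestGraph, start: string): string[] | null {
  const stack: string[] = [];      // manifests on the current DFS path
  const done = new Set<string>();  // manifests fully explored without a cycle

  function visit(name: string): string[] | null {
    const at = stack.indexOf(name);
    if (at !== -1) {
      // `name` is already on the current path, so the path loops back to it.
      return [...stack.slice(at), name];
    }
    if (done.has(name)) return null;
    stack.push(name);
    for (const next of graph[name] ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    stack.pop();
    done.add(name);
    return null;
  }

  return visit(start);
}
```

Returning the offending path rather than a boolean makes it possible to name the manifests involved in an error message.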

Drivers

| Driver | Use it for | Notes |
| --- | --- | --- |
| live | Managed PilotSwarm E2E execution | Owns the worker pool; requires DATABASE_URL, GITHUB_TOKEN, Postgres |
| fake | Explicit local validation (--fake or --driver=fake) | Deterministic, no infra or credentials |
| scripted | Plugin-defined deterministic observations | App-specific test fixtures |
| attach (alias pilotswarm) | Client-driven diagnostic vs. already-running worker | Forensics; not the canonical E2E path |
| chaos | Reserved diagnostic | Production chaos goes through managed live |

Bundled run plans

| Run | Purpose | Driver |
| --- | --- | --- |
| smoke | Live smoke over representative scenarios | live |
| critical-path | Live durable + safety + multi-turn + agent-behavior gate | live |
| all | Every checked-in scenario through managed live | live |
| nightly | Larger diagnostic run with meta scenarios | live |
| durable-cross-model | Durable model-classification plan | live |
| attach-live | Diagnostic vs. an already-running worker | attach |

live-smoke, live-critical-path, live-all, live-e2e are backward-compat aliases.

Surface highlights

  • Public API (src/index.ts): discoverScenarios, runManifest, runScenario, evaluateCheck, evaluateChecks, register{ScenarioKind,CheckType,Tool,Driver,Reporter}, all schemas + types. Internal helpers stay internal so v0.1.0 is the stable surface.
  • Run output bundle: REPORT.md, summary.json, run-config.json (redacted effective config), machine/results.jsonl, plus per-scenario README.md / result.json / timeline.md / transcript.md / cms-events.json / tool-calls.json / agent-sessions.json.
  • Redaction: redactForArtifact strips API keys, tokens, cookies, and DB credentials (store, databaseurl, connectionstring, dsn, pgpassword, etc.).
  • OPENAI_BASE_URL safety: non-default values emit a one-time warning before OPENAI_API_KEY is sent.
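To make the redaction bullet concrete, here is a minimal sketch of key-based redaction. It is not the package's redactForArtifact; the function redactSketch, the "[REDACTED]" placeholder, and the substring matching rule are assumptions for illustration. Only the key names are taken from the list above.

```typescript
// Key fragments drawn from the bullet above; the matching rule is an assumption.
const SENSITIVE_KEY_FRAGMENTS = [
  "apikey", "token", "cookie",
  "store", "databaseurl", "connectionstring", "dsn", "pgpassword",
];

// Lowercase and drop separators so DATABASE_URL and databaseUrl compare equal.
function normalizeKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Recursively copy a value, replacing anything under a sensitive-looking key.
// Substring matching over-redacts (for example "maxTokens"), which this sketch
// accepts in exchange for not leaking credentials.
function redactSketch(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSketch(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      const normalized = normalizeKey(key);
      const sensitive = SENSITIVE_KEY_FRAGMENTS.some(
        (fragment) => normalized.indexOf(fragment) !== -1,
      );
      out[key] = sensitive ? "[REDACTED]" : redactSketch(source[key]);
    }
    return out;
  }
  return value;
}
```

Applied to an effective config before it is written to run-config.json, a function of this kind would leave non-sensitive fields such as a model name intact while blanking DATABASE_URL and token fields at any nesting depth.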

What's not changing

  • packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, packages/ui-core/, packages/ui-react/ — untouched.
  • No SDK API changes, no schema changes, no CMS schema changes.
  • Existing tests under packages/sdk/test/local/ and packages/mcp-server/ are unchanged.
  • pilotswarm-sdk peer dependency satisfied by current SDK version (^0.1.29).
  • The only repo-level deltas are 3 opt-out-compatible additions:
| Surface | Change | Opt-out |
| --- | --- | --- |
| Root package.json build chain | Appends --workspace=pilotswarm-eval-harness | n/a (additive) |
| scripts/run-tests.sh | Adds run_eval_harness_tests before SDK suites | SKIP_EVAL_HARNESS_TESTS=1 |
| Root .gitignore | Adds *.eval-results/ | n/a |

How to try it

From the repo

npm install
npm run build --workspace=pilotswarm-eval-harness

# Fake preflight (no Postgres, no GITHUB_TOKEN)
packages/eval-harness/bin/run-eval.sh --run=smoke --fake

# Live smoke (needs DATABASE_URL + GITHUB_TOKEN + Postgres)
set -a; source .env; set +a
packages/eval-harness/bin/run-eval.sh --run=smoke

# Full sweep
packages/eval-harness/bin/run-eval.sh --run=all

From a downstream app

npm install pilotswarm-eval-harness

# Preflight a downstream run config
npm exec run-eval -- --config=eval/runs/smoke/config.json --fake --require=eval/eval-plugins.js

# Live downstream run
npm exec run-eval -- --config=eval/runs/smoke/config.json --require=eval/eval-plugins.js

packages/eval-harness/docs/DOWNSTREAM-GUIDE.md walks through the full downstream setup; templates/builder-agents/skills/eval-harness/SKILL.md is the matching builder-agent skill.

Testing

  • npm run build --workspace=pilotswarm-eval-harness completes cleanly (NodeNext, strict).
  • npm test --workspace=pilotswarm-eval-harness: 111/111 passing across 19 vitest files (~1.7s).
  • Fake preflight runs across all bundled scenarios without infra or credentials.

Live runs (--run=smoke, --run=critical-path, --run=all) need Postgres + GITHUB_TOKEN and were not exercised in this session; a reviewer or CI should run them before merge.

Reviewer guidance

Suggested review order:

  1. src/schema/{config,manifest,scenario}.ts — the contract surface; everything downstream of discovery validates against these.
  2. src/engine/managed-live-runner.ts — worker pool ownership, isolation modes, chaos controller (timer lifecycle on the error path is the highest-risk surface).
  3. src/engine/run-manifest.ts + src/engine/discover.ts — orchestration entry and manifest cycle / glob handling.
  4. src/checks/llm-judge.ts — provider routing, evidence-first prompt, OPENAI_BASE_URL handling.
  5. src/reporters/output.ts — redaction patterns (API keys, tokens, DB credentials) and result bundle layout.
  6. src/index.ts — public API surface; confirm nothing internal leaks.
  7. bin/run-eval.{sh,ts} — CLI flags, exit codes, plugin loader.
  8. templates/builder-agents/skills/eval-harness/ — downstream skill template; confirm it matches the canonical CLI flags.

Out of scope / follow-ups

  • Chaos types beyond worker-restart (worker-crash, child-crash, tool-timeout, dehydrate-now) are schema-reserved but rejected at runtime in v1.
  • Live model-provider sweeps and durability gating beyond the bundled smoke/critical-path plans are framed in docs/proposals-impl/eval-harness.md but not enabled by default.
  • Trimming bundled run-plan compatibility aliases (live-smoke, live-critical-path, live-all, live-e2e) once downstream pins migrate.
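The "schema-reserved but rejected at runtime" behavior for chaos types can be sketched as follows. The chaos type names and the two injection points (during-wait, after-tool-call-N) come from this description; the ChaosSpec shape and the function names are hypothetical.

```typescript
// Chaos types named in this PR; only worker-restart is runnable in v1.
const RESERVED_CHAOS_TYPES = [
  "worker-crash", "child-crash", "tool-timeout", "dehydrate-now",
] as const;

type ChaosType = "worker-restart" | (typeof RESERVED_CHAOS_TYPES)[number];

// Hypothetical trigger shape covering the two injection points.
type ChaosTrigger = { at: "during-wait" } | { at: "after-tool-call"; n: number };
type ChaosSpec = { type: ChaosType; trigger: ChaosTrigger };

// Schema-level check: reserved names are accepted so configs stay forward-compatible.
function isChaosType(value: string): value is ChaosType {
  return (
    value === "worker-restart" ||
    (RESERVED_CHAOS_TYPES as readonly string[]).indexOf(value) !== -1
  );
}

// Runtime-level check: anything other than worker-restart is refused before the run starts.
function assertRunnableChaos(spec: ChaosSpec): void {
  if (spec.type !== "worker-restart") {
    throw new Error(`chaos type "${spec.type}" is reserved and not supported in v1`);
  }
  if (spec.trigger.at === "after-tool-call" && spec.trigger.n < 1) {
    throw new Error("after-tool-call trigger needs n >= 1");
  }
}
```

Splitting the check this way lets a scenario file mention a reserved type without failing validation, while still failing fast when someone tries to execute it.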

Introduces packages/eval-harness with src, bin, registry, schemas
(zod), drivers (live, fake, scripted, attach, chaos), checks
(including LLM judge with OPENAI_BASE_URL warning), reporters
(console, jsonl, markdown), and the managed-live runner with
fixed chaos timer lifecycle.

Wires the workspace into the root build chain and adds
*.eval-results/ to the root gitignore.

Ships JSON scenarios across single-turn, multi-turn, durable
trajectory, safety, meta (prompt-variant), agent-behavior, and
live e2e categories. Provides bundled run plans (smoke,
critical-path, all, nightly, durable-cross-model, attach-live,
plus live-* compatibility aliases) that the run-eval CLI
resolves via --run=<name>.

Covers schema parsing, check evaluation, registry wiring,
reporters (including redaction), CLI argument handling, the
managed live runner, the live and pilotswarm drivers, the LLM
judge provider, custom plugin extensions, agent inventory, and
the bundled scenario corpus.

Adds run_eval_harness_tests to scripts/run-tests.sh with a
SKIP_EVAL_HARNESS_TESTS=1 escape hatch.
…docs

Adds packages/eval-harness/README.md plus the QUICKSTART,
SCHEMA, PLUGINS, DOWNSTREAM-GUIDE, and TROUBLESHOOTING docs.
Records the eval harness in docs/proposals-impl/eval-harness.md
and indexes it in docs/proposals-impl/README.md. Ships the
downstream builder-agent skill at templates/builder-agents/
skills/eval-harness with a working scenario, run config, and
plugin example, and links it from the builder-agents README.