6 changes: 4 additions & 2 deletions AGENTS.md
@@ -74,9 +74,9 @@ The `evolution/<tier>/` directories form **a clean layering**: `evolution/core/`

1. CLI resolves `--skill <name>` to a `SKILL.md` via the `SkillSource` walk.
2. Eval dataset is built (synthetic LM gen / golden file / sessiondb mining).
3. Skill body wrapped as `dspy.Module`; GEPA optimizes it with `BudgetAwareProposer` injecting a char budget into the reflection prompt.
3. Skill body wrapped as `dspy.Module`. **Saturation pre-flight** (`evolution/core/saturation_check.py`) scores the baseline on the holdout and, when configured, the closed-loop suite, classifies it into one of four bands, and on non-`healthy` bands either prompts for confirmation (interactive) or aborts (non-interactive default-deny). `--no-saturation-check` skips the probe; `--force-saturation-check` overrides the default-deny. Then GEPA optimizes the candidate with `BudgetAwareProposer` injecting a char budget into the reflection prompt.
4. Knee-point Pareto selection walks the candidates within ε of the best valset score in `--knee-point-strategy` order. Default `val-best`: highest val first, smallest body as tiebreak. `smallest` (greedy parsimony) is available via the flag for users explicitly chasing compression.
5. Static constraints + paired-bootstrap growth-quality gate decide deploy vs. reject; both outcomes write `gate_decision.json`. The default rule is `no_regression` (`mean >= 0`); `--quality-gate non-inferiority` switches to `lower_bound > -inferiority_tolerance` (recommended for compression-focused runs at small N where the bootstrap CI swamps tiny effects).
5. Static constraints + paired-bootstrap growth-quality gate decide deploy vs. reject; both outcomes write `gate_decision.json`. The default rule is `no_regression` (`mean >= 0`); `--quality-gate non-inferiority` switches to `lower_bound > -inferiority_tolerance` (recommended for compression-focused runs at small N where the bootstrap CI swamps tiny effects). The post-GEPA holdout eval reuses the baseline scores from the pre-flight, so net cost stays ~zero when the pre-flight ran.
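
Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the framework's code: `Candidate`, `pick_knee_point`, `gate_passes`, and the default `epsilon` / `inferiority_tolerance` values are hypothetical names and numbers chosen for the example. Only the strategy names and the two gate rules come from the text above, and the tiebreak used for `smallest` is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one GEPA candidate."""
    name: str
    val_score: float   # valset score
    body_chars: int    # size of the skill body in characters


def pick_knee_point(cands: Sequence[Candidate], epsilon: float = 0.01,
                    strategy: str = "val-best") -> Candidate:
    """Choose among the candidates within epsilon of the best valset score."""
    best = max(c.val_score for c in cands)
    near_best = [c for c in cands if c.val_score >= best - epsilon]
    if strategy == "val-best":
        # Highest val first, smallest body as tiebreak.
        return min(near_best, key=lambda c: (-c.val_score, c.body_chars))
    if strategy == "smallest":
        # Greedy parsimony: smallest body first (higher val as an assumed tiebreak).
        return min(near_best, key=lambda c: (c.body_chars, -c.val_score))
    raise ValueError(f"unknown strategy: {strategy!r}")


def gate_passes(mean_delta: float, ci_lower: float, rule: str = "no_regression",
                inferiority_tolerance: float = 0.02) -> bool:
    """Apply one of the two growth-quality rules to paired-bootstrap outputs."""
    if rule == "no_regression":
        return mean_delta >= 0
    if rule == "non-inferiority":
        return ci_lower > -inferiority_tolerance
    raise ValueError(f"unknown rule: {rule!r}")


candidates = [
    Candidate("a", val_score=0.900, body_chars=1200),
    Candidate("b", val_score=0.895, body_chars=800),
    Candidate("c", val_score=0.800, body_chars=300),
]
chosen_default = pick_knee_point(candidates)                        # candidate "a"
chosen_smallest = pick_knee_point(candidates, strategy="smallest")  # candidate "b"
```

Candidate `c` is the smallest body overall but falls outside the ε window, so neither strategy can pick it. In the same spirit, a run with `mean_delta = -0.005` and `ci_lower = -0.015` fails `no_regression` but passes `non-inferiority` at the assumed tolerance of 0.02.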

## What lives where

@@ -101,6 +101,7 @@ The `evolution/<tier>/` directories form **a clean layering**: `evolution/core/`
| Tool-flavored judge + tool metric | `evolution/tools/tool_judge.py` |
| Behavioral `dspy.Example` builder for closed-loop trainset | `evolution/core/behavioral_example.py` |
| Closed-loop verdict cache + deterministic feedback rendering | `evolution/core/closed_loop_feedback.py` |
| Saturation pre-flight (band classifier + Rich panel + interactive confirm) | `evolution/core/saturation_check.py` |
| Deploy gate (static + growth-quality) | `evolution/core/constraints.py` |
| Preset table + gate-decision persistence (shared by skill/tool) | `evolution/core/quality_gate.py` |
| Paired-bootstrap CI | `evolution/core/stats.py` |
@@ -268,6 +269,7 @@ Open questions deferred to future PRs (per `PLAN.md` deviation notes):
- GEPA Pareto-frontier checkpointing (so a `TimeoutError` mid-run doesn't lose all candidates)
- Skill-size-based reflection-LM timeout scaling
- BCa bootstrap upgrade once N≥20 routinely
- **GEPA acceptance-gate work** (deviation #8 follow-up): the saturation pre-flight (`evolution/core/saturation_check.py`) addresses the user-visible symptom on saturated baselines (abort before GEPA spends budget). The underlying mechanism gap — stochastic small-minibatch `sum()` acceptance discarding per-instance signal — is tracked as Path D/E/C in `reports/pareto_frontier_feasibility.md` and remains future work (likely an upstream DSPy or GEPA PR).

## When to consult which doc

2 changes: 2 additions & 0 deletions PLAN.md
@@ -460,6 +460,8 @@ These descriptions are sent with every API call as part of the tool schema — e
7. **N=2 saturated baselines.** Weak-target hunt ran `evolve_tool` against `write_file` (98.8–99.2% holdout, 3 seeds, 1×/3× iter) and `search_files` (98.6% holdout). Both runs produced evolved descriptions byte-identical to the baseline — the knee-point picker correctly reverts to the unchanged baseline when GEPA's variants tie. The framework's tool-description pipeline is regression-catching, not improvement-finding, on these hand-tuned descriptions.
8. **Closed-loop signal can flow into reflection but doesn't change selection on saturated baselines.** The `--closed-loop-during-evolution` flag plumbs `ValidationReport`s into the GEPA reflection LM's feedback channel via the existing 5-arg metric protocol, opt-in, saturation-gated. Verified end-to-end on `write_file`: closed-loop fired (file mutated + restored), the reflection LM saw the verdict, GEPA still selected the baseline byte-for-byte. The bottleneck sits upstream of reflection — GEPA's `sum(judge_scores)` acceptance rule ties when every candidate hits 1.0 on a saturated minibatch. Extending the Pareto frontier into behavioral space (closed-loop tasks as additional training-set instances with their own per-instance scores so a candidate can stay on the frontier by winning behavioral tasks) is the structural direction that would address this; the cache + renderer added here are the natural building blocks for that work.

**Follow-up — Path F (saturation pre-flight) addresses the user-visible symptom, not the underlying mechanism.** A separate investigation (`reports/pareto_frontier_feasibility.md`, two spike runs) confirmed the deviation's diagnosis and reframed it: the bottleneck isn't frontier shape, it's GEPA's stochastic small-minibatch `sum()` acceptance gate discarding per-instance signal before it can move selection. Path F (`evolution/core/saturation_check.py`) ships the user-visible fix — detect the saturated case before GEPA starts, render a panel explaining why no improvement is possible, default-deny in non-interactive contexts. This prevents the wasted-budget UX without solving the mechanism gap. The mechanism-side fix (Pareto-dominance acceptance, larger minibatch, or stratified sampling) is tracked as "Path D/E/C" in the feasibility report and remains future work.
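
The tie described in deviation 8 and its follow-up can be reproduced with a toy model. This is a sketch under stated assumptions, not GEPA's code: `sum_accepts` models the acceptance rule as a strict improvement of the minibatch score sum, and `acceptance_rate` enumerates every possible minibatch instead of sampling one. The function names and score vectors are invented for illustration.

```python
from itertools import combinations
from typing import Sequence


def sum_accepts(parent: Sequence[float], child: Sequence[float]) -> bool:
    """Toy acceptance rule: the child must strictly beat the parent's score sum."""
    return sum(child) > sum(parent)


def acceptance_rate(parent: Sequence[float], child: Sequence[float],
                    minibatch_size: int) -> float:
    """Fraction of all possible minibatches on which the child is accepted."""
    batches = list(combinations(range(len(parent)), minibatch_size))
    accepted = sum(
        sum_accepts([parent[i] for i in batch], [child[i] for i in batch])
        for batch in batches
    )
    return accepted / len(batches)


# Fully saturated: every candidate scores 1.0 on every instance, so no minibatch can separate them.
saturated = [1.0] * 10
print(acceptance_rate(saturated, saturated, minibatch_size=3))  # 0.0

# One instance (say, a behavioral task) where only the child succeeds.
parent = [1.0] * 9 + [0.0]
child = [1.0] * 10
print(acceptance_rate(parent, child, minibatch_size=3))  # 0.3
print(acceptance_rate(parent, child, minibatch_size=6))  # 0.6
```

Under this model the child is strictly better on the full set, yet a minibatch of 3 out of 10 sees the deciding instance only 30% of the time. That is the sense in which a small stochastic minibatch can discard per-instance signal, and why a larger minibatch or stratified sampling is listed among the mechanism-side options.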

### Phase 3: System Prompt Evolution

**Goal:** Optimize the sections of the system prompt that guide agent behavior.
21 changes: 21 additions & 0 deletions README.md
@@ -245,6 +245,27 @@ uv run python -m evolution.tools.evolve_tool --tool X --manifest Y \

Env vars: `EVOLVED_PATH`, `BASELINE_PATH`, `RUN_DIR`, `TARGET_NAME`, `ARTIFACT_TYPE`. The hook runs under `/bin/sh -c` — interactive aliases are not available; invoke binaries by full name. Trust boundary: the command string is yours, do not pass strings you didn't write yourself.

### Saturation pre-flight (don't burn GEPA budget on hopeless runs)

By default, every `evolve_skill` / `evolve_tool` run starts with a pre-flight: it scores the baseline on the holdout (and the closed-loop suite, if `--closed-loop-during-evolution` is set), classifies the baseline into one of four bands (`healthy` / `no_headroom` / `weak_signal` / `uniform_failure`), and, on any non-`healthy` band, refuses to spend GEPA budget without confirmation or an explicit override.

```
Saturation check: holdout=0.987 (50 ex), closed-loop=1.000 (7 tasks)
╭─── No measurable headroom ───────────╮
│ Band: no_headroom │
│ • Baseline already saturates the eval│
│ • Try a harder closed-loop suite │
│ • Sanity check: synthetic generator? │
╰──────────────────────────────────────╯
Non-interactive context; refusing to proceed.
Pass --force-saturation-check to override.
```

In interactive contexts, non-`healthy` bands prompt for confirmation (`Continue anyway? [y/N]`). In non-interactive contexts (no TTY on stdin — CI, background jobs, cron), the framework default-denies and exits cleanly with the override hint. Net cost is ~zero: the probe's holdout scores are reused at the post-GEPA evaluation site, so the baseline isn't re-scored at run end.

- `--no-saturation-check` skips the probe entirely (useful when you've already validated headroom externally)
- `--force-saturation-check` runs the probe + renders the panel but proceeds regardless of band
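
The proceed-or-abort behavior described above can be summarized in a few lines. This is a sketch only, assuming the decision is made after the probe has produced a band: `should_proceed` and its parameters are hypothetical names, not the framework's API. The TTY test uses the standard `sys.stdin.isatty()`.

```python
import sys
from typing import Callable, Optional


def should_proceed(band: str, *, force: bool = False,
                   interactive: Optional[bool] = None,
                   confirm: Callable[[str], str] = input) -> bool:
    """Decide whether to start GEPA after the saturation probe has run."""
    if band == "healthy" or force:
        # --force-saturation-check proceeds regardless of band.
        return True
    if interactive is None:
        # No TTY on stdin (CI, background jobs, cron) counts as non-interactive.
        interactive = sys.stdin.isatty()
    if not interactive:
        # Default-deny: the caller prints the override hint and exits cleanly.
        return False
    answer = confirm("Continue anyway? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing `confirm` and `interactive` as parameters keeps the decision testable without a real terminal. `--no-saturation-check` is not modeled here because it skips the probe entirely, so no band exists to decide on.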

### Closed-loop validation (real agent on real tasks)

The framework's deploy gate scores evolved artifacts against an LM-judge on a synthetic eval set. That's a closed loop: an LM scoring another LM's output on tasks a third LM made up. To break the loop, point a real agent at a small task suite with the baseline and evolved artifacts and see whether real agent behavior actually shifted:
14 changes: 11 additions & 3 deletions docs/architecture.md
@@ -13,12 +13,17 @@ flowchart LR
A[CLI<br/>--skill X] --> B[Resolve SKILL.md<br/>SkillSource]
B --> C[Build eval dataset<br/>synthetic / golden / sessiondb]
C --> D[Wrap as<br/>SkillModule dspy.Module]
D --> E[GEPA optimizer<br/>+ BudgetAwareProposer]
D --> SAT[Saturation pre-flight<br/>baseline holdout + closed-loop probe]
SAT --> SATB{band ==<br/>healthy?}
SATB -- no --> SATA[Rich panel + prompt<br/>or default-deny]
SATA -- abort --> Z[sys.exit 0]
SATA -- proceed --> E
SATB -- yes --> E[GEPA optimizer<br/>+ BudgetAwareProposer]
E --> F[Knee-point<br/>Pareto selection]
F --> G[Static<br/>constraints]
G --> H{pass?}
H -- no --> I[Write evolved_FAILED.md<br/>+ gate_decision.json]
H -- yes --> J[Holdout eval<br/>dspy.Evaluate × 2]
H -- yes --> J[Holdout eval<br/>dspy.Evaluate × 1 evolved<br/>baseline reused from SAT]
J --> K[Paired bootstrap<br/>per-example deltas]
K --> L[Growth-with-quality<br/>gate]
L --> M{deploy?}
@@ -166,7 +171,10 @@ When growth is below the free threshold, the gate degrades to "no-regression onl
### 9. Cost-ceiling kill switch
`LMTimingCallback` also drives a per-run `CostLedger` that accumulates per-call cost from litellm's `_hidden_params`. `--max-total-cost-usd <N>` arms the ledger; once the accumulated cost crosses `N`, the next LM call raises `CostCeilingExceeded` from `LMTimingCallback.on_lm_start`. The orchestrator catches this at the top level and writes a `decision="aborted"` `gate_decision.json` with `cost_at_abort_usd` + `cost_ceiling_usd` + `cost_summary`. Worst-case overshoot is one LM call past the ceiling.
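
The "one LM call past the ceiling" bound follows from checking the ledger before each call and recording cost after it. A minimal sketch, assuming that ordering: the class names mirror the text, but the attributes and methods (`total_usd`, `check_before_call`, `record`) and the `run_calls` driver are invented for illustration and are not the framework's actual interface.

```python
from typing import Iterable, Optional


class CostCeilingExceeded(RuntimeError):
    """Raised before an LM call once the ceiling has already been crossed."""


class CostLedger:
    def __init__(self, ceiling_usd: Optional[float] = None) -> None:
        self.ceiling_usd = ceiling_usd  # None means the ledger is not armed
        self.total_usd = 0.0

    def check_before_call(self) -> None:
        """Called at the start of every LM call (assumed hook point)."""
        if self.ceiling_usd is not None and self.total_usd > self.ceiling_usd:
            raise CostCeilingExceeded(
                f"spent {self.total_usd:.4f} USD, ceiling {self.ceiling_usd:.4f} USD"
            )

    def record(self, cost_usd: float) -> None:
        """Called after a call completes, once its cost is known."""
        self.total_usd += cost_usd


def run_calls(ledger: CostLedger, call_costs: Iterable[float]) -> int:
    """Simulate LM calls; return how many completed before the abort."""
    completed = 0
    for cost in call_costs:
        try:
            ledger.check_before_call()
        except CostCeilingExceeded:
            break
        ledger.record(cost)
        completed += 1
    return completed
```

With a ceiling of 1.0 USD and calls costing 0.4 USD each, the third call starts at 0.8 USD (below the ceiling) and finishes at about 1.2 USD; the fourth call is the one that raises. The overshoot is therefore bounded by the cost of a single call.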

### 10. Closed-loop validation as a separate surface
### 10. Saturation pre-flight as a separate concern from the gate
`evolution/core/saturation_check.py` runs BEFORE GEPA setup: scores the baseline on the holdout (and the closed-loop suite when configured), classifies into four bands (`healthy` / `no_headroom` / `weak_signal` / `uniform_failure`), and renders a Rich panel. Non-healthy bands prompt for confirmation in interactive contexts; default-deny in non-interactive contexts (no TTY) with a `--force-saturation-check` override. Skippable with `--no-saturation-check`. The probe's `holdout_per_example` is stashed and reused at the post-GEPA holdout site so net cost stays ~zero. Mirrors the `evolution/core/auth_check.py` pattern: pure helper returns a structured `SaturationReport`; rendering + exit handled by the call site. This is independent of the deploy gate (which runs AFTER GEPA on the evolved artifact) — the pre-flight is a "should we even start" decision; the gate is a "did we improve" decision.
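
The "pure helper returns a structured report" pattern might look like the sketch below. Everything beyond the band names, `SaturationReport`, and `holdout_per_example` is an assumption: the thresholds (`ceiling`, `floor`, `min_spread`), the band criteria themselves, and the `classify_baseline` name are hypothetical and are not taken from `saturation_check.py`.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SaturationReport:
    band: str
    holdout_mean: float
    # Kept so the post-GEPA holdout eval can reuse the baseline scores.
    holdout_per_example: Tuple[float, ...]


def classify_baseline(scores: Sequence[float], *, ceiling: float = 0.95,
                      floor: float = 0.05, min_spread: float = 0.02) -> SaturationReport:
    """Pure helper: no printing, no exit; the call site renders and decides."""
    if not scores:
        raise ValueError("need at least one holdout score")
    avg = mean(scores)
    if avg >= ceiling:
        band = "no_headroom"       # assumed: baseline already near the maximum
    elif avg <= floor:
        band = "uniform_failure"   # assumed: baseline fails nearly everything
    elif pstdev(scores) < min_spread:
        band = "weak_signal"       # assumed: scores barely vary across examples
    else:
        band = "healthy"
    return SaturationReport(band, avg, tuple(scores))
```

Because the helper only returns data, the same report can drive the Rich panel, the interactive prompt, and the later reuse of `holdout_per_example` without re-scoring the baseline.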

### 11. Closed-loop validation as a separate surface
`evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Available three ways:
- **Post-gate veto** (`--benchmark-cmd "python -m evolution.validation.closed_loop ..."`) — runs after the deploy gate passes; nonzero exit flips the decision to reject with `reason="benchmark_failed"`.
- **Reflection feedback** (`--closed-loop-during-evolution <suite.jsonl> --closed-loop-mode feedback`) — `ClosedLoopFeedbackCache` runs the validator during the GEPA loop, saturation-gated, and the verdict is rendered into the reflection LM's input via the metric's `dspy.Prediction.feedback` string. Score channel untouched.
6 changes: 4 additions & 2 deletions docs/codebase_info.md
@@ -50,6 +50,7 @@ evolution/
│ ├── fitness.py # LLMJudge + GEPA-shaped metric + behavioral score helper
│ ├── lm_timing_callback.py # LM-call observability + cost ledger + cost-ceiling kill switch
│ ├── quality_gate.py # preset table + write_gate_decision (shared by skill/tool pipelines)
│ ├── saturation_check.py # pre-flight: classify baseline into healthy/no_headroom/weak_signal/uniform_failure + Rich panel + abort
│ ├── skill_sources.py # SkillSource protocol + 3 implementations
│ └── stats.py # paired_bootstrap CI
├── skills/ # Tier 1: skill-file evolution
@@ -90,7 +91,8 @@ evolution/
| `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper |
| `evolution/core/constraints.py` | ~320 | static + growth-with-quality + size constraints |
| `evolution/skills/budget_aware_proposer.py` | ~300 | char-budget reflection prompt |
| `evolution/core/closed_loop_feedback.py` | ~295 | cache + saturation gate + deterministic feedback block |
| `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) |
| `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm |
| `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch |
| `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check |
| `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision |
@@ -109,7 +111,7 @@ evolution/
| `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
| **Total** | **~9,000** | excludes empty `__init__.py` shims |

Test suite: 37 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **681 tests** collected.
Test suite: 55 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1076 tests** collected.

## Runtime dependencies
