jramos · May 22, 2026 · May 21, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -74,9 +74,9 @@ The `evolution/<tier>/` directories form **a clean layering**: `evolution/core/`
 
 1. CLI resolves `--skill <name>` to a `SKILL.md` via the `SkillSource` walk.
 2. Eval dataset is built (synthetic LM gen / golden file / sessiondb mining).
-3. Skill body wrapped as `dspy.Module`; GEPA optimizes it with `BudgetAwareProposer` injecting a char budget into the reflection prompt.
+3. Skill body wrapped as `dspy.Module`. **Saturation pre-flight** (`evolution/core/saturation_check.py`) scores the baseline on the holdout + closed-loop suite, classifies it into one of four bands, and on non-`healthy` bands prompts for confirmation (interactive) or aborts (non-interactive). Pass `--no-saturation-check` to skip the probe, or `--force-saturation-check` to override the default-deny in non-interactive contexts. Then GEPA optimizes the candidate with `BudgetAwareProposer` injecting a char budget into the reflection prompt.
 4. Knee-point Pareto selection walks the candidates within ε of the best valset score in `--knee-point-strategy` order. Default `val-best`: highest val first, smallest body as tiebreak. `smallest` (greedy parsimony) is available via the flag for users explicitly chasing compression.
-5. Static constraints + paired-bootstrap growth-quality gate decide deploy vs. reject; both outcomes write `gate_decision.json`. The default rule is `no_regression` (`mean >= 0`); `--quality-gate non-inferiority` switches to `lower_bound > -inferiority_tolerance` (recommended for compression-focused runs at small N where the bootstrap CI swamps tiny effects).
+5. Static constraints + paired-bootstrap growth-quality gate decide deploy vs. reject; both outcomes write `gate_decision.json`. The default rule is `no_regression` (`mean >= 0`); `--quality-gate non-inferiority` switches to `lower_bound > -inferiority_tolerance` (recommended for compression-focused runs at small N where the bootstrap CI swamps tiny effects). The post-GEPA holdout eval reuses the baseline scores from the pre-flight, so when the pre-flight has run, its holdout scoring adds ~zero net cost.
 
 ## What lives where
 
@@ -101,6 +101,7 @@ The `evolution/<tier>/` directories form **a clean layering**: `evolution/core/`
 | Tool-flavored judge + tool metric | `evolution/tools/tool_judge.py` |
 | Behavioral `dspy.Example` builder for closed-loop trainset | `evolution/core/behavioral_example.py` |
 | Closed-loop verdict cache + deterministic feedback rendering | `evolution/core/closed_loop_feedback.py` |
+| Saturation pre-flight (band classifier + Rich panel + interactive confirm) | `evolution/core/saturation_check.py` |
 | Deploy gate (static + growth-quality) | `evolution/core/constraints.py` |
 | Preset table + gate-decision persistence (shared by skill/tool) | `evolution/core/quality_gate.py` |
 | Paired-bootstrap CI | `evolution/core/stats.py` |
@@ -268,6 +269,7 @@ Open questions deferred to future PRs (per `PLAN.md` deviation notes):
 - GEPA Pareto-frontier checkpointing (so a `TimeoutError` mid-run doesn't lose all candidates)
 - Skill-size-based reflection-LM timeout scaling
 - BCa bootstrap upgrade once N≥20 routinely
+- **GEPA acceptance-gate work** (deviation #8 follow-up): the saturation pre-flight (`evolution/core/saturation_check.py`) addresses the user-visible symptom on saturated baselines (abort before GEPA spends budget). The underlying mechanism gap — stochastic small-minibatch `sum()` acceptance discarding per-instance signal — is tracked as Path D/E/C in `reports/pareto_frontier_feasibility.md` and remains future work (likely an upstream DSPy or GEPA PR).
 
 ## When to consult which doc
 

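The two gate rules named in step 5 above can be sketched as follows. This is a minimal illustration, assuming a two-sided 95% percentile bootstrap over per-example deltas; the function names, resample count, seed, and the `0.02` tolerance are hypothetical and are not taken from `evolution/core/stats.py` or `evolution/core/constraints.py`.

```python
import random
from statistics import mean


def paired_bootstrap(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over per-example deltas (evolved - baseline).

    Returns (observed_mean, lower_bound). The resample count, alpha, and
    seed are illustrative defaults, not values taken from the repository.
    """
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the paired deltas with replacement and sort the resampled means.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower_bound = resampled_means[int((alpha / 2) * n_resamples)]
    return mean(deltas), lower_bound


def gate_passes(deltas, rule="no_regression", inferiority_tolerance=0.02):
    """Apply one of the two rules described in the text (tolerance is hypothetical)."""
    observed_mean, lower_bound = paired_bootstrap(deltas)
    if rule == "no_regression":
        return observed_mean >= 0
    if rule == "non-inferiority":
        return lower_bound > -inferiority_tolerance
    raise ValueError(f"unknown rule: {rule}")
```

With nine unchanged examples and one that drops by 0.01, `no_regression` rejects (the mean is slightly negative) while `non-inferiority` accepts, because the lower bound cannot fall below -0.01 and so stays inside the tolerance.
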
diff --git a/PLAN.md b/PLAN.md
@@ -460,6 +460,8 @@ These descriptions are sent with every API call as part of the tool schema — e
 7. **N=2 saturated baselines.** Weak-target hunt ran `evolve_tool` against `write_file` (98.8–99.2% holdout, 3 seeds, 1×/3× iter) and `search_files` (98.6% holdout). Both runs produced evolved descriptions byte-identical to the baseline — the knee-point picker correctly reverts to the unchanged baseline when GEPA's variants tie. The framework's tool-description pipeline is regression-catching, not improvement-finding, on these hand-tuned descriptions.
 8. **Closed-loop signal can flow into reflection but doesn't change selection on saturated baselines.** The `--closed-loop-during-evolution` flag plumbs `ValidationReport`s into the GEPA reflection LM's feedback channel via the existing 5-arg metric protocol, opt-in, saturation-gated. Verified end-to-end on `write_file`: closed-loop fired (file mutated + restored), the reflection LM saw the verdict, GEPA still selected the baseline byte-for-byte. The bottleneck sits upstream of reflection — GEPA's `sum(judge_scores)` acceptance rule ties when every candidate hits 1.0 on a saturated minibatch. Extending the Pareto frontier into behavioral space (closed-loop tasks as additional training-set instances with their own per-instance scores so a candidate can stay on the frontier by winning behavioral tasks) is the structural direction that would address this; the cache + renderer added here are the natural building blocks for that work.
 
+   **Follow-up — Path F (saturation pre-flight) addresses the user-visible symptom, not the underlying mechanism.** A separate investigation (`reports/pareto_frontier_feasibility.md`, two spike runs) confirmed the deviation's diagnosis and reframed it: the bottleneck isn't frontier shape, it's GEPA's stochastic small-minibatch `sum()` acceptance gate discarding per-instance signal before it can move selection. Path F (`evolution/core/saturation_check.py`) ships the user-visible fix — detect the saturated case before GEPA starts, render a panel explaining why no improvement is possible, default-deny in non-interactive contexts. This prevents the wasted-budget UX without solving the mechanism gap. The mechanism-side fix (Pareto-dominance acceptance, larger minibatch, or stratified sampling) is tracked as "Path D/E/C" in the feasibility report and remains future work.
+
 ### Phase 3: System Prompt Evolution
 
 **Goal:** Optimize the sections of the system prompt that guide agent behavior.

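The acceptance tie described in deviation 8 can be illustrated with a toy model. This is a sketch of the rule as the text describes it, assuming a child is accepted only when its minibatch score sum strictly exceeds its parent's; it is not GEPA's source code, and all function names are hypothetical.

```python
def sum_accepts(parent_scores, child_scores):
    """Toy sum-based gate: accept only on a strictly higher minibatch sum."""
    return sum(child_scores) > sum(parent_scores)


def per_instance_wins(parent_scores, child_scores):
    """Indices of minibatch instances where the child strictly beats the parent."""
    return [
        i
        for i, (p, c) in enumerate(zip(parent_scores, child_scores))
        if c > p
    ]


def dominated_by_parent(parent_scores, child_scores):
    """True if the parent is at least as good everywhere and better somewhere."""
    pairs = list(zip(parent_scores, child_scores))
    return all(p >= c for p, c in pairs) and any(p > c for p, c in pairs)


# Saturated minibatch: every candidate scores 1.0, so the sums tie.
saturated_parent = [1.0, 1.0, 1.0]
saturated_child = [1.0, 1.0, 1.0]

# Mixed minibatch: the child wins instance 1 but the sums still tie.
mixed_parent = [1.0, 0.5, 1.0]
mixed_child = [0.5, 1.0, 1.0]
```

In the mixed case `sum_accepts` returns `False` even though `per_instance_wins` reports a win on instance 1 and the child is not dominated by its parent. That is the per-instance signal the text says the `sum()` gate discards, and the kind of case a Pareto-dominance acceptance rule would keep.
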
diff --git a/README.md b/README.md
@@ -245,6 +245,27 @@ uv run python -m evolution.tools.evolve_tool --tool X --manifest Y \
 
 Env vars: `EVOLVED_PATH`, `BASELINE_PATH`, `RUN_DIR`, `TARGET_NAME`, `ARTIFACT_TYPE`. The hook runs under `/bin/sh -c` — interactive aliases are not available; invoke binaries by full name. Trust boundary: the command string is yours, do not pass strings you didn't write yourself.
 
+### Saturation pre-flight (don't burn GEPA budget on hopeless runs)
+
+By default, every `evolve_skill` / `evolve_tool` run starts with a pre-flight: it scores the baseline on the holdout (and on the closed-loop suite, if `--closed-loop-during-evolution` is set), classifies the result into one of four bands (`healthy` / `no_headroom` / `weak_signal` / `uniform_failure`), and stops before spending GEPA budget when the band is not `healthy`, for example when the baseline is already saturated.
+
+```
+Saturation check: holdout=0.987 (50 ex), closed-loop=1.000 (7 tasks)
+╭─── No measurable headroom ───────────╮
+│ Band: no_headroom                    │
+│ • Baseline already saturates the eval│
+│ • Try a harder closed-loop suite     │
+│ • Sanity check: synthetic generator? │
+╰──────────────────────────────────────╯
+Non-interactive context; refusing to proceed.
+Pass --force-saturation-check to override.
+```
+
+In interactive contexts, non-`healthy` bands prompt for confirmation (`Continue anyway? [y/N]`). In non-interactive contexts (no TTY on stdin — CI, background jobs, cron), the framework default-denies and exits cleanly with the override hint. Net cost is ~zero: the probe's holdout scores are reused at the post-GEPA evaluation site, so the baseline isn't re-scored at run end.
+
+- `--no-saturation-check` skips the probe entirely (useful when you've already validated headroom externally)
+- `--force-saturation-check` runs the probe + renders the panel but proceeds regardless of band
+
 ### Closed-loop validation (real agent on real tasks)
 
 The framework's deploy gate scores evolved artifacts against an LM-judge on a synthetic eval set. That's a closed loop: an LM scoring another LM's output on tasks a third LM made up. To break the loop, point a real agent at a small task suite with the baseline and evolved artifacts and see whether real agent behavior actually shifted:

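The proceed / prompt / abort behaviour described in the README hunk above can be sketched as a small decision function. This is a minimal sketch, assuming the two flags map onto booleans and that "interactive" means stdin is a TTY; `Action`, `preflight_action`, and `stdin_is_interactive` are hypothetical names, not the API of `evolution/core/saturation_check.py`.

```python
import sys
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"  # continue into GEPA
    PROMPT = "prompt"    # ask "Continue anyway? [y/N]"
    ABORT = "abort"      # default-deny and exit with the override hint


def stdin_is_interactive():
    """Treat a TTY on stdin as an interactive context (assumption from the text)."""
    return sys.stdin.isatty()


def preflight_action(band, *, interactive, skip=False, force=False):
    """Map a band plus flags to an action.

    skip  models --no-saturation-check (the probe never runs, so band is unused).
    force models --force-saturation-check (panel is rendered, run proceeds).
    """
    if skip:
        return Action.PROCEED
    if band == "healthy" or force:
        return Action.PROCEED
    return Action.PROMPT if interactive else Action.ABORT
```

A call site would typically pass `interactive=stdin_is_interactive()`, so CI, cron, and background jobs fall through to `Action.ABORT` unless `force` is set.
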
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -13,12 +13,17 @@ flowchart LR
     A[CLI<br/>--skill X] --> B[Resolve SKILL.md<br/>SkillSource]
     B --> C[Build eval dataset<br/>synthetic / golden / sessiondb]
     C --> D[Wrap as<br/>SkillModule dspy.Module]
-    D --> E[GEPA optimizer<br/>+ BudgetAwareProposer]
+    D --> SAT[Saturation pre-flight<br/>baseline holdout + closed-loop probe]
+    SAT --> SATB{band ==<br/>healthy?}
+    SATB -- no --> SATA[Rich panel + prompt<br/>or default-deny]
+    SATA -- abort --> Z[sys.exit 0]
+    SATA -- proceed --> E
+    SATB -- yes --> E[GEPA optimizer<br/>+ BudgetAwareProposer]
     E --> F[Knee-point<br/>Pareto selection]
     F --> G[Static<br/>constraints]
     G --> H{pass?}
     H -- no --> I[Write evolved_FAILED.md<br/>+ gate_decision.json]
-    H -- yes --> J[Holdout eval<br/>dspy.Evaluate × 2]
+    H -- yes --> J[Holdout eval<br/>dspy.Evaluate × 1 evolved<br/>baseline reused from SAT]
     J --> K[Paired bootstrap<br/>per-example deltas]
     K --> L[Growth-with-quality<br/>gate]
     L --> M{deploy?}
@@ -166,7 +171,10 @@ When growth is below the free threshold, the gate degrades to "no-regression onl
 ### 9. Cost-ceiling kill switch
 `LMTimingCallback` also drives a per-run `CostLedger` that accumulates per-call cost from litellm's `_hidden_params`. `--max-total-cost-usd <N>` arms the ledger; once the accumulated cost crosses `N`, the next LM call raises `CostCeilingExceeded` from `LMTimingCallback.on_lm_start`. The orchestrator catches this at the top level and writes a `decision="aborted"` `gate_decision.json` with `cost_at_abort_usd` + `cost_ceiling_usd` + `cost_summary`. Worst-case overshoot is one LM call past the ceiling.
 
-### 10. Closed-loop validation as a separate surface
+### 10. Saturation pre-flight as a separate concern from the gate
+`evolution/core/saturation_check.py` runs BEFORE GEPA setup: scores the baseline on the holdout (and the closed-loop suite when configured), classifies into four bands (`healthy` / `no_headroom` / `weak_signal` / `uniform_failure`), and renders a Rich panel. Non-healthy bands prompt for confirmation in interactive contexts; default-deny in non-interactive contexts (no TTY) with a `--force-saturation-check` override. Skippable with `--no-saturation-check`. The probe's `holdout_per_example` is stashed and reused at the post-GEPA holdout site so net cost stays ~zero. Mirrors the `evolution/core/auth_check.py` pattern: pure helper returns a structured `SaturationReport`; rendering + exit handled by the call site. This is independent of the deploy gate (which runs AFTER GEPA on the evolved artifact) — the pre-flight is a "should we even start" decision; the gate is a "did we improve" decision.
+
+### 11. Closed-loop validation as a separate surface
 `evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Available three ways:
 - **Post-gate veto** (`--benchmark-cmd "python -m evolution.validation.closed_loop ..."`) — runs after the deploy gate passes; nonzero exit flips the decision to reject with `reason="benchmark_failed"`.
 - **Reflection feedback** (`--closed-loop-during-evolution <suite.jsonl> --closed-loop-mode feedback`) — `ClosedLoopFeedbackCache` runs the validator during the GEPA loop, saturation-gated, and the verdict is rendered into the reflection LM's input via the metric's `dspy.Prediction.feedback` string. Score channel untouched.

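The score reuse described in section 10 above (the probe's `holdout_per_example` stashed and reused at the post-GEPA holdout site) might look roughly like this. It is a sketch under stated assumptions: only the names `SaturationReport` and `holdout_per_example` come from the text; the `band` field, the `per_example_deltas` helper, and the `score_baseline` fallback callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SaturationReport:
    """Structured result of the pre-flight probe (fields beyond the name are assumed)."""
    band: str
    holdout_per_example: tuple[float, ...]


def per_example_deltas(
    evolved_scores: Sequence[float],
    report: Optional[SaturationReport],
    score_baseline: Callable[[], Sequence[float]],
) -> list[float]:
    """Return evolved - baseline per holdout example.

    When a report exists, its stashed baseline scores are reused and
    score_baseline is never called. It is only a fallback for runs where
    the pre-flight was skipped.
    """
    if report is not None:
        baseline = report.holdout_per_example
    else:
        baseline = tuple(score_baseline())
    if len(baseline) != len(evolved_scores):
        raise ValueError("baseline and evolved scores must be paired per example")
    return [e - b for e, b in zip(evolved_scores, baseline)]
```

The resulting deltas are the paired input to the bootstrap step in the flowchart, which is why the evolved artifact needs only one `dspy.Evaluate` pass when the pre-flight has already scored the baseline.
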
diff --git a/docs/codebase_info.md b/docs/codebase_info.md
@@ -50,6 +50,7 @@ evolution/
 │   ├── fitness.py                       # LLMJudge + GEPA-shaped metric + behavioral score helper
 │   ├── lm_timing_callback.py            # LM-call observability + cost ledger + cost-ceiling kill switch
 │   ├── quality_gate.py                  # preset table + write_gate_decision (shared by skill/tool pipelines)
+│   ├── saturation_check.py              # pre-flight: classify baseline into healthy/no_headroom/weak_signal/uniform_failure + Rich panel + abort
 │   ├── skill_sources.py                 # SkillSource protocol + 3 implementations
 │   └── stats.py                         # paired_bootstrap CI
 ├── skills/                              # Tier 1: skill-file evolution
@@ -90,7 +91,8 @@ evolution/
 | `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper |
 | `evolution/core/constraints.py` | ~320 | static + growth-with-quality + size constraints |
 | `evolution/skills/budget_aware_proposer.py` | ~300 | char-budget reflection prompt |
-| `evolution/core/closed_loop_feedback.py` | ~295 | cache + saturation gate + deterministic feedback block |
+| `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) |
+| `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm |
 | `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch |
 | `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check |
 | `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision |
@@ -109,7 +111,7 @@ evolution/
 | `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
 | **Total** | **~9,000** | excludes empty `__init__.py` shims |
 
-Test suite: 37 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **681 tests** collected.
+Test suite: 55 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1076 tests** collected.
 
 ## Runtime dependencies