feat: saturation pre-flight for evolve_skill and evolve_tool#64

Merged
jramos merged 8 commits into main from path-f-saturation-preflight on May 22, 2026

Owner

@jramos jramos commented May 22, 2026

Summary

  • New evolution/core/saturation_check.py module + --no-saturation-check / --force-saturation-check flags on both evolve_skill and evolve_tool. Scores the baseline on the holdout (and the closed-loop suite, if --closed-loop-during-evolution is set) BEFORE GEPA starts, classifies into healthy / no_headroom / weak_signal / uniform_failure, then prompts (interactive context) or default-denies (non-interactive — no TTY on stdin) on non-healthy bands. The preflight's baseline holdout scores are reused at the post-GEPA evaluation site, so net cost stays roughly zero.
  • New ClosedLoopFeedbackCache.force_run method: bypasses the saturation gate (which returns None at preflight time in default sampled mode, before any judge scores have been recorded) and propagates validator errors instead of swallowing them. Preserves the first-fire allowance for downstream get_or_run callers.
  • Refined band classifier so the common saturated case lands in no_headroom rather than healthy: closed-loop ≥ 0.95 plus synthetic ≥ 0.95 → no_headroom. End-to-end smoke against the known-saturated write_file baseline (synthetic 0.987, closed-loop 1.000) now aborts cleanly with the panel instead of running 155 no-op GEPA iterations.
  • Docs updated across the board: architecture flow diagram, components reference, data models, workflows (new Phase B.5), CLI interfaces, README, AGENTS.md (5-line summary + component map), and a PLAN.md follow-up paragraph on the deviation that motivated this work.

Test plan

  • Pull, run uv run pytest -q: 1076 passed in 101s (up from 1063 pre-branch, +13 new tests).
  • Manual smoke against a known-saturated baseline (evolve_tool --tool write_file with default Hermes config + --closed-loop-during-evolution evolution/validation/suites/write_file.jsonl --closed-loop-hermes-repo <path>): abort fired. The "No measurable headroom" panel rendered (band=no_headroom, holdout=0.987, closed-loop=1.000), followed by "Non-interactive context; refusing to proceed. Pass --force-saturation-check to override."; exit 0; zero GEPA spend (no "Running GEPA optimization" output); hermes-agent checkout verified clean.
  • --no-saturation-check skips the preflight: the log went directly from "Validating baseline description" to "Running GEPA optimization (max_full_evals=1)" with no saturation panel; valset stayed at 39 examples (no behavioral injection since CL wasn't enabled); exit 0.
  • --force-saturation-check against the same saturated baseline: panel rendered ("No measurable headroom", identical content to the manual smoke above), then "Running GEPA optimization" proceeded; GEPA finished its no-op pass and the gate produced its normal "Evolution did not improve" line; exit 0.
  • Cache reuse: targeted run of test_cache_reuse_skips_baseline_re_eval_after_gepa in both pipelines — 2/2 passed in 1.55s. Asserts _holdout_evaluate_with_metric call count == 1 when the preflight ran, vs. the call count of 2 when --no-saturation-check is set.

jramos added 8 commits May 21, 2026 10:57
…ache

Adds evolution/core/saturation_check.py mirroring auth_check's shape:
pure saturation_preflight() returns a SaturationReport classifying the
baseline into healthy / no_headroom / weak_signal / uniform_failure.
Call sites in evolve_skill / evolve_tool will render a Rich panel and
decide whether to prompt or default-deny (next two commits).

Also adds ClosedLoopFeedbackCache.force_run: bypasses should_run()
and propagates validator exceptions (unlike get_or_run which swallows
the expected ones to keep GEPA going). Preflight needs to fire the
validator once at startup, before any judge scores have been recorded,
which is when get_or_run would return None in sampled mode.

Pure helpers; no wiring yet. Wiring lands in feat(skills) and
feat(tools) follow-ups.

After the synthetic dataset builds and the baseline module / metric /
closed_loop_cache are constructed (and before GEPA setup), the
framework now runs the saturation preflight from feat(core). Two new
flags: --no-saturation-check (skip entirely) and
--force-saturation-check (run + render but bypass the
abort/prompt). Default UX in interactive contexts is warn+confirm;
in non-interactive contexts (no TTY on stdin), non-healthy bands
exit cleanly with a "use --force-saturation-check" hint.
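
The flag and TTY handling described above can be sketched roughly as follows. This is an illustration only, not the project's code: the `Action` enum, the `decide` function, and its parameter names are hypothetical. Only the band names, the two flags, and the "no TTY on stdin" check come from the commit message.

```python
import sys
from enum import Enum


class Action(Enum):
    SKIP_PREFLIGHT = "skip_preflight"  # --no-saturation-check
    PROCEED = "proceed"                # healthy band, or forced
    PROMPT = "prompt"                  # interactive: warn, then confirm
    DENY = "deny"                      # non-interactive: default-deny


def decide(band, *, no_check=False, force=False, stdin_is_tty=None):
    """Map a band plus flags to an action (hypothetical helper)."""
    if no_check:
        # The preflight is skipped entirely, so the band is never consulted.
        return Action.SKIP_PREFLIGHT
    if band == "healthy" or force:
        # --force-saturation-check still runs and renders, but never blocks.
        return Action.PROCEED
    if stdin_is_tty is None:
        stdin_is_tty = sys.stdin.isatty()
    return Action.PROMPT if stdin_is_tty else Action.DENY
```

For example, `decide("no_headroom", stdin_is_tty=False)` yields `Action.DENY`, which is where the call site would print the `--force-saturation-check` hint.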

The baseline holdout per-example scores from the preflight are
stashed and reused at the post-GEPA holdout-comparison call site, so
the baseline isn't re-scored at run end. Net cost: ~zero.

Closes the "doomed runs spend GEPA budget before any signal" gap
documented in reports/pareto_frontier_feasibility.md spike #2.

Symmetric to the evolve_tool wiring from the previous commit.
After the synthetic dataset builds and baseline_module / metric /
closed_loop_cache are constructed (and before GEPA setup), the
framework runs saturation_preflight; non-healthy bands prompt
(interactive) or default-deny (non-interactive) with a
--force-saturation-check override. Baseline holdout per-example
scores from the preflight are reused at the post-GEPA
holdout-comparison call site to keep net cost ~zero.

The per-candidate _holdout_evaluate_with_metric inside the
knee-point loop is deliberately untouched — only the final
baseline-vs-evolved comparison reuses the cached scores.

Completes Path F across both pipelines.
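
The score-reuse idea above can be sketched as follows. This is a sketch under stated assumptions: `compare_holdout`, `evaluate`, and `cached_baseline_scores` are hypothetical names, and the real call site (`_holdout_evaluate_with_metric`) may take different arguments.

```python
from statistics import mean


def compare_holdout(evaluate, baseline, evolved, cached_baseline_scores=None):
    """Return (baseline_mean, evolved_mean) for the final comparison.

    `evaluate` is any callable returning per-example scores for a module.
    The baseline is re-scored only when the preflight left nothing cached.
    """
    if cached_baseline_scores is None:
        baseline_scores = evaluate(baseline)      # --no-saturation-check path
    else:
        baseline_scores = cached_baseline_scores  # reuse preflight scores
    evolved_scores = evaluate(evolved)
    return mean(baseline_scores), mean(evolved_scores)
```

With cached scores, `evaluate` runs once (evolved only); without them it runs twice, which mirrors the call-count difference the integration tests assert.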
…e tests

Two follow-ups from the final code review of the Path F branch:

1. ClosedLoopFeedbackCache.force_run was resetting _iters_since_last_run
   to 0, eating the "allow first fire" allowance that __init__ sets up
   (= min_iters). In sampled gate_mode this delayed the first GEPA-time
   closed-loop fire by min_iters iterations. Now force_run preserves
   the allowance so subsequent get_or_run calls fire as originally
   designed. Tests confirm should_run() still returns True after a
   force_run when judge history is empty.

2. Added integration tests for both evolve_tool and evolve_skill that
   verify the cache-reuse mechanism: when the saturation preflight runs
   and populates the cached baseline holdout scores, the post-GEPA
   evaluation site reuses them instead of re-running the baseline
   eval. This locks in the "net cost ~zero" correctness claim.
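
A toy model of the first-fire allowance from item 1 is shown below. It is deliberately not the real `ClosedLoopFeedbackCache`: it models only the iteration counter and ignores judge-score gating and gate modes. The class name `ToyFeedbackCache` is hypothetical, and `RuntimeError` is a placeholder for whatever exceptions the real `get_or_run` treats as expected.

```python
class ToyFeedbackCache:
    """Toy stand-in that models only the _iters_since_last_run counter."""

    def __init__(self, validator, min_iters):
        self._validator = validator
        self._min_iters = min_iters
        # Start at min_iters so the very first get_or_run call may fire.
        self._iters_since_last_run = min_iters

    def should_run(self):
        return self._iters_since_last_run >= self._min_iters

    def get_or_run(self, candidate):
        if not self.should_run():
            self._iters_since_last_run += 1
            return None
        self._iters_since_last_run = 0
        try:
            return self._validator(candidate)
        except RuntimeError:
            # Swallow the (assumed) expected failure so optimization continues.
            return None

    def force_run(self, candidate):
        # Bypass should_run(), let exceptions propagate, and leave the
        # counter untouched so the first-fire allowance survives.
        return self._validator(candidate)
```

In this model, calling `force_run` at preflight time leaves `should_run()` true, so the next `get_or_run` still fires immediately. Resetting the counter to 0 inside `force_run` would instead delay that first fire by `min_iters` iterations.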
…lose case

The default thresholds shipped in feat(core) were too strict for the
case Path F was built to catch. Spike #1 in the feasibility report
documented synthetic=0.987 + closed-loop=1.0 for the saturated
write_file baseline — GEPA can't improve on that, but the strict AND
(synthetic ≥ 0.99) gate let it through as healthy. The realtime
smoke from the merge-readiness check confirmed: preflight ran, both
scores looked right, classifier returned healthy, GEPA burned 155
no-op iterations.

Refined no_headroom logic:
- (synthetic ≥ 0.99 AND no CL signal) — unchanged, judge alone pegged
- (CL ≥ 0.95 AND synthetic ≥ weak_syn=0.95) — NEW, both signals
  effectively pegged

The synthetic_close gate on the new clause keeps (synthetic=0.5,
CL=1.0) classified as healthy — that scenario means there's real
judge signal to optimize over (or the eval is misconfigured) and
should not auto-abort.
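
The two no_headroom clauses can be written as a small predicate. This is a minimal sketch: the function and constant names are hypothetical, and "no CL signal" is assumed to mean the closed-loop score is `None`. The other bands (weak_signal, uniform_failure) are not modelled because their thresholds are not stated here.

```python
from typing import Optional

SYNTHETIC_PEGGED = 0.99    # judge alone pegged (unchanged clause)
CLOSED_LOOP_PEGGED = 0.95  # closed-loop threshold in the new clause
WEAK_SYN = 0.95            # "weak_syn" in the commit message


def is_no_headroom(synthetic: float, closed_loop: Optional[float]) -> bool:
    """Return True when the baseline leaves no measurable headroom."""
    if closed_loop is None:
        # No closed-loop signal: rely on the synthetic judge score alone.
        return synthetic >= SYNTHETIC_PEGGED
    # Both signals effectively pegged; the synthetic gate keeps
    # (synthetic=0.5, CL=1.0) out of this band.
    return closed_loop >= CLOSED_LOOP_PEGGED and synthetic >= WEAK_SYN
```

Under these assumptions, the write_file smoke case (synthetic 0.987, closed-loop 1.0) is classified as no_headroom, while (synthetic 0.5, closed-loop 1.0) is not.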

Two new tests pin both the smoke case and the edge case.

Updates across the docs/ knowledge base, AGENTS.md, README.md, and
PLAN.md to reflect the new saturation pre-flight feature:

- architecture.md: top-level flow now shows the pre-flight + abort
  path; new design pattern #10 separates the pre-flight (a "should we
  even start" decision) from the deploy gate (a "did we improve"
  decision).
- components.md: new saturation_check.py section documenting the
  band classifier logic + public surface; force_run added to the
  ClosedLoopFeedbackCache surface.
- data_models.md: new SaturationReport dataclass entry.
- workflows.md: Workflow 1 gets a Phase B.5 mermaid for the
  pre-flight; Phase D's holdout step shows the cache-reuse branch.
- interfaces.md: --no-saturation-check + --force-saturation-check
  added to both skill and tool flag tables.
- index.md: new routing entry, new cross-cutting topic, refreshed
  test count (681 → 1076), maintenance-note entry for the default
  thresholds (likely to be calibrated).
- codebase_info.md: saturation_check.py added to layout + LOC table;
  test count refreshed.
- framework_advantages.md: new "Saturation pre-flight that refuses
  to spend budget on hopeless runs" section, positioned as a
  framework advantage over raw GEPA.
- AGENTS.md: 5-line run summary updated; component map adds
  saturation_check.py; planned/deferred section gets a Path D/E/C
  entry pointing at the feasibility report.
- README.md: new "Saturation pre-flight" section in the Safety knobs
  area with example panel output.
- PLAN.md: deviation #8 gets a follow-up paragraph noting that
  Path F addresses the user-visible symptom but not the underlying
  acceptance-gate mechanism.

No source files touched.
…gration tests

Two of the new integration tests reached the real synthetic dataset
generator before the (mocked) saturation_preflight, so CI runs with a
fake OPENAI_API_KEY died on AuthError before the code under test ever
executed:

- test_saturated_band_non_interactive_aborts (both pipelines)
- test_cache_reuse_skips_baseline_re_eval_after_gepa (both pipelines)

Add a SyntheticDatasetBuilder mock that returns a small list/EvalDataset
of fake EvalExamples (no LM calls). Skill-side fake dataset is sized to
50 examples (30/10/10) so the holdout ≥ EvolutionConfig.min_holdout_size
guard doesn't trip before reaching the preflight.

Verified locally by running the test files under
  env -i ... OPENAI_API_KEY=sk-fake-test-key uv run pytest ...
to match the CI environment — all 10 saturation-preflight tests pass,
full suite still at 1076.

The other 3 tests in each file (test_no_saturation_check_flag_skips_helper,
test_healthy_band_does_not_prompt, test_force_saturation_check_overrides_abort)
"pass" in CI for the wrong reason — their assertions are satisfied even
when the run dies on AuthError before reaching the wiring under test.
Worth tightening in a follow-up; not blocking this fix.
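
The mocking pattern described above might look like the following. All names here are stand-ins defined inside the example: this `SyntheticDatasetBuilder`, its `build` method, and `run_pipeline` are hypothetical and do not reflect the project's real signatures. The point is that the builder is patched so no LM call happens, and that the test asserts the code under test was actually reached.

```python
from unittest import mock


class SyntheticDatasetBuilder:
    """Stand-in for a builder that would normally call an LM API."""

    def build(self, n):
        raise RuntimeError("would hit the network with a fake API key")


def run_pipeline(preflight):
    # Dataset generation happens before the preflight, as in the pipelines.
    dataset = SyntheticDatasetBuilder().build(50)
    return preflight(dataset)


def test_preflight_is_reached():
    fake_examples = [{"id": i} for i in range(50)]
    preflight = mock.Mock(return_value="no_headroom")
    with mock.patch.object(
        SyntheticDatasetBuilder, "build", return_value=fake_examples
    ):
        assert run_pipeline(preflight) == "no_headroom"
    # Fails if the run died before reaching the preflight.
    preflight.assert_called_once_with(fake_examples)
```

The final `assert_called_once_with` is what keeps the test from passing for the wrong reason: without it, a run that aborts early could still satisfy weaker assertions.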
…ning

Addresses six items from the PR review:

1. Non-interactive deny now exits 3 (was 0). A scheduled / CI / cron
   wrapper couldn't previously distinguish "refused to run because no
   TTY" from "ran cleanly". Interactive user-said-no still exits 0
   (success-by-intent). The integration test asserts the new code.

2. ClosedLoopFeedbackCache registers weakref.finalize for its tmp dir
   so SystemExit (the saturation abort path) triggers cleanup instead
   of leaking dirs into /tmp for the OS reaper to handle 3+ days later.
   Updated the class docstring to match.

3. saturation_preflight's docstring no longer claims "Pure: no side
   effects" — it has LM eval, may run a validator subprocess, mutates
   the cache. The actual property is "doesn't render, prompt, or exit"
   — call sites own those — and the docstring now says exactly that.

4. force_run's docstring spells out the _iters_since_last_run = min_iters
   contract (preserving the first-fire allowance for downstream
   get_or_run callers). Inline comment on the __init__ assignment
   anchors the invariant in both places so a future "cleanup" can't
   silently regress the fix to 0.

5. interactive_confirm's docstring acknowledges the EOFError branch
   the code already catches (not just KeyboardInterrupt).

6. Fixed 2 vacuous CLI tests that previously passed even when production
   was mutated to ignore the flags they claimed to test:
   test_force_saturation_check_overrides_abort and
   test_healthy_band_does_not_prompt now assert GEPA was actually
   instantiated. Both add the SyntheticDatasetBuilder / select_knee_point
   / _holdout_evaluate_with_metric mock chain so the run flows through
   the production code instead of dying on AuthError at dataset gen.
   Added a new test_user_declines_at_prompt_aborts in both pipelines
   covering the previously-untested "Aborted by user." branch.
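
Item 2's cleanup approach can be illustrated with the standard library alone. This is a generic sketch of `weakref.finalize` guarding a temporary directory, not the actual `ClosedLoopFeedbackCache` code; the `TmpDirOwner` class and the directory prefix are hypothetical.

```python
import os
import shutil
import tempfile
import weakref


class TmpDirOwner:
    """Owns a temp dir that is removed on GC or at interpreter exit."""

    def __init__(self):
        self.path = tempfile.mkdtemp(prefix="closed-loop-")
        # The callback must not reference self, or the object could never
        # be collected; passing the path string avoids that.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self.path, ignore_errors=True
        )

    def close(self):
        # Calling a finalizer runs the callback at most once.
        self._finalizer()


owner = TmpDirOwner()
tmp_path = owner.path
owner.close()
```

Finalizers registered this way also run at normal interpreter shutdown by default, which covers exits raised through `SystemExit`.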

77 saturation-related tests pass, full suite at 1078 (was 1076).
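
A scheduled wrapper could use the new exit code from item 1 roughly as below. This is a sketch: `describe_exit` and `run_and_describe` are hypothetical helpers, and it assumes exit code 3 is used only for the non-interactive deny, which the commit message does not state explicitly.

```python
import subprocess
import sys

EXIT_OK = 0
EXIT_PREFLIGHT_DENIED = 3  # non-interactive deny, per item 1


def describe_exit(returncode):
    """Translate a pipeline exit code into a message for a cron/CI log."""
    if returncode == EXIT_OK:
        return "completed (or declined interactively)"
    if returncode == EXIT_PREFLIGHT_DENIED:
        return "refused: saturation preflight denied without a TTY"
    return f"failed with exit code {returncode}"


def run_and_describe(cmd):
    # stdin is detached to mimic a scheduled, non-interactive context.
    result = subprocess.run(cmd, stdin=subprocess.DEVNULL)
    return describe_exit(result.returncode)


# Stand-in child process that exits 3, in place of a real pipeline run.
message = run_and_describe([sys.executable, "-c", "raise SystemExit(3)"])
```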
@jramos jramos merged commit b4e3a28 into main May 22, 2026
4 checks passed
@jramos jramos deleted the path-f-saturation-preflight branch May 22, 2026 02:46