feat: saturation pre-flight for evolve_skill and evolve_tool by jramos · Pull Request #64 · jramos/agent-self-evolution

jramos · 2026-05-22T00:47:43Z

Summary

New evolution/core/saturation_check.py module + --no-saturation-check / --force-saturation-check flags on both evolve_skill and evolve_tool. Scores the baseline on the holdout (and the closed-loop suite, if --closed-loop-during-evolution is set) BEFORE GEPA starts, classifies into healthy / no_headroom / weak_signal / uniform_failure, then prompts (interactive context) or default-denies (non-interactive — no TTY on stdin) on non-healthy bands. The preflight's baseline holdout scores are reused at the post-GEPA evaluation site, so net cost stays roughly zero.
New ClosedLoopFeedbackCache.force_run method: bypasses the saturation gate (which returns None at preflight time in default sampled mode, before any judge scores have been recorded) and propagates validator errors instead of swallowing them. Preserves the first-fire allowance for downstream get_or_run callers.
Refined band classifier so the common saturated case lands in no_headroom rather than healthy: closed-loop ≥ 0.95 plus synthetic ≥ 0.95 → no_headroom. End-to-end smoke against the known-saturated write_file baseline (synthetic 0.987, closed-loop 1.000) now aborts cleanly with the panel instead of running 155 no-op GEPA iterations.
Docs updated across the board: architecture flow diagram, components reference, data models, workflows (new Phase B.5), CLI interfaces, README, AGENTS.md (5-line summary + component map), and a PLAN.md follow-up paragraph on the deviation that motivated this work.

Test plan

Pull, run uv run pytest -q — 1076 passed in 101s (up from 1063 pre-branch, +13 new tests).
Manual smoke against a known-saturated baseline (evolve_tool --tool write_file with default Hermes config + --closed-loop-during-evolution evolution/validation/suites/write_file.jsonl --closed-loop-hermes-repo <path>): the abort fired. The "No measurable headroom" panel rendered (band=no_headroom, holdout=0.987, closed-loop=1.000), followed by "Non-interactive context; refusing to proceed. Pass --force-saturation-check to override."; exit 0; zero GEPA spend (no "Running GEPA optimization" output); hermes-agent checkout verified clean.
--no-saturation-check skips the preflight: the log went directly from "Validating baseline description" to "Running GEPA optimization" (max_full_evals=1) with no saturation panel; valset stayed at 39 examples (no behavioral injection since CL wasn't enabled); exit 0.
--force-saturation-check against the same saturated baseline: the panel rendered ("No measurable headroom", identical content to the manual smoke above), then "Running GEPA optimization" proceeded; GEPA finished its no-op pass and the gate produced its normal "Evolution did not improve" line; exit 0.
Cache reuse: targeted run of test_cache_reuse_skips_baseline_re_eval_after_gepa in both pipelines — 2/2 passed in 1.55s. Asserts _holdout_evaluate_with_metric call count == 1 when the preflight ran, vs. the call count of 2 when --no-saturation-check is set.
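The cache-reuse behavior asserted above can be illustrated with a minimal sketch. It is not the project's code: `compare_on_holdout`, its `evaluate` callback, and the `cached_baseline_scores` parameter are hypothetical names, and the sketch only models the post-GEPA comparison site, where a cached baseline should mean one evaluation call instead of two.

```python
from typing import Callable, Optional, Sequence


def mean(scores: Sequence[float]) -> float:
    """Arithmetic mean of per-example scores."""
    return sum(scores) / len(scores)


def compare_on_holdout(
    evaluate: Callable[[str], Sequence[float]],
    cached_baseline_scores: Optional[Sequence[float]] = None,
) -> tuple[float, float]:
    """Return (baseline_mean, evolved_mean) for the final comparison.

    Hypothetical helper: `evaluate` stands in for a holdout evaluation
    function and takes the name of the candidate to score.
    """
    if cached_baseline_scores is None:
        # No preflight ran, so the baseline must be scored here.
        baseline_scores = evaluate("baseline")
    else:
        # Preflight already scored the baseline; reuse its per-example scores.
        baseline_scores = cached_baseline_scores
    evolved_scores = evaluate("evolved")
    return mean(baseline_scores), mean(evolved_scores)
```

With a cache, `evaluate` is called once (for the evolved candidate only); without it, twice. That is the same 1-versus-2 call-count contrast the targeted test checks.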

…ache Adds evolution/core/saturation_check.py mirroring auth_check's shape: pure saturation_preflight() returns a SaturationReport classifying the baseline into healthy / no_headroom / weak_signal / uniform_failure. Call sites in evolve_skill / evolve_tool will render a Rich panel and decide whether to prompt or default-deny (next two commits). Also adds ClosedLoopFeedbackCache.force_run: bypasses should_run() and propagates validator exceptions (unlike get_or_run which swallows the expected ones to keep GEPA going). Preflight needs to fire the validator once at startup, before any judge scores have been recorded, which is when get_or_run would return None in sampled mode. Pure helpers; no wiring yet. Wiring lands in feat(skills) and feat(tools) follow-ups.

After the synthetic dataset builds and the baseline module / metric / closed_loop_cache are constructed (and before GEPA setup), the framework now runs the saturation preflight from feat(core). Two new flags: --no-saturation-check (skip entirely) and --force-saturation-check (run + render but bypass the abort/prompt). Default UX in interactive contexts is warn+confirm; in non-interactive contexts (no TTY on stdin), non-healthy bands exit cleanly with a "use --force-saturation-check" hint. The baseline holdout per-example scores from the preflight are stashed and reused at the post-GEPA holdout-comparison call site, so the baseline isn't re-scored at run end. Net cost: ~zero. Closes the "doomed runs spend GEPA budget before any signal" gap documented in reports/pareto_frontier_feasibility.md spike #2.

Symmetric to the evolve_tool wiring from the previous commit. After the synthetic dataset builds and baseline_module / metric / closed_loop_cache are constructed (and before GEPA setup), the framework runs saturation_preflight; non-healthy bands prompt (interactive) or default-deny (non-interactive) with a --force-saturation-check override. Baseline holdout per-example scores from the preflight are reused at the post-GEPA holdout-comparison call site to keep net cost ~zero. The per-candidate _holdout_evaluate_with_metric inside the knee-point loop is deliberately untouched — only the final baseline-vs-evolved comparison reuses the cached scores. Completes Path F across both pipelines.
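The prompt / default-deny behavior described in the two wiring commits can be sketched as a small decision function. This is an illustration under assumptions, not the code in evolve_skill or evolve_tool: `Action`, `decide`, and `stdin_is_interactive` are hypothetical names, and --no-saturation-check is not modeled because it skips the preflight before any decision is made.

```python
import sys
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"  # continue into GEPA setup
    PROMPT = "prompt"    # warn and ask the user to confirm
    DENY = "deny"        # refuse and hint at --force-saturation-check


def stdin_is_interactive() -> bool:
    """True when stdin is attached to a TTY (assumed interactivity test)."""
    return sys.stdin.isatty()


def decide(band: str, *, force: bool, interactive: bool) -> Action:
    """Map a saturation band plus context to the next step.

    `force` mirrors --force-saturation-check: the panel still renders,
    but the abort/prompt is bypassed.
    """
    if band == "healthy" or force:
        return Action.PROCEED
    return Action.PROMPT if interactive else Action.DENY
```

A call site would pass `interactive=stdin_is_interactive()`; keeping the TTY probe out of `decide` makes the decision table testable without a terminal.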

…e tests

Two follow-ups from the final code review of the Path F branch:

1. ClosedLoopFeedbackCache.force_run was resetting _iters_since_last_run to 0, eating the "allow first fire" allowance that __init__ sets up (= min_iters). In sampled gate_mode this delayed the first GEPA-time closed-loop fire by min_iters iterations. Now force_run preserves the allowance so subsequent get_or_run calls fire as originally designed. Tests confirm should_run() still returns True after a force_run when judge history is empty.
2. Added integration tests for both evolve_tool and evolve_skill that verify the cache-reuse mechanism: when the saturation preflight runs and populates the cached baseline holdout scores, the post-GEPA evaluation site reuses them instead of re-running the baseline eval. This locks in the "net cost ~zero" correctness claim.
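The first-fire allowance in item 1 can be shown with a toy model. `ToyGateCache` is a hypothetical stand-in, not ClosedLoopFeedbackCache: it models only the iteration counter and deliberately omits the judge-score saturation gate, and treating RuntimeError as the "expected" swallowed error is an assumption made for illustration.

```python
from typing import Callable, Optional


class ToyGateCache:
    """Toy model of the iteration counter only (no judge-score gating)."""

    def __init__(self, min_iters: int) -> None:
        self.min_iters = min_iters
        # Start at min_iters so the very first get_or_run is allowed to fire.
        self._iters_since_last_run = min_iters

    def should_run(self) -> bool:
        return self._iters_since_last_run >= self.min_iters

    def get_or_run(self, validator: Callable[[], str]) -> Optional[str]:
        """Gated call: fires only when allowed and swallows expected errors."""
        if not self.should_run():
            self._iters_since_last_run += 1
            return None
        self._iters_since_last_run = 0
        try:
            return validator()
        except RuntimeError:  # assumed "expected" error type
            return None

    def force_run(self, validator: Callable[[], str]) -> str:
        """Ungated call: bypasses should_run() and lets exceptions propagate.

        The counter is left untouched, so the first-fire allowance survives
        for the next get_or_run caller.
        """
        return validator()
```

Had `force_run` reset the counter to 0, the next `get_or_run` would return None until `min_iters` further calls had elapsed, which is the delay the fix removes.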

…lose case

The default thresholds shipped in feat(core) were too strict for the case Path F was built to catch. Spike #1 in the feasibility report documented synthetic=0.987 + closed-loop=1.0 for the saturated write_file baseline. GEPA can't improve on that, but the strict AND (synthetic ≥ 0.99) gate let it through as healthy. The realtime smoke from the merge-readiness check confirmed: preflight ran, both scores looked right, classifier returned healthy, GEPA burned 155 no-op iterations.

Refined no_headroom logic:

- (synthetic ≥ 0.99 AND no CL signal): unchanged, judge alone pegged
- (CL ≥ 0.95 AND synthetic ≥ weak_syn=0.95): NEW, both signals effectively pegged

The synthetic_close gate on the new clause keeps (synthetic=0.5, CL=1.0) classified as healthy: that scenario means there's real judge signal to optimize over (or the eval is misconfigured) and should not auto-abort. Two new tests pin both the smoke case and the edge case.
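The two no_headroom clauses can be written out as a predicate. This is a sketch built only from the thresholds quoted in this commit message; the function name, the constant names, and the use of `None` for "no CL signal" are assumptions, and the weak_signal / uniform_failure bands are left out because their thresholds are not stated here.

```python
from typing import Optional

# Thresholds as quoted in the commit message; names are illustrative.
STRICT_SYNTHETIC = 0.99    # judge alone pegged
WEAK_SYNTHETIC = 0.95      # "weak_syn": synthetic close enough to pegged
CLOSED_LOOP_PEGGED = 0.95  # closed-loop effectively pegged


def is_no_headroom(synthetic: float, closed_loop: Optional[float]) -> bool:
    """True when the baseline leaves no measurable room to improve.

    `closed_loop` is None when no closed-loop signal is available.
    """
    if closed_loop is None:
        # Clause 1: only the synthetic judge score exists, and it is pegged.
        return synthetic >= STRICT_SYNTHETIC
    # Clause 2: both signals are effectively pegged.
    return closed_loop >= CLOSED_LOOP_PEGGED and synthetic >= WEAK_SYNTHETIC
```

Under these assumptions the saturated write_file smoke case (synthetic 0.987, closed-loop 1.0) lands in no_headroom, while (synthetic 0.5, closed-loop 1.0) does not.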

Updates across the docs/ knowledge base, AGENTS.md, README.md, and PLAN.md to reflect the new saturation pre-flight feature:

- architecture.md: top-level flow now shows the pre-flight + abort path; new design pattern #10 separates the pre-flight (a "should we even start" decision) from the deploy gate (a "did we improve" decision).
- components.md: new saturation_check.py section documenting the band classifier logic + public surface; force_run added to the ClosedLoopFeedbackCache surface.
- data_models.md: new SaturationReport dataclass entry.
- workflows.md: Workflow 1 gets a Phase B.5 mermaid for the pre-flight; Phase D's holdout step shows the cache-reuse branch.
- interfaces.md: --no-saturation-check + --force-saturation-check added to both skill and tool flag tables.
- index.md: new routing entry, new cross-cutting topic, refreshed test count (681 → 1076), maintenance-note entry for the default thresholds (likely to be calibrated).
- codebase_info.md: saturation_check.py added to layout + LOC table; test count refreshed.
- framework_advantages.md: new "Saturation pre-flight that refuses to spend budget on hopeless runs" section, positioned as a framework advantage over raw GEPA.
- AGENTS.md: 5-line run summary updated; component map adds saturation_check.py; planned/deferred section gets a Path D/E/C entry pointing at the feasibility report.
- README.md: new "Saturation pre-flight" section in the Safety knobs area with example panel output.
- PLAN.md: deviation #8 gets a follow-up paragraph noting that Path F addresses the user-visible symptom but not the underlying acceptance-gate mechanism.

No source files touched.

…gration tests

Two of the new integration tests reached the real synthetic dataset generator before the (mocked) saturation_preflight, so CI runs with a fake OPENAI_API_KEY died on AuthError before the code under test ever executed:

- test_saturated_band_non_interactive_aborts (both pipelines)
- test_cache_reuse_skips_baseline_re_eval_after_gepa (both pipelines)

Add a SyntheticDatasetBuilder mock that returns a small list/EvalDataset of fake EvalExamples (no LM calls). Skill-side fake dataset is sized to 50 examples (30/10/10) so the holdout ≥ EvolutionConfig.min_holdout_size guard doesn't trip before reaching the preflight.

Verified locally by running the test files under env -i ... OPENAI_API_KEY=sk-fake-test-key uv run pytest ... to match the CI environment: all 10 saturation-preflight tests pass, full suite still at 1076.

The other 3 tests in each file (test_no_saturation_check_flag_skips_helper, test_healthy_band_does_not_prompt, test_force_saturation_check_overrides_abort) "pass" in CI for the wrong reason: their assertions are satisfied even when the run dies on AuthError before reaching the wiring under test. Worth tightening in a follow-up; not blocking this fix.
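The mocking approach can be sketched with unittest.mock. Everything below is a self-contained toy: this `SyntheticDatasetBuilder` is a hypothetical stand-in that raises instead of calling an LM, `run_pipeline` is an invented miniature of the flow, and the `min_holdout_size=10` default is an assumed value chosen so that a 30/10/10 split passes the guard.

```python
from unittest.mock import patch


class SyntheticDatasetBuilder:
    """Hypothetical stand-in: the real builder would call an LM."""

    def build(self, n: int) -> list[dict]:
        raise RuntimeError("would call the LM API (AuthError with a fake key)")


def run_pipeline(min_holdout_size: int = 10) -> str:
    """Toy flow: build dataset, split 30/10/10, check holdout, reach preflight."""
    examples = SyntheticDatasetBuilder().build(50)
    holdout = examples[40:]  # train = [:30], val = [30:40], holdout = [40:]
    if len(holdout) < min_holdout_size:
        return "holdout too small"
    return "reached preflight"


def fake_build(self: SyntheticDatasetBuilder, n: int) -> list[dict]:
    """Return n fake examples without any LM call."""
    return [{"task": f"task-{i}"} for i in range(n)]


def run_with_mocked_builder() -> str:
    # Patch the class attribute so every instance uses the fake builder.
    with patch.object(SyntheticDatasetBuilder, "build", fake_build):
        return run_pipeline()
```

Without the patch, `run_pipeline()` raises before the preflight step is reached, which is the toy analogue of the CI failure described above.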

…ning

Addresses six items from the PR review:

1. Non-interactive deny now exits 3 (was 0). A scheduled / CI / cron wrapper couldn't previously distinguish "refused to run because no TTY" from "ran cleanly". Interactive user-said-no still exits 0 (success-by-intent). The integration test asserts the new code.
2. ClosedLoopFeedbackCache registers weakref.finalize for its tmp dir so SystemExit (the saturation abort path) triggers cleanup instead of leaking dirs into /tmp for the OS reaper to handle 3+ days later. Updated the class docstring to match.
3. saturation_preflight's docstring no longer claims "Pure: no side effects": it runs LM eval, may run a validator subprocess, and mutates the cache. The actual property is "doesn't render, prompt, or exit" (call sites own those), and the docstring now says exactly that.
4. force_run's docstring spells out the _iters_since_last_run = min_iters contract (preserving the first-fire allowance for downstream get_or_run callers). Inline comment on the __init__ assignment anchors the invariant in both places so a future "cleanup" can't silently regress the fix to 0.
5. interactive_confirm's docstring acknowledges the EOFError branch the code already catches (not just KeyboardInterrupt).
6. Tightened 2 vacuous CLI tests that previously passed even when production was mutated to ignore the flags they claimed to test: test_force_saturation_check_overrides_abort and test_healthy_band_does_not_prompt now assert GEPA was actually instantiated. Both add the SyntheticDatasetBuilder / select_knee_point / _holdout_evaluate_with_metric mock chain so the run flows through the production code instead of dying on AuthError at dataset gen. Added a new test_user_declines_at_prompt_aborts in both pipelines covering the previously-untested "Aborted by user." branch.

77 saturation-related tests pass, full suite at 1078 (was 1076).
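The tmp-dir cleanup in item 2 relies on a standard-library pattern that can be shown in isolation. `TmpDirOwner` is a hypothetical class, not ClosedLoopFeedbackCache; the sketch assumes the directory is created with tempfile.mkdtemp and removed with shutil.rmtree.

```python
import shutil
import tempfile
import weakref
from pathlib import Path


class TmpDirOwner:
    """Owns a temp directory that is removed when the owner goes away."""

    def __init__(self) -> None:
        self.tmp_dir = Path(tempfile.mkdtemp(prefix="toy-cache-"))
        # The callback must not reference `self`, or the object could never
        # be collected; pass the path as a plain string instead.
        # Finalizers registered this way also run at interpreter exit,
        # which covers an exit via SystemExit.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, str(self.tmp_dir), ignore_errors=True
        )

    def close(self) -> None:
        """Run cleanup now; calling it more than once is harmless."""
        self._finalizer()
```

Calling `close()` explicitly is still the tidy path; the finalizer is the safety net for exits that skip it.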

jramos added 8 commits May 21, 2026 10:57

jramos merged commit b4e3a28 into main May 22, 2026
4 checks passed

jramos deleted the path-f-saturation-preflight branch May 22, 2026 02:46

jramos mentioned this pull request May 22, 2026

feat: --gepa-minibatch-size CLI flag (Path E) #65

Merged

4 tasks

jramos merged 8 commits into main from path-f-saturation-preflight
