feat: saturation pre-flight for evolve_skill and evolve_tool #64
Merged
…ache

Adds evolution/core/saturation_check.py mirroring auth_check's shape: pure saturation_preflight() returns a SaturationReport classifying the baseline into healthy / no_headroom / weak_signal / uniform_failure. Call sites in evolve_skill / evolve_tool will render a Rich panel and decide whether to prompt or default-deny (next two commits).

Also adds ClosedLoopFeedbackCache.force_run: bypasses should_run() and propagates validator exceptions (unlike get_or_run, which swallows the expected ones to keep GEPA going). Preflight needs to fire the validator once at startup, before any judge scores have been recorded, which is when get_or_run would return None in sampled mode.

Pure helpers; no wiring yet. Wiring lands in feat(skills) and feat(tools) follow-ups.
After the synthetic dataset builds and the baseline module / metric / closed_loop_cache are constructed (and before GEPA setup), the framework now runs the saturation preflight from feat(core). Two new flags: --no-saturation-check (skip entirely) and --force-saturation-check (run + render but bypass the abort/prompt). Default UX in interactive contexts is warn+confirm; in non-interactive contexts (no TTY on stdin), non-healthy bands exit cleanly with a "use --force-saturation-check" hint. The baseline holdout per-example scores from the preflight are stashed and reused at the post-GEPA holdout-comparison call site, so the baseline isn't re-scored at run end. Net cost: ~zero. Closes the "doomed runs spend GEPA budget before any signal" gap documented in reports/pareto_frontier_feasibility.md spike #2.
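The flag and TTY behaviour described above can be summarised as a small decision function. This is a minimal sketch, not the code from the PR: `Action`, `preflight_action`, and the parameter names are hypothetical, and it assumes the band is already known as a string.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"        # --no-saturation-check: preflight is not run at all
    PROCEED = "proceed"  # continue to GEPA setup
    PROMPT = "prompt"    # interactive: warn and ask the user to confirm
    DENY = "deny"        # non-interactive: refuse and print the override hint


def preflight_action(band: str, *, no_check: bool, force: bool, stdin_is_tty: bool) -> Action:
    """Map the band plus the two flags to what the call site should do next."""
    if no_check:
        return Action.SKIP
    # With --force-saturation-check the panel is still rendered by the caller;
    # only the abort/prompt step is bypassed.
    if band == "healthy" or force:
        return Action.PROCEED
    return Action.PROMPT if stdin_is_tty else Action.DENY
```

A call site would pass something like `stdin_is_tty=sys.stdin.isatty()` and own the rendering, prompting, and exiting itself.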
Symmetric to the evolve_tool wiring from the previous commit. After the synthetic dataset builds and baseline_module / metric / closed_loop_cache are constructed (and before GEPA setup), the framework runs saturation_preflight; non-healthy bands prompt (interactive) or default-deny (non-interactive) with a --force-saturation-check override. Baseline holdout per-example scores from the preflight are reused at the post-GEPA holdout-comparison call site to keep net cost ~zero. The per-candidate _holdout_evaluate_with_metric inside the knee-point loop is deliberately untouched — only the final baseline-vs-evolved comparison reuses the cached scores. Completes Path F across both pipelines.
…e tests

Two follow-ups from the final code review of the Path F branch:

1. ClosedLoopFeedbackCache.force_run was resetting _iters_since_last_run to 0, eating the "allow first fire" allowance that __init__ sets up (= min_iters). In sampled gate_mode this delayed the first GEPA-time closed-loop fire by min_iters iterations. Now force_run preserves the allowance so subsequent get_or_run calls fire as originally designed. Tests confirm should_run() still returns True after a force_run when judge history is empty.

2. Added integration tests for both evolve_tool and evolve_skill that verify the cache-reuse mechanism: when the saturation preflight runs and populates the cached baseline holdout scores, the post-GEPA evaluation site reuses them instead of re-running the baseline eval. This locks in the "net cost ~zero" correctness claim.
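To illustrate the first follow-up, here is a minimal sketch of a cache whose `force_run` leaves the first-fire allowance intact. `SketchCache` is a hypothetical stand-in, not the real ClosedLoopFeedbackCache: it assumes `should_run()` depends only on an iteration counter and that `RuntimeError` is the "expected" validator failure, both of which are simplifications.

```python
from typing import Callable, Optional


class SketchCache:
    """Hypothetical stand-in for ClosedLoopFeedbackCache; not the real class."""

    def __init__(self, validator: Callable[[], float], min_iters: int = 3) -> None:
        self._validator = validator
        self._min_iters = min_iters
        # Start at min_iters so the very first get_or_run is allowed to fire.
        self._iters_since_last_run = min_iters
        self.last_result: Optional[float] = None

    def should_run(self) -> bool:
        return self._iters_since_last_run >= self._min_iters

    def get_or_run(self) -> Optional[float]:
        if not self.should_run():
            self._iters_since_last_run += 1
            return None
        try:
            self.last_result = self._validator()
        except RuntimeError:
            # Expected failures are swallowed so the optimizer keeps going.
            return None
        self._iters_since_last_run = 0
        return self.last_result

    def force_run(self) -> float:
        # Bypass should_run(), let validator exceptions propagate, and leave
        # the counter alone so the first-fire allowance survives the preflight.
        self.last_result = self._validator()
        return self.last_result
```

Under these assumptions, calling `force_run()` once at startup still leaves `should_run()` true, so the first `get_or_run()` during optimization fires immediately rather than after `min_iters` iterations.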
…lose case

The default thresholds shipped in feat(core) were too strict for the case Path F was built to catch. Spike #1 in the feasibility report documented synthetic=0.987 + closed-loop=1.0 for the saturated write_file baseline. GEPA can't improve on that, but the strict AND (synthetic ≥ 0.99) gate let it through as healthy. The realtime smoke from the merge-readiness check confirmed: preflight ran, both scores looked right, classifier returned healthy, GEPA burned 155 no-op iterations.

Refined no_headroom logic:

- (synthetic ≥ 0.99 AND no CL signal): unchanged, judge alone pegged
- (CL ≥ 0.95 AND synthetic ≥ weak_syn=0.95): NEW, both signals effectively pegged

The synthetic_close gate on the new clause keeps (synthetic=0.5, CL=1.0) classified as healthy. That scenario means there's real judge signal to optimize over (or the eval is misconfigured) and should not auto-abort.

Two new tests pin both the smoke case and the edge case.
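The two no_headroom clauses can be written out as a predicate. This is a sketch under stated assumptions: the function and constant names are hypothetical, "no CL signal" is modelled as `closed_loop is None`, and the rules for the other bands are not covered because the commit message does not state them.

```python
from typing import Optional

# Threshold values taken from the commit message; names are illustrative.
SYNTHETIC_PEGGED = 0.99    # judge alone pegged
CLOSED_LOOP_PEGGED = 0.95  # closed-loop suite effectively pegged
SYNTHETIC_CLOSE = 0.95     # "weak_syn": synthetic close enough to pegged


def is_no_headroom(synthetic: float, closed_loop: Optional[float]) -> bool:
    """Return True when either no_headroom clause matches."""
    if closed_loop is None:
        # Clause 1: no closed-loop signal, so the judge score must be pegged.
        return synthetic >= SYNTHETIC_PEGGED
    # Clause 2: both signals effectively pegged.
    return closed_loop >= CLOSED_LOOP_PEGGED and synthetic >= SYNTHETIC_CLOSE
```

With these thresholds the smoke case (synthetic 0.987, closed-loop 1.0) matches the second clause, while (synthetic 0.5, closed-loop 1.0) does not.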
Updates across the docs/ knowledge base, AGENTS.md, README.md, and PLAN.md to reflect the new saturation pre-flight feature:

- architecture.md: top-level flow now shows the pre-flight + abort path; new design pattern #10 separates the pre-flight (a "should we even start" decision) from the deploy gate (a "did we improve" decision).
- components.md: new saturation_check.py section documenting the band classifier logic + public surface; force_run added to the ClosedLoopFeedbackCache surface.
- data_models.md: new SaturationReport dataclass entry.
- workflows.md: Workflow 1 gets a Phase B.5 mermaid for the pre-flight; Phase D's holdout step shows the cache-reuse branch.
- interfaces.md: --no-saturation-check + --force-saturation-check added to both skill and tool flag tables.
- index.md: new routing entry, new cross-cutting topic, refreshed test count (681 → 1076), maintenance-note entry for the default thresholds (likely to be calibrated).
- codebase_info.md: saturation_check.py added to layout + LOC table; test count refreshed.
- framework_advantages.md: new "Saturation pre-flight that refuses to spend budget on hopeless runs" section, positioned as a framework advantage over raw GEPA.
- AGENTS.md: 5-line run summary updated; component map adds saturation_check.py; planned/deferred section gets a Path D/E/C entry pointing at the feasibility report.
- README.md: new "Saturation pre-flight" section in the Safety knobs area with example panel output.
- PLAN.md: deviation #8 gets a follow-up paragraph noting that Path F addresses the user-visible symptom but not the underlying acceptance-gate mechanism.

No source files touched.
…gration tests

Two of the new integration tests reached the real synthetic dataset generator before the (mocked) saturation_preflight, so CI runs with a fake OPENAI_API_KEY died on AuthError before the code under test ever executed:

- test_saturated_band_non_interactive_aborts (both pipelines)
- test_cache_reuse_skips_baseline_re_eval_after_gepa (both pipelines)

Add a SyntheticDatasetBuilder mock that returns a small list/EvalDataset of fake EvalExamples (no LM calls). Skill-side fake dataset is sized to 50 examples (30/10/10) so the holdout ≥ EvolutionConfig.min_holdout_size guard doesn't trip before reaching the preflight.

Verified locally by running the test files under env -i ... OPENAI_API_KEY=sk-fake-test-key uv run pytest ... to match the CI environment: all 10 saturation-preflight tests pass, full suite still at 1076.

The other 3 tests in each file (test_no_saturation_check_flag_skips_helper, test_healthy_band_does_not_prompt, test_force_saturation_check_overrides_abort) "pass" in CI for the wrong reason: their assertions are satisfied even when the run dies on AuthError before reaching the wiring under test. Worth tightening in a follow-up; not blocking this fix.
…ning

Addresses six items from the PR review:

1. Non-interactive deny now exits 3 (was 0). A scheduled / CI / cron wrapper couldn't previously distinguish "refused to run because no TTY" from "ran cleanly". Interactive user-said-no still exits 0 (success-by-intent). The integration test asserts the new code.

2. ClosedLoopFeedbackCache registers weakref.finalize for its tmp dir so SystemExit (the saturation abort path) triggers cleanup instead of leaking dirs into /tmp for the OS reaper to handle 3+ days later. Updated the class docstring to match.

3. saturation_preflight's docstring no longer claims "Pure: no side effects". It runs an LM eval, may run a validator subprocess, and mutates the cache. The actual property is "doesn't render, prompt, or exit" (call sites own those), and the docstring now says exactly that.

4. force_run's docstring spells out the _iters_since_last_run = min_iters contract (preserving the first-fire allowance for downstream get_or_run callers). An inline comment on the __init__ assignment anchors the invariant in both places so a future "cleanup" can't silently regress the fix to 0.

5. interactive_confirm's docstring acknowledges the EOFError branch the code already catches (not just KeyboardInterrupt).

6. Tightened 2 vacuous CLI tests that previously passed even when production was mutated to ignore the flags they claimed to test: test_force_saturation_check_overrides_abort and test_healthy_band_does_not_prompt now assert GEPA was actually instantiated. Both add the SyntheticDatasetBuilder / select_knee_point / _holdout_evaluate_with_metric mock chain so the run flows through the production code instead of dying on AuthError at dataset gen. Added a new test_user_declines_at_prompt_aborts in both pipelines covering the previously-untested "Aborted by user." branch.

77 saturation-related tests pass, full suite at 1078 (was 1076).
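Item 2 relies on a standard-library behaviour: a `weakref.finalize` callback runs when its owner is garbage-collected or, by default, at interpreter exit, which includes exits triggered by SystemExit. A minimal sketch of that pattern follows; `TmpDirOwner` and its directory prefix are hypothetical and do not reproduce the real cache class.

```python
import shutil
import tempfile
import weakref
from pathlib import Path


class TmpDirOwner:
    """Hypothetical owner of a scratch directory; not the real cache class."""

    def __init__(self) -> None:
        self.path = Path(tempfile.mkdtemp(prefix="sketch-cache-"))
        # The callback must not reference self, or the object could never be
        # collected. The finalizer fires on collection or at interpreter exit.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, str(self.path), ignore_errors=True
        )

    def close(self) -> None:
        # Explicit cleanup; a finalizer runs at most once, so repeated calls
        # are harmless.
        self._finalizer()
```

The design choice here is that cleanup does not depend on a `finally` block at every exit path, so an abort raised deep inside a CLI command still removes the directory.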
Summary

- `evolution/core/saturation_check.py` module + `--no-saturation-check` / `--force-saturation-check` flags on both `evolve_skill` and `evolve_tool`. Scores the baseline on the holdout (and the closed-loop suite, if `--closed-loop-during-evolution` is set) BEFORE GEPA starts, classifies into `healthy` / `no_headroom` / `weak_signal` / `uniform_failure`, then prompts (interactive context) or default-denies (non-interactive: no TTY on stdin) on non-`healthy` bands. The preflight's baseline holdout scores are reused at the post-GEPA evaluation site, so net cost stays roughly zero.
- `ClosedLoopFeedbackCache.force_run` method: bypasses the saturation gate (which returns `None` at preflight time in default `sampled` mode, before any judge scores have been recorded) and propagates validator errors instead of swallowing them. Preserves the first-fire allowance for downstream `get_or_run` callers.
- The case where both signals are effectively pegged now classifies as `no_headroom` rather than `healthy`: closed-loop ≥ 0.95 plus synthetic ≥ 0.95 → `no_headroom`. End-to-end smoke against the known-saturated `write_file` baseline (synthetic 0.987, closed-loop 1.000) now aborts cleanly with the panel instead of running 155 no-op GEPA iterations.

Test plan

1. `uv run pytest -q`: 1076 passed in 101s (up from 1063 pre-branch, +13 new tests).
2. End-to-end smoke (`evolve_tool --tool write_file` with default Hermes config + `--closed-loop-during-evolution evolution/validation/suites/write_file.jsonl --closed-loop-hermes-repo <path>`): abort fired. `No measurable headroom` panel rendered (band=no_headroom, holdout=0.987, closed-loop=1.000), followed by `Non-interactive context; refusing to proceed. Pass --force-saturation-check to override.`, exit 0; zero GEPA spend (no "Running GEPA optimization" output); hermes-agent checkout verified clean.
3. `--no-saturation-check` skips the preflight: log went directly from `Validating baseline description` to `Running GEPA optimization (max_full_evals=1)` with no saturation panel; valset stayed at 39 examples (no behavioral injection since CL wasn't enabled); exit 0.
4. `--force-saturation-check` against the same saturated baseline: panel rendered (`No measurable headroom`, identical content to test 2), then `Running GEPA optimization` proceeded; GEPA finished its no-op pass and the gate produced its normal "Evolution did not improve" line; exit 0.
5. `test_cache_reuse_skips_baseline_re_eval_after_gepa` in both pipelines: 2/2 passed in 1.55s. Asserts `_holdout_evaluate_with_metric` call count == 1 when the preflight ran, vs. a call count of 2 when `--no-saturation-check` is set.
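The call-count assertion in the last test-plan item can be sketched with `unittest.mock`. Everything here is illustrative: `compare_on_holdout` and `count_evaluate_calls` are hypothetical names, and the sketch assumes the preflight obtains the baseline scores through a separate path, so that only the post-GEPA comparison calls the mocked evaluator.

```python
from typing import Callable, List, Optional, Tuple
from unittest.mock import Mock


def compare_on_holdout(
    evaluate: Callable[[str], List[float]],
    cached_baseline: Optional[List[float]],
) -> Tuple[List[float], List[float]]:
    """Score the evolved module; reuse preflight baseline scores if present."""
    evolved = evaluate("evolved")
    if cached_baseline is not None:
        baseline = cached_baseline        # preflight ran: no baseline re-eval
    else:
        baseline = evaluate("baseline")   # preflight skipped: score it now
    return baseline, evolved


def count_evaluate_calls(cached_baseline: Optional[List[float]]) -> int:
    """Run the comparison against a mock evaluator and report its call count."""
    evaluate = Mock(side_effect=lambda name: [1.0, 0.5])
    compare_on_holdout(evaluate, cached_baseline)
    return evaluate.call_count
```

In this sketch `count_evaluate_calls([0.9, 0.4])` corresponds to the preflight-ran case and `count_evaluate_calls(None)` to the `--no-saturation-check` case.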