test: weakened write_file fixture + ambiguous-task suite for Path E retro-validation #68
Merged
Two fixtures that together create a stable weak_signal baseline for the
saturation pre-flight, enabling closed-loop retro-validation of the
--gepa-minibatch-size flag against an honest behavioral signal.
evolution/validation/suites/write_file_ambiguous.jsonl
7 closed-loop tasks where the right tool choice depends on
understanding write_file's wholesale-overwrite semantics. Tasks
populate existing files via fixture_setup so write_file would WIPE
content if used wrongly on a targeted-edit task.
tests/fixtures/tool_manifests/weakened_write_file_manifest.json
A 24-char write_file description ("Write content to a file.") plus a
near-production patch description as the confusable neighbor. The
weakening omits the wholesale-overwrite cue.
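The failure mode the suite targets can be illustrated with a minimal sketch. The two helpers below are hypothetical stand-ins, not the project's tool implementations, and the file contents are invented for illustration. They show only why a whole-file write on a targeted-edit task destroys pre-populated content.

```python
import tempfile
from pathlib import Path


def write_file(path: Path, content: str) -> None:
    """Hypothetical stand-in: replace the whole file with `content`."""
    path.write_text(content)


def patch(path: Path, old: str, new: str) -> None:
    """Hypothetical stand-in: replace one occurrence of `old`, keep the rest."""
    text = path.read_text()
    if old not in text:
        raise ValueError(f"{old!r} not found in {path}")
    path.write_text(text.replace(old, new, 1))


def run_targeted_edit(tool: str) -> str:
    """Apply 'change port 8080 to 9090' with the given tool; return the file."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "config.ini"
        # Plays the role of fixture_setup: the file exists before the task runs.
        target.write_text("host = localhost\nport = 8080\n")
        if tool == "patch":
            patch(target, "port = 8080", "port = 9090")
        else:
            # A caller unaware of overwrite semantics sends only the new line.
            write_file(target, "port = 9090\n")
        return target.read_text()


print(repr(run_targeted_edit("patch")))       # 'host = localhost\nport = 9090\n'
print(repr(run_targeted_edit("write_file")))  # 'port = 9090\n'
```

With the patch-style edit the `host` line survives; with the whole-file write it is gone, which is the outcome a targeted-edit task would score as a failure.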
Probe-loop results (gpt-5-mini validator, openai/gpt-5-mini routed
correctly post-PR-66):
Strong description + ambiguous suite, seed 42 → CL 0.714, weak_signal
Weakened fixture + ambiguous suite, seed 42 → CL 0.714, weak_signal
Weakened fixture + ambiguous suite, seed 43 → CL 0.714, weak_signal
Both descriptions land at 5/7 with the new suite, which means GEPA
proposals have real headroom (CL can move toward 6/7 or 7/7). The
synthetic judge stays saturated; closed-loop is the only gradient.
jramos added a commit that referenced this pull request, May 23, 2026
* feat(quality_gate): add _check_cl_primary_gate helper

  Pure function returning a ConstraintResult for the closed-loop-primary deploy decision. Used when saturation pre-flight reports weak_signal band. Required gain scales with description growth, mirroring the synthetic gate's free_threshold + slope shape; synthetic regression tolerance of 0.05 protects against catastrophic judge collapse.

  11 unit tests cover the decision-rule math including the PR #68 calibration point (+2 gain on +121% growth -> required 2, just passes) and wallpaper protection (+1 gain on +400% growth -> required 4, fails).

* refactor(evolve_tool): preserve SaturationReport fields for deploy gate

  Today only sat_report.holdout_per_example survives past the preflight call site; subsequent CL-aware gate work needs the band classification and baseline CL per-task scores too. Bind four new locals next to the existing cache: band, cl_per_example, holdout_score, cl_score. All default to None on the --no-saturation-check path so the deploy gate can branch safely.

  No behavior change; existing tests pass unchanged.

* feat(evolve_tool): branch deploy gate on saturation band

  When preflight reports weak_signal AND closed-loop is configured, run a one-shot force_run on the evolved description and gate the deploy decision on closed-loop signal via _check_cl_primary_gate.

  Three abort paths are written to gate_decision.json with diagnostic payloads (schema v5):
  - cl_eval_failed: force_run raised an exception
  - cl_eval_incomplete: one or more evolved CL tasks abstained (runner errored; distinguished from genuine task failure via the existing TaskResult.abstained field)
  - cl_primary_gate reject: returned by the gate helper itself

  _check_absolute_char_ceiling is preserved in the CL-primary path; wallpaper protection is orthogonal to which signal we gate on.

  All other bands (healthy / no_headroom / uniform_failure / no preflight) fall through to the existing synthetic path unchanged.

* fix(evolve_tool): narrow cl_constraint type, surface saved-variant path

  Code review found two minor issues in the CL-primary branch added by ae1fea6:
  1. cl_constraint: Optional[ConstraintResult] flows into a list[ConstraintResult] without type narrowing at the post-branch growth_constraints assignment. Added an assert so the type checker sees the correlation between the two 'if use_cl_primary:' blocks.
  2. Both new abort paths wrote evolved_FAILED.json but skipped the 'Saved failed variant to {path}' console line that existing abort paths print. Operators triaging a flake need to know the file was saved and where; added the print to both new paths.

  No behavior change for any test.

* feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields

  Schema bumps from v4 to v5 across all four gate_decision write sites (static-fail, cl_eval_failed, cl_eval_incomplete, success/reject). The bump is additive.

  New fields are present only when use_cl_primary == True: decision_signal, baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model.

  When preflight was skipped (--no-saturation-check), records reason_synthetic: 'preflight_skipped' so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

  cl_required_gain and synthetic_sanity_check reuse the CL_PRIMARY_GROWTH_SLOPE / CL_PRIMARY_GROWTH_FREE_THRESHOLD / CL_PRIMARY_SYNTH_TOLERANCE constants from quality_gate.py so the gate-decision payload can't drift from the actual gate logic.

  Existing v4 consumers see byte-identical output for synthetic-mode runs except the new decision_signal: 'synthetic' string.

* test(evolve_tool): integration tests for CL-aware deploy gate

  10 tests covering the deploy-gate branch on saturation band:
  - weak_signal triggers evolved CL eval (force_run called post-GEPA)
  - healthy/no_headroom fall through to synthetic
  - --no-saturation-check records reason_synthetic in JSON
  - all v5 fields present + correct types
  - force_run failure writes aborted decision with diagnostics
  - evolved task abstention writes cl_eval_incomplete (not regression)
  - absolute_char_ceiling still enforced in CL-primary path

  Mocks the synthetic dataset builder + closed-loop cache at the same seams as test_evolve_tool_saturation_preflight.py; calls evolve() directly (rather than via CliRunner) so each test can inspect gate_decision.json at a pinned output_dir.

* test(evolve_tool): tighten Test 1 assertion, add uniform_failure test, pin evolved_FAILED.json

  Code-review feedback on the CL-aware gate test suite:
  1. Test 1's force_run assertion was substring-based on str(call_args), which silently misses regressions where force_run is called twice or with extra kwargs. Tightened to assert_called_once_with.
  2. Added test_uniform_failure_band_falls_through_to_synthetic_gate pinning the spec edge-case (uniform_failure -> synthetic path). Without it, expanding use_cl_primary to include uniform_failure would silently change behavior without a test failing.
  3. Test 9 (cl_eval_incomplete) now asserts evolved_FAILED.json is written, mirroring Test 8's assertion on the cl_eval_failed abort path. Production writes the file on both abort paths.

* test(evolve_tool): schema v5 regression tests

  Pin the v4 → v5 additivity contract: every v4 field must still exist in v5 output, plus decision_signal (always) and the CL-specific fields (when use_cl_primary fired). Future schema bumps should add a TestSchemaV{N}Regression class following this pattern.

* fix(evolve_tool): schema v5 consistency across all gate_decision write sites

  Final-review feedback caught two seam leaks:
  1. write_cost_ceiling_abort was hard-coding schema_version=4. If the cost ceiling trips during the CL-primary force_run call, the resulting gate_decision.json had v4 in a v5 directory. Made the schema_version a keyword arg (default 4 for skill-side callers that haven't bumped yet); tool-side passes 5.
  2. The static_constraint_failure payload was bumped to v5 in Task 4 but never had decision_signal added. Every other v5 path has it. Set to 'synthetic' since static-fail fires before any CL eval.
  3. Extended TestSchemaV5Regression with abort-path coverage so the above issues couldn't have slipped through. Three new tests pin schema_version and decision_signal on cl_eval_failed, cl_eval_incomplete, and static_constraint_failure payloads.
  4. Renamed test_accepts_at_pr_68_calibration_point to test_accepts_at_24char_baseline_calibration_point per the project convention against exposing internal PR numbers in code.
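The decision rule described for the CL-primary gate can be sketched roughly as follows. This is not the project's `_check_cl_primary_gate`: the function names, the result type, and the slope and free-threshold values are assumptions, picked only so that the two calibration points quoted in the commit message (required 2 at +121% growth, required 4 at +400% growth) come out as stated. Only the 0.05 synthetic tolerance is taken from the text.

```python
import math
from dataclasses import dataclass

# Hypothetical values, chosen to reproduce the two quoted calibration points.
# The real constants live in quality_gate.py and may differ.
GROWTH_FREE_THRESHOLD = 0.0  # fractional growth allowed before scaling starts
GROWTH_SLOPE = 1.0           # tasks of gain required per 100% of growth
SYNTH_TOLERANCE = 0.05       # synthetic regression tolerance (from the text)


@dataclass
class GateSketchResult:
    """Illustrative stand-in for ConstraintResult."""
    passed: bool
    required_gain: int
    reason: str


def required_cl_gain(baseline_chars: int, evolved_chars: int) -> int:
    """Closed-loop tasks the evolved description must gain, given its growth."""
    growth = (evolved_chars - baseline_chars) / baseline_chars
    scaled = GROWTH_SLOPE * max(0.0, growth - GROWTH_FREE_THRESHOLD)
    return max(1, math.ceil(scaled))


def sketch_cl_primary_gate(
    baseline_chars: int,
    evolved_chars: int,
    cl_tasks_gained: int,
    baseline_synth: float,
    evolved_synth: float,
) -> GateSketchResult:
    need = required_cl_gain(baseline_chars, evolved_chars)
    # Sanity check: reject if the synthetic score drops past the tolerance.
    if evolved_synth < baseline_synth - SYNTH_TOLERANCE:
        return GateSketchResult(False, need, "synthetic_regression")
    if cl_tasks_gained < need:
        return GateSketchResult(False, need, "insufficient_cl_gain")
    return GateSketchResult(True, need, "ok")


# 24 -> 53 chars is about +121% growth; a +2 gain just passes.
print(sketch_cl_primary_gate(24, 53, 2, 1.0, 1.0))
# 24 -> 120 chars is +400% growth; a +1 gain fails (4 required).
print(sketch_cl_primary_gate(24, 120, 1, 1.0, 1.0))
```

The design point the sketch is meant to convey is that a longer description must buy proportionally more closed-loop wins, so padding a description cannot pass on a single lucky task.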
Summary
Adds two checked-in fixtures that together create a stable weak_signal baseline in the saturation pre-flight, so we can retro-validate the --gepa-minibatch-size flag (PR #65) against an honest closed-loop signal.
Background
PR #65 shipped Path E (--gepa-minibatch-size) with a multi-seed smoke that we later discovered was contaminated by the hermes prefix-routing bug (fixed in PR #66). After that fix, every probe against the existing write_file.jsonl suite landed at either uniform_failure (gpt-5-nano: 0/7) or no_headroom (gpt-5-mini: 7/7); no model reliably landed in weak_signal. The model-strength axis is binary on the existing suite because its prompts disambiguate the right tool from the verbs alone ("create new file", "change X to Y") without consulting the tool description.
This PR adds the missing piece: a suite where the right tool depends on understanding write_file's wholesale-overwrite semantics, and a deliberately weakened write_file description that doesn't communicate those semantics.
What's added
evolution/validation/suites/write_file_ambiguous.jsonl: 7 closed-loop tasks. Existing files are populated via fixture_setup so write_file would WIPE content if used on a targeted-edit task. Categories:
- wf_* (3 tasks): write_file is the right tool; picking patch would fail to wholesale-replace
- pt_* (3 tasks): patch is the right tool; picking write_file would wipe existing content
- mixed_* (1 task): either tool acceptable (control)
tests/fixtures/tool_manifests/weakened_write_file_manifest.json: JSON manifest with a 24-char write_file description ("Write content to a file.", which omits the wholesale-overwrite cue) plus a near-production patch description as the confusable neighbor.
Probe results (band confirmation)
Both descriptions land at 5/7 on the new suite — deterministic across seeds. The 2 failures are genuinely tied to the tasks (not LM noise), so GEPA has measurable headroom.
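The 5/7 result and the 0.714 closed-loop score are the same figure in two forms. The arithmetic, and the headroom it leaves, can be checked with a short sketch; the helper names are illustrative and not part of the codebase.

```python
def closed_loop_score(passed: list[bool]) -> float:
    """Fraction of closed-loop tasks that passed."""
    if not passed:
        raise ValueError("need at least one task result")
    return sum(passed) / len(passed)


def headroom_tasks(passed: list[bool]) -> int:
    """Number of tasks a better description could still win."""
    return len(passed) - sum(passed)


# 5 of the 7 tasks pass with either description.
baseline = [True] * 5 + [False] * 2

print(f"{closed_loop_score(baseline):.3f}")  # 0.714
print(headroom_tasks(baseline))              # 2

# Winning one or both of the failing tasks moves the score to 6/7 or 7/7.
print(f"{6 / 7:.3f}")  # 0.857
print(f"{7 / 7:.3f}")  # 1.000
```

With only 7 tasks, each gained task moves the score by about 0.143, so any movement in the closed-loop number corresponds to a whole task changing outcome.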
Smoke results
3 seeds at --gepa-minibatch-size 8 + 1 control at --gepa-minibatch-size 3 against the weakened fixture + ambiguous suite. ~$15 LM cost, ~7h sequential wall time (the control was killed mid-run by an unrelated process death, but had completed 102 iterations, giving us full data).
Two clean validation signals:
The evolved description GEPA produced (seed 42):
- Baseline: "Write content to a file." (24 chars)
- Evolved: "Write or overwrite file content, including new files." (53 chars)
The evolved version added the overwrite cue the ambiguous suite specifically tests for. GEPA learned the right thing.
Verdict
Path E is the right ship at the mechanism level. Every proposal that GEPA accepted was then rejected by the deploy gate, because the gate scores synthetic holdout (saturated at 1.000) and doesn't surface closed-loop wins. That's a separate framework gap, not a Path E problem.
Next workstream
Surfaced by this experiment: the deploy gate's blindness to closed-loop signal when synthetic is saturated. When evolved_cl_score > baseline_cl_score and synthetic holdout is pegged, the gate should accept based on the CL improvement instead of rejecting for lack of synthetic delta. Held for a separate PR; it needs design (how to weight signals when both have movement, how to handle CL's small N for confidence intervals, what the user-facing flag/config looks like).
Test plan
- uv run pytest -q: 1090 passed (no source changes; fixtures only).
- Band: weak_signal, Closed-loop: 0.714 for the weakened fixture + ambiguous suite at seeds 42 and 43 with gpt-5-mini.