test: weakened write_file fixture + ambiguous-task suite for Path E retro-validation by jramos · Pull Request #68 · jramos/agent-self-evolution

jramos · 2026-05-23T04:23:41Z

Summary

Adds two checked-in fixtures that together create a stable weak_signal baseline in the saturation pre-flight, so we can retro-validate the --gepa-minibatch-size flag (PR #65) against an honest closed-loop signal.

Background

PR #65 shipped Path E (--gepa-minibatch-size) with a multi-seed smoke that we later discovered was contaminated by the hermes prefix-routing bug (fixed in PR #66). After that fix, every probe against the existing write_file.jsonl suite either landed at uniform_failure (gpt-5-nano: 0/7) or no_headroom (gpt-5-mini: 7/7) — no model reliably landed in weak_signal. The model-strength axis is binary on the existing suite because its prompts disambiguate the right tool from the verbs alone ("create new file", "change X to Y") without consulting the tool description.

This PR adds the missing piece: a suite where the right tool depends on understanding write_file's wholesale-overwrite semantics, and a deliberately-weakened write_file description that doesn't communicate those semantics.

What's added

evolution/validation/suites/write_file_ambiguous.jsonl — 7 closed-loop tasks. Existing files are populated via fixture_setup so write_file would WIPE content if used on a targeted-edit task. Categories:

wf_* (3 tasks): write_file is the right tool; picking patch would fail to wholesale-replace
pt_* (3 tasks): patch is the right tool; picking write_file would wipe existing content
mixed_* (1 task): either tool acceptable (control)
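The category prefixes above imply a simple scoring rule. A minimal sketch of it follows, assuming the task id carries the prefix; the helper names and the example task ids are hypothetical and are not taken from the suite file.

```python
# Hypothetical helper: which tool picks count as correct for a task,
# keyed on the task-id prefixes described in the category list.
PREFIX_TO_TOOLS = {
    "wf_": {"write_file"},               # wholesale replace is intended
    "pt_": {"patch"},                    # targeted edit; write_file would wipe content
    "mixed_": {"write_file", "patch"},   # control: either tool is acceptable
}


def acceptable_tools(task_id: str) -> set[str]:
    """Return the set of tools that count as a correct pick for this task."""
    for prefix, tools in PREFIX_TO_TOOLS.items():
        if task_id.startswith(prefix):
            return set(tools)
    raise ValueError(f"unknown task category: {task_id!r}")


def is_correct_pick(task_id: str, chosen_tool: str) -> bool:
    """True when the chosen tool is acceptable for the task's category."""
    return chosen_tool in acceptable_tools(task_id)
```

For example, `is_correct_pick("pt_example", "write_file")` is false under this rule, which is the failure mode the pt_* tasks are meant to expose.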

tests/fixtures/tool_manifests/weakened_write_file_manifest.json — JSON manifest with a 24-char write_file description ("Write content to a file.", omits the wholesale-overwrite cue) plus a near-production patch description as the confusable neighbor.

Probe results (band confirmation)

| Manifest | Suite | Seed | Holdout | Closed-loop | Band |
| --- | --- | --- | --- | --- | --- |
| Production hermes | original write_file.jsonl | 42 | 0.987 | 1.000 (7/7) | no_headroom |
| Production hermes | ambiguous | 42 | 0.987 | 0.714 (5/7) | weak_signal |
| Weakened | ambiguous | 42 | 1.000 | 0.714 (5/7) | weak_signal |
| Weakened | ambiguous | 43 | 1.000 | 0.714 (5/7) | weak_signal |

Both descriptions land at 5/7 on the new suite. The weakened manifest reproduces 5/7 at seeds 42 and 43, so the 2 failures appear tied to the tasks rather than to LM noise, and GEPA has measurable headroom.
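The closed-loop figures quoted here are plain pass fractions. The sketch below recomputes them and labels only the two endpoint bands reported in the Background section (0/7 and 7/7). The real pre-flight classifier is not shown in this PR, so the function names are hypothetical and anything between the endpoints is deliberately left unclassified.

```python
from typing import Optional


def closed_loop_score(passed: int, total: int) -> float:
    """Fraction of closed-loop tasks passed, rounded to three decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(passed / total, 3)


def endpoint_band(passed: int, total: int) -> Optional[str]:
    """Label only the two endpoints reported in the PR.

    Assumption: the in-between bands (weak_signal, healthy) depend on rules
    that this PR does not spell out, so they return None here.
    """
    if passed == 0:
        return "uniform_failure"
    if passed == total:
        return "no_headroom"
    return None
```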

Smoke results

Three seeds at --gepa-minibatch-size 8 plus one control at --gepa-minibatch-size 3, all against the weakened fixture and the ambiguous suite. Cost: ~$15 in LM calls and ~7h of sequential wall-clock time. The control was killed mid-run by an unrelated process death, but it had already completed 102 iterations, more than any mb=8 seed, so its data is usable for the comparison.

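| Run | mb | Acceptances | Tie-rejects | All-perfect skips | Iters | Skip rate |
| --- | --- | --- | --- | --- | --- | --- |
| seed 42 | 8 | 6 | 17 | 57 | 80 | 71% |
| seed 43 | 8 | 2 | 25 | 70 | 97 | 72% |
| seed 44 | 8 | 4 | 21 | 62 | 87 | 71% |
| seed 42 control | 3 | 1 | 10 | 90 | 102 | 88% |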
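The skip-rate column can be recomputed from the raw counts in the table. The sketch below does that and also derives acceptances per iteration, which normalizes for the different run lengths. The tuple layout and function names are illustrative only.

```python
# Rows copied from the smoke-results table:
# (run, minibatch size, acceptances, tie-rejects, all-perfect skips, iterations)
RUNS = [
    ("seed 42", 8, 6, 17, 57, 80),
    ("seed 43", 8, 2, 25, 70, 97),
    ("seed 44", 8, 4, 21, 62, 87),
    ("seed 42 control", 3, 1, 10, 90, 102),
]


def skip_rate_percent(skips: int, iters: int) -> int:
    """All-perfect skips as a whole-number percentage of iterations."""
    return round(100 * skips / iters)


def acceptance_rate(acceptances: int, iters: int) -> float:
    """Accepted proposals per iteration."""
    return acceptances / iters


skip_rates = [skip_rate_percent(row[4], row[5]) for row in RUNS]

# Pool the three mb=8 seeds so they can be compared with the single control run.
mb8_rows = [row for row in RUNS if row[1] == 8]
mb8_acceptances = sum(row[2] for row in mb8_rows)
mb8_iters = sum(row[5] for row in mb8_rows)
```

Pooling gives 12 acceptances over 264 iterations for mb=8, against 1 over 102 for the control.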

Two clean validation signals:

mb=8 produced 12 acceptances across 3 seeds (4 per run on average, over 264 iterations in total); the mb=3 control produced 1 in 102 iterations. That is 12x the raw acceptance count, or roughly 4x per run, and it is exactly the mechanism Path E was designed to unblock.
mb=8 reduced the all-subsample-perfect short-circuit rate from 88% to 71-72%, matching the hypergeometric expectation (a wider minibatch contains a discriminating example more often).
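The hypergeometric argument can be made concrete. If a minibatch is drawn without replacement from a pool in which only a few examples discriminate between candidates, the chance that it contains none of them falls as the minibatch widens. The pool size and discriminating count used below are made-up illustration values, not numbers from this PR.

```python
from math import comb


def p_no_discriminating(pool_size: int, discriminating: int, minibatch_size: int) -> float:
    """P(minibatch holds zero discriminating examples), sampling without replacement.

    This is the hypergeometric probability of drawing 0 "successes":
    C(N - K, m) / C(N, m).
    """
    if not 0 <= discriminating <= pool_size:
        raise ValueError("discriminating must lie in [0, pool_size]")
    if not 0 < minibatch_size <= pool_size:
        raise ValueError("minibatch_size must lie in (0, pool_size]")
    return comb(pool_size - discriminating, minibatch_size) / comb(pool_size, minibatch_size)


# Illustration only: a hypothetical pool of 40 examples, 2 of them discriminating.
p_mb3 = p_no_discriminating(40, 2, 3)
p_mb8 = p_no_discriminating(40, 2, 8)
```

Under these assumed numbers the all-perfect probability drops from about 0.85 at mb=3 to about 0.64 at mb=8. The direction matches the observed 88% to 71% shift; the magnitudes depend on the real pool composition, which is not given here.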

The evolved description GEPA produced (seed 42):

Baseline: "Write content to a file." (24 chars)
Evolved: "Write or overwrite file content, including new files." (53 chars)

The evolved version added the overwrite cue the ambiguous suite specifically tests for. GEPA learned the right thing.
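The character counts and the presence of the overwrite cue are easy to check directly. The sketch below uses the two description strings quoted above; the helper names are illustrative.

```python
BASELINE = "Write content to a file."
EVOLVED = "Write or overwrite file content, including new files."


def growth_percent(baseline: str, evolved: str) -> float:
    """Relative length growth of the evolved description, in percent."""
    return 100 * (len(evolved) - len(baseline)) / len(baseline)


def has_overwrite_cue(description: str) -> bool:
    """Crude check: does the description mention overwriting at all?"""
    return "overwrite" in description.lower()


baseline_len = len(BASELINE)   # 24
evolved_len = len(EVOLVED)     # 53
growth = growth_percent(BASELINE, EVOLVED)  # about 121%
```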

Verdict

| Criterion | Status |
| --- | --- |
| Path E enables more proposal acceptances at GEPA gate | validated (12x vs control) |
| Path E reduces all-perfect short-circuit rate | validated (matches hypergeometric) |
| Path E produces semantically-better descriptions | validated (overwrite cue added) |
| Deploy gate surfaces the improvement | gap exposed (synthetic-only gate) |
| Control reproduces the original spike-#2 pattern | validated (1 acc vs 4 avg on mb=8) |

Path E is the right ship at the mechanism level. Every accepted proposal got rejected by the deploy gate because the gate scores synthetic holdout (saturated at 1.000) and doesn't surface closed-loop wins. That's a separate framework gap, not a Path E problem.

Next workstream

Surfaced by this experiment: the deploy gate's blindness to closed-loop signal when synthetic is saturated. When evolved_cl_score > baseline_cl_score and synthetic holdout is pegged, the gate should accept based on the CL improvement instead of rejecting for lack of synthetic delta. Held for a separate PR — needs design (how to weight signals when both have movement, how to handle CL's small N for confidence intervals, what the user-facing flag/config looks like).
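As a starting point for that design discussion, the proposed rule can be written as a single predicate. This is only a sketch of the idea stated above: the function name and the saturation_floor parameter are assumptions, and the open questions (signal weighting, small-N confidence, configuration surface) are not addressed.

```python
def accept_on_closed_loop(
    baseline_cl_score: float,
    evolved_cl_score: float,
    baseline_synthetic: float,
    evolved_synthetic: float,
    saturation_floor: float = 0.999,  # assumed cutoff for "pegged"; not from the PR
) -> bool:
    """True when synthetic holdout is pegged and closed-loop strictly improved.

    When this returns False, the caller would fall back to the existing
    synthetic gate rather than reject outright.
    """
    synthetic_pegged = (
        baseline_synthetic >= saturation_floor
        and evolved_synthetic >= saturation_floor
    )
    return synthetic_pegged and evolved_cl_score > baseline_cl_score
```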

Test plan

uv run pytest -q — 1090 passed (no source changes; fixtures only).
Manual probe confirms Band: weak_signal, Closed-loop: 0.714 for the weakened fixture + ambiguous suite at seeds 42 and 43 with gpt-5-mini.
Full smoke: 12 GEPA acceptances across 3 mb=8 seeds vs 1 in mb=3 control (12x mechanism validation).

…lidation

Two fixtures that together create a stable weak_signal baseline for the saturation pre-flight, enabling closed-loop retro-validation of the --gepa-minibatch-size flag against an honest behavioral signal.

evolution/validation/suites/write_file_ambiguous.jsonl
7 closed-loop tasks where the right tool choice depends on understanding write_file's wholesale-overwrite semantics. Tasks populate existing files via fixture_setup so write_file would WIPE content if used wrongly on a targeted-edit task.

tests/fixtures/tool_manifests/weakened_write_file_manifest.json
A 24-char write_file description ("Write content to a file.") plus a near-production patch description as the confusable neighbor. The weakening omits the wholesale-overwrite cue.

Probe-loop results (gpt-5-mini validator, openai/gpt-5-mini routed correctly post-PR-66):

- Strong description + ambiguous suite, seed 42 → CL 0.714, weak_signal
- Weakened fixture + ambiguous suite, seed 42 → CL 0.714, weak_signal
- Weakened fixture + ambiguous suite, seed 43 → CL 0.714, weak_signal

Both descriptions land at 5/7 with the new suite, which means GEPA proposals have real headroom (CL can move toward 6/7 or 7/7). The synthetic judge stays saturated; closed-loop is the only gradient.

* feat(quality_gate): add _check_cl_primary_gate helper

Pure function returning a ConstraintResult for the closed-loop-primary deploy decision. Used when saturation pre-flight reports weak_signal band. Required gain scales with description growth, mirroring the synthetic gate's free_threshold + slope shape; synthetic regression tolerance of 0.05 protects against catastrophic judge collapse.

11 unit tests cover the decision-rule math including the PR #68 calibration point (+2 gain on +121% growth -> required 2, just passes) and wallpaper protection (+1 gain on +400% growth -> required 4, fails).

* refactor(evolve_tool): preserve SaturationReport fields for deploy gate

Today only sat_report.holdout_per_example survives past the preflight call site; subsequent CL-aware gate work needs the band classification and baseline CL per-task scores too. Bind four new locals next to the existing cache: band, cl_per_example, holdout_score, cl_score. All default to None on the --no-saturation-check path so the deploy gate can branch safely.

No behavior change; existing tests pass unchanged.

* feat(evolve_tool): branch deploy gate on saturation band

When preflight reports weak_signal AND closed-loop is configured, run a one-shot force_run on the evolved description and gate the deploy decision on closed-loop signal via _check_cl_primary_gate.

Three abort paths are written to gate_decision.json with diagnostic payloads (schema v5):

- cl_eval_failed: force_run raised an exception
- cl_eval_incomplete: one or more evolved CL tasks abstained (runner errored; distinguished from genuine task failure via the existing TaskResult.abstained field)
- cl_primary_gate reject: returned by the gate helper itself

_check_absolute_char_ceiling is preserved in the CL-primary path: wallpaper protection is orthogonal to which signal we gate on.

All other bands (healthy / no_headroom / uniform_failure / no preflight) fall through to the existing synthetic path unchanged.
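The _check_cl_primary_gate commit note above describes the decision rule only in outline. The sketch below is one possible shape for it, not the project's implementation. The free-threshold and slope values are guesses chosen solely because they reproduce the two calibration points quoted in the note; the 0.05 synthetic tolerance is the only constant taken from the text, and all function and constant names are hypothetical.

```python
import math

# ASSUMED values: picked only so the two calibration points in the commit
# note come out right. The real constants live in quality_gate.py.
ASSUMED_FREE_THRESHOLD = 0.2   # growth fraction allowed before extra gain is required
ASSUMED_GROWTH_SLOPE = 1.0     # required tasks gained per unit of excess growth
SYNTH_TOLERANCE = 0.05         # synthetic regression tolerance stated in the note


def required_gain(baseline_len: int, evolved_len: int) -> int:
    """Closed-loop tasks that must be gained to justify the description growth."""
    growth = (evolved_len - baseline_len) / baseline_len
    excess = max(0.0, growth - ASSUMED_FREE_THRESHOLD)
    return max(1, math.ceil(excess * ASSUMED_GROWTH_SLOPE))


def cl_primary_passes(
    baseline_len: int,
    evolved_len: int,
    tasks_gained: int,
    baseline_synthetic: float,
    evolved_synthetic: float,
) -> bool:
    """Reject on synthetic collapse first, then compare gain with the requirement."""
    if evolved_synthetic < baseline_synthetic - SYNTH_TOLERANCE:
        return False
    return tasks_gained >= required_gain(baseline_len, evolved_len)
```

With a 24-char baseline, a 53-char description (+121%) needs 2 gained tasks under these assumptions, and a 120-char description (+400%) needs 4, so a single gained task fails the wallpaper check.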
* fix(evolve_tool): narrow cl_constraint type, surface saved-variant path

Code review found two minor issues in the CL-primary branch added by ae1fea6:

1. cl_constraint: Optional[ConstraintResult] flows into a list[ConstraintResult] without type narrowing at the post-branch growth_constraints assignment. Added an assert so the type checker sees the correlation between the two 'if use_cl_primary:' blocks.
2. Both new abort paths wrote evolved_FAILED.json but skipped the 'Saved failed variant to {path}' console line that existing abort paths print. Operators triaging a flake need to know the file was saved and where; added the print to both new paths.

No behavior change for any test.

* feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 across all four gate_decision write sites (static-fail, cl_eval_failed, cl_eval_incomplete, success/reject). The bump is additive.

New fields are present only when use_cl_primary == True: decision_signal, baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model.

When preflight was skipped (--no-saturation-check), records reason_synthetic: 'preflight_skipped' so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

cl_required_gain and synthetic_sanity_check reuse the CL_PRIMARY_GROWTH_SLOPE / CL_PRIMARY_GROWTH_FREE_THRESHOLD / CL_PRIMARY_SYNTH_TOLERANCE constants from quality_gate.py so the gate-decision payload can't drift from the actual gate logic.

Existing v4 consumers see byte-identical output for synthetic-mode runs except the new decision_signal: 'synthetic' string.
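The additivity contract in the schema v5 note can be expressed as a small payload check. The sketch below assumes the payload is a flat dict; the CL-specific field names are copied from the note, while the function names and the "accept" v4 field used in the usage example are hypothetical.

```python
# CL-specific fields listed in the schema v5 commit note.
# decision_signal is treated separately because it is always present in v5.
CL_PRIMARY_FIELDS = frozenset({
    "baseline_closed_loop_per_example",
    "evolved_closed_loop_per_example",
    "evolved_closed_loop_errored_tasks",
    "cl_tasks_gained",
    "cl_required_gain",
    "synthetic_sanity_check",
    "evolved_cl_eval_cost_usd",
    "band_trigger_score",
    "validator_agent_model",
})


def missing_v4_fields(v4_payload: dict, v5_payload: dict) -> set[str]:
    """Fields that v4 had but v5 dropped; empty when the bump is additive."""
    return set(v4_payload) - set(v5_payload)


def check_v5_payload(payload: dict, use_cl_primary: bool) -> list[str]:
    """Return human-readable problems; an empty list means the payload looks valid."""
    problems = []
    if payload.get("schema_version") != 5:
        problems.append("schema_version is not 5")
    if "decision_signal" not in payload:
        problems.append("decision_signal missing")
    if use_cl_primary:
        for name in sorted(CL_PRIMARY_FIELDS - set(payload)):
            problems.append(f"{name} missing")
    return problems
```

A synthetic-mode payload such as `{"schema_version": 5, "decision_signal": "synthetic", "accept": False}` passes `check_v5_payload(..., use_cl_primary=False)` with no problems reported.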
* test(evolve_tool): integration tests for CL-aware deploy gate

10 tests covering the deploy-gate branch on saturation band:

- weak_signal triggers evolved CL eval (force_run called post-GEPA)
- healthy/no_headroom fall through to synthetic
- --no-saturation-check records reason_synthetic in JSON
- all v5 fields present + correct types
- force_run failure writes aborted decision with diagnostics
- evolved task abstention writes cl_eval_incomplete (not regression)
- absolute_char_ceiling still enforced in CL-primary path

Mocks the synthetic dataset builder + closed-loop cache at the same seams as test_evolve_tool_saturation_preflight.py; calls evolve() directly (rather than via CliRunner) so each test can inspect gate_decision.json at a pinned output_dir.

* test(evolve_tool): tighten Test 1 assertion, add uniform_failure test, pin evolved_FAILED.json

Code-review feedback on the CL-aware gate test suite:

1. Test 1's force_run assertion was substring-based on str(call_args), which silently misses regressions where force_run is called twice or with extra kwargs. Tightened to assert_called_once_with.
2. Added test_uniform_failure_band_falls_through_to_synthetic_gate pinning the spec edge-case (uniform_failure -> synthetic path). Without it, expanding use_cl_primary to include uniform_failure would silently change behavior without a test failing.
3. Test 9 (cl_eval_incomplete) now asserts evolved_FAILED.json is written, mirroring Test 8's assertion on the cl_eval_failed abort path. Production writes the file on both abort paths.

* test(evolve_tool): schema v5 regression tests

Pin the v4 → v5 additivity contract: every v4 field must still exist in v5 output, plus decision_signal (always) and the CL-specific fields (when use_cl_primary fired). Future schema bumps should add a TestSchemaV{N}Regression class following this pattern.

* fix(evolve_tool): schema v5 consistency across all gate_decision write sites

Final-review feedback caught two seam leaks:

1. write_cost_ceiling_abort was hard-coding schema_version=4. If the cost ceiling trips during the CL-primary force_run call, the resulting gate_decision.json had v4 in a v5 directory. Made the schema_version a keyword arg (default 4 for skill-side callers that haven't bumped yet); tool-side passes 5.
2. The static_constraint_failure payload was bumped to v5 in Task 4 but never had decision_signal added. Every other v5 path has it. Set to 'synthetic' since static-fail fires before any CL eval.
3. Extended TestSchemaV5Regression with abort-path coverage so the above issues couldn't have slipped through. Three new tests pin schema_version and decision_signal on cl_eval_failed, cl_eval_incomplete, and static_constraint_failure payloads.
4. Renamed test_accepts_at_pr_68_calibration_point to test_accepts_at_24char_baseline_calibration_point per the project convention against exposing internal PR numbers in code.
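The Test 1 tightening described above is worth illustrating, because the weakness of a substring check is easy to miss. The sketch uses only the standard unittest.mock API; the force_run mock and its argument are stand-ins, not the project's real call signature.

```python
from unittest.mock import MagicMock

# Stand-in for the closed-loop runner; the argument is illustrative only.
force_run = MagicMock()
force_run("evolved description")
force_run("evolved description")  # regression: an accidental second call

# Substring check on str(call_args) only inspects the most recent call,
# so it cannot notice that the mock was called twice.
weak_check_passes = "evolved description" in str(force_run.call_args)

# assert_called_once_with fails if the call count is not exactly one
# or if the arguments differ in any way.
try:
    force_run.assert_called_once_with("evolved description")
    strict_check_passes = True
except AssertionError:
    strict_check_passes = False
```

Here the substring check passes while the strict assertion fails, which is the regression class the tightened test now catches.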

jramos merged commit 4449fbd into main May 23, 2026
4 checks passed

jramos deleted the validate/path-e-weakened-baseline branch May 23, 2026 13:24

jramos mentioned this pull request May 23, 2026

feat: deploy-gate CL-awareness #69

Merged

3 tasks

jramos mentioned this pull request May 23, 2026

feat: skill-side deploy-gate CL-awareness (lockstep with tool side) #70

Merged

3 tasks

jramos merged 1 commit into main from validate/path-e-weakened-baseline
