test: weakened write_file fixture + ambiguous-task suite for Path E retro-validation #68

Merged: jramos merged 1 commit into main from validate/path-e-weakened-baseline on May 23, 2026

@jramos jramos commented May 23, 2026

Summary

Adds two checked-in fixtures that together create a stable weak_signal baseline in the saturation pre-flight, so we can retro-validate the --gepa-minibatch-size flag (PR #65) against an honest closed-loop signal.

Background

PR #65 shipped Path E (--gepa-minibatch-size) with a multi-seed smoke that we later discovered was contaminated by the hermes prefix-routing bug (fixed in PR #66). After that fix, every probe against the existing write_file.jsonl suite either landed at uniform_failure (gpt-5-nano: 0/7) or no_headroom (gpt-5-mini: 7/7) — no model reliably landed in weak_signal. The model-strength axis is binary on the existing suite because its prompts disambiguate the right tool from the verbs alone ("create new file", "change X to Y") without consulting the tool description.

This PR adds the missing piece: a suite where the right tool depends on understanding write_file's wholesale-overwrite semantics, and a deliberately-weakened write_file description that doesn't communicate those semantics.

What's added

evolution/validation/suites/write_file_ambiguous.jsonl — 7 closed-loop tasks. Existing files are populated via fixture_setup so write_file would WIPE content if used on a targeted-edit task. Categories:

  • wf_* (3 tasks): write_file is the right tool; picking patch would fail to wholesale-replace
  • pt_* (3 tasks): patch is the right tool; picking write_file would wipe existing content
  • mixed_* (1 task): either tool acceptable (control)

tests/fixtures/tool_manifests/weakened_write_file_manifest.json — JSON manifest with a 24-char write_file description ("Write content to a file.", omits the wholesale-overwrite cue) plus a near-production patch description as the confusable neighbor.

Probe results (band confirmation)

| Manifest | Suite | Seed | Holdout | Closed-loop | Band |
|---|---|---|---|---|---|
| Production hermes | original write_file.jsonl | 42 | 0.987 | 1.000 (7/7) | no_headroom |
| Production hermes | ambiguous | 42 | 0.987 | 0.714 (5/7) | weak_signal |
| Weakened | ambiguous | 42 | 1.000 | 0.714 (5/7) | weak_signal |
| Weakened | ambiguous | 43 | 1.000 | 0.714 (5/7) | weak_signal |

Both descriptions land at 5/7 on the new suite — deterministic across seeds. The 2 failures are genuinely tied to the tasks (not LM noise), so GEPA has measurable headroom.

Smoke results

3 seeds at --gepa-minibatch-size 8 plus 1 control at --gepa-minibatch-size 3, all against the weakened fixture + ambiguous suite. ~$15 LM cost, ~7h sequential wall-clock time. The control was killed mid-run by an unrelated process death, but it had already completed 102 iterations, more than any mb=8 run, so its data is usable for the comparison.

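| Run | mb | Acceptances | Tie-rejects | All-perfect skips | Iters | Skip rate |
|---|---|---|---|---|---|---|
| seed 42 | 8 | 6 | 17 | 57 | 80 | 71% |
| seed 43 | 8 | 2 | 25 | 70 | 97 | 72% |
| seed 44 | 8 | 4 | 21 | 62 | 87 | 71% |
| seed 42 control | 3 | 1 | 10 | 90 | 102 | 88% |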
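
For anyone re-deriving the last column: the skip rate is simply all-perfect skips divided by iterations. The sketch below recomputes it from the rows above and also reports acceptances per iteration, which is a derived quantity not reported in the table. The helper names are illustrative only.

```python
# Rows copied from the smoke table:
# (run, minibatch size, acceptances, all-perfect skips, iterations)
RUNS = [
    ("seed 42", 8, 6, 57, 80),
    ("seed 43", 8, 2, 70, 97),
    ("seed 44", 8, 4, 62, 87),
    ("seed 42 control", 3, 1, 90, 102),
]


def summarize(runs):
    """Return per-run skip rate and acceptances per iteration."""
    summary = {}
    for name, _mb, acceptances, skips, iters in runs:
        summary[name] = {
            "skip_rate": skips / iters,
            "acceptances_per_iter": acceptances / iters,
        }
    return summary


SUMMARY = summarize(RUNS)

for name, stats in SUMMARY.items():
    print(
        f"{name}: skip rate {stats['skip_rate']:.0%}, "
        f"acceptances/iter {stats['acceptances_per_iter']:.3f}"
    )
```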

Two clean validation signals:

  1. mb=8 produced 12 total acceptances across 3 seeds (4 per run on average); the single mb=3 control produced 1 in 102 iterations, more iterations than any mb=8 run. That is 12x the control in aggregate and 4x per run, exactly the mechanism Path E was designed to unblock.
  2. mb=8 reduced the all-subsample-perfect short-circuit rate from 88% → 71%, matching hypergeometric expectation (wider minibatches contain a discriminating example more often).
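
The hypergeometric argument in point 2 can be sketched as follows. If a minibatch of size m is drawn without replacement from a pool of N examples, K of which are discriminating (the current candidate does not score perfectly on them), then the chance of an all-perfect minibatch is C(N-K, m) / C(N, m). The pool size and discriminating count used below are placeholders for illustration, not values taken from these runs.

```python
from math import comb


def p_all_perfect(pool_size: int, discriminating: int, minibatch: int) -> float:
    """Probability that a minibatch drawn without replacement contains
    zero discriminating examples (i.e. the all-perfect short-circuit fires)."""
    if not 0 <= discriminating <= pool_size:
        raise ValueError("discriminating must be between 0 and pool_size")
    if not 0 < minibatch <= pool_size:
        raise ValueError("minibatch must be between 1 and pool_size")
    # math.comb returns 0 when k > n, so an oversized minibatch relative to
    # the non-discriminating examples correctly yields probability 0.
    return comb(pool_size - discriminating, minibatch) / comb(pool_size, minibatch)


# Placeholder pool: 50 examples, 2 of them discriminating (assumed values).
POOL, DISCRIMINATING = 50, 2
for mb in (3, 8):
    print(f"mb={mb}: P(all perfect) = {p_all_perfect(POOL, DISCRIMINATING, mb):.3f}")
```

Whatever the real pool looks like, the probability falls monotonically as the minibatch widens, which is the direction of the 88% to 71% change.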

The evolved description GEPA produced (seed 42):

  • Baseline: "Write content to a file." (24 chars)
  • Evolved: "Write or overwrite file content, including new files." (53 chars)

The evolved version added the overwrite cue the ambiguous suite specifically tests for. GEPA learned the right thing.
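
The two lengths quoted above, and the relative growth between them, can be checked directly. This is plain arithmetic on the two strings; `relative_growth` is an illustrative helper, not a function from the repository.

```python
BASELINE = "Write content to a file."
EVOLVED = "Write or overwrite file content, including new files."


def relative_growth(baseline: str, evolved: str) -> float:
    """Description growth as a fraction of the baseline length."""
    return (len(evolved) - len(baseline)) / len(baseline)


GROWTH = relative_growth(BASELINE, EVOLVED)
print(len(BASELINE), len(EVOLVED), f"{GROWTH:+.0%}")
```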

Verdict

| Criterion | Status |
|---|---|
| Path E enables more proposal acceptances at GEPA gate | validated (12x vs control) |
| Path E reduces all-perfect short-circuit rate | validated (matches hypergeometric) |
| Path E produces semantically-better descriptions | validated (overwrite cue added) |
| Deploy gate surfaces the improvement | gap exposed (synthetic-only gate) |
| Control reproduces the original spike-#2 pattern | validated (1 acc vs 4 avg on mb=8) |

Path E is the right ship at the mechanism level. Every accepted proposal got rejected by the deploy gate because the gate scores synthetic holdout (saturated at 1.000) and doesn't surface closed-loop wins. That's a separate framework gap, not a Path E problem.

Next workstream

Surfaced by this experiment: the deploy gate's blindness to closed-loop signal when synthetic is saturated. When evolved_cl_score > baseline_cl_score and synthetic holdout is pegged, the gate should accept based on the CL improvement instead of rejecting for lack of synthetic delta. Held for a separate PR — needs design (how to weight signals when both have movement, how to handle CL's small N for confidence intervals, what the user-facing flag/config looks like).

Test plan

  • uv run pytest -q — 1090 passed (no source changes; fixtures only).
  • Manual probe confirms Band: weak_signal, Closed-loop: 0.714 for the weakened fixture + ambiguous suite at seeds 42 and 43 with gpt-5-mini.
  • Full smoke: 12 GEPA acceptances across 3 mb=8 seeds vs 1 in mb=3 control (12x mechanism validation).

…lidation

Two fixtures that together create a stable weak_signal baseline for the
saturation pre-flight, enabling closed-loop retro-validation of the
--gepa-minibatch-size flag against an honest behavioral signal.

evolution/validation/suites/write_file_ambiguous.jsonl
  7 closed-loop tasks where the right tool choice depends on
  understanding write_file's wholesale-overwrite semantics. Tasks
  populate existing files via fixture_setup so write_file would WIPE
  content if used wrongly on a targeted-edit task.

tests/fixtures/tool_manifests/weakened_write_file_manifest.json
  A 24-char write_file description ("Write content to a file.") plus a
  near-production patch description as the confusable neighbor. The
  weakening omits the wholesale-overwrite cue.

Probe-loop results (gpt-5-mini validator, openai/gpt-5-mini routed
correctly post-PR-66):
  Strong description + ambiguous suite, seed 42 → CL 0.714, weak_signal
  Weakened fixture + ambiguous suite, seed 42 → CL 0.714, weak_signal
  Weakened fixture + ambiguous suite, seed 43 → CL 0.714, weak_signal

Both descriptions land at 5/7 with the new suite, which means GEPA
proposals have real headroom (CL can move toward 6/7 or 7/7). The
synthetic judge stays saturated; closed-loop is the only gradient.
@jramos jramos merged commit 4449fbd into main May 23, 2026
4 checks passed
@jramos jramos deleted the validate/path-e-weakened-baseline branch May 23, 2026 13:24
@jramos jramos mentioned this pull request May 23, 2026
3 tasks
jramos added a commit that referenced this pull request May 23, 2026
* feat(quality_gate): add _check_cl_primary_gate helper

Pure function returning a ConstraintResult for the closed-loop-primary
deploy decision. Used when saturation pre-flight reports weak_signal
band. Required gain scales with description growth, mirroring the
synthetic gate's free_threshold + slope shape; synthetic regression
tolerance of 0.05 protects against catastrophic judge collapse.

11 unit tests cover the decision-rule math including the PR #68
calibration point (+2 gain on +121% growth -> required 2, just passes)
and wallpaper protection (+1 gain on +400% growth -> required 4, fails).
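
A rough sketch of the decision rule described in this commit, for readers who want to see its shape. The real helper returns a ConstraintResult and takes its constants from quality_gate.py; neither is shown in this thread. The `GateResult` type and the free-threshold and slope values below are assumptions, chosen only so that the two calibration points quoted above come out as stated. Only the 0.05 synthetic tolerance comes from the commit message.

```python
from dataclasses import dataclass
from math import ceil

# ASSUMED placeholder constants, not the values in quality_gate.py.
GROWTH_FREE_THRESHOLD = 0.0  # growth fraction that costs nothing (assumed)
GROWTH_SLOPE = 1.0           # required tasks per 100% growth (assumed)
SYNTH_TOLERANCE = 0.05       # stated in the commit message


@dataclass(frozen=True)
class GateResult:  # hypothetical stand-in for ConstraintResult
    passed: bool
    required_gain: int
    reason: str


def required_gain(growth: float) -> int:
    """Closed-loop tasks the evolved description must gain, scaled by growth."""
    excess = max(0.0, growth - GROWTH_FREE_THRESHOLD)
    return max(1, ceil(excess * GROWTH_SLOPE))


def check_cl_primary(
    tasks_gained: int, growth: float, baseline_synth: float, evolved_synth: float
) -> GateResult:
    need = required_gain(growth)
    # Guard against the synthetic judge collapsing while CL improves.
    if evolved_synth < baseline_synth - SYNTH_TOLERANCE:
        return GateResult(False, need, "synthetic regression beyond tolerance")
    if tasks_gained < need:
        return GateResult(False, need, "closed-loop gain below requirement")
    return GateResult(True, need, "closed-loop gain meets requirement")
```

Under these assumed constants, +2 tasks on +121% growth requires 2 and passes, while +1 task on +400% growth requires 4 and fails.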

* refactor(evolve_tool): preserve SaturationReport fields for deploy gate

Today only sat_report.holdout_per_example survives past the preflight
call site; subsequent CL-aware gate work needs the band classification
and baseline CL per-task scores too. Bind four new locals next to the
existing cache: band, cl_per_example, holdout_score, cl_score. All
default to None on the --no-saturation-check path so the deploy gate
can branch safely.

No behavior change; existing tests pass unchanged.

* feat(evolve_tool): branch deploy gate on saturation band

When preflight reports weak_signal AND closed-loop is configured, run
a one-shot force_run on the evolved description and gate the deploy
decision on closed-loop signal via _check_cl_primary_gate.

Three abort paths are written to gate_decision.json with diagnostic
payloads (schema v5):
  - cl_eval_failed: force_run raised an exception
  - cl_eval_incomplete: one or more evolved CL tasks abstained
    (runner errored — distinguished from genuine task failure via
    the existing TaskResult.abstained field)
  - cl_primary_gate reject: returned by the gate helper itself

_check_absolute_char_ceiling is preserved in the CL-primary path —
wallpaper protection is orthogonal to which signal we gate on. All
other bands (healthy / no_headroom / uniform_failure / no preflight)
fall through to the existing synthetic path unchanged.
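
The branching described in this commit can be sketched as below. The function name, the `TaskOutcome` type, and the string return values are hypothetical stand-ins. The source only establishes the conditions: weak_signal plus closed-loop configured, a force_run that may raise, and an abstained flag on task results.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class TaskOutcome:  # hypothetical stand-in for TaskResult
    task_id: str
    passed: bool
    abstained: bool = False


def decide_gate_path(
    band: Optional[str],
    closed_loop_configured: bool,
    force_run: Callable[[], Sequence[TaskOutcome]],
) -> str:
    """Return which deploy-gate path a run would take."""
    use_cl_primary = band == "weak_signal" and closed_loop_configured
    if not use_cl_primary:
        # healthy / no_headroom / uniform_failure / no preflight (band is None)
        return "synthetic"
    try:
        outcomes = force_run()
    except Exception:
        return "cl_eval_failed"
    # An abstention means the runner errored, not that the task failed.
    if any(outcome.abstained for outcome in outcomes):
        return "cl_eval_incomplete"
    return "cl_primary_gate"
```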

* fix(evolve_tool): narrow cl_constraint type, surface saved-variant path

Code review found two minor issues in the CL-primary branch added by
ae1fea6:

1. cl_constraint: Optional[ConstraintResult] flows into a
   list[ConstraintResult] without type narrowing at the post-branch
   growth_constraints assignment. Added an assert so the type checker
   sees the correlation between the two 'if use_cl_primary:' blocks.

2. Both new abort paths wrote evolved_FAILED.json but skipped the
   'Saved failed variant to {path}' console line that existing abort
   paths print. Operators triaging a flake need to know the file was
   saved and where; added the print to both new paths.

No behavior change for any test.
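
Item 1 is the usual Optional-narrowing pattern. A minimal illustration, with a stand-in `ConstraintResult` and a hypothetical `collect_constraints` function rather than the real evolve_tool code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConstraintResult:  # stand-in; the real class has its own fields
    name: str
    passed: bool


def collect_constraints(use_cl_primary: bool) -> list[ConstraintResult]:
    cl_constraint: Optional[ConstraintResult] = None
    if use_cl_primary:
        cl_constraint = ConstraintResult("cl_primary_gate", True)

    constraints: list[ConstraintResult] = [
        ConstraintResult("absolute_char_ceiling", True)
    ]
    if use_cl_primary:
        # A type checker cannot correlate this block with the one above,
        # so it still sees Optional[ConstraintResult] here. The assert
        # narrows the type and documents the invariant.
        assert cl_constraint is not None
        constraints.append(cl_constraint)
    return constraints
```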

* feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 across all four gate_decision write sites
(static-fail, cl_eval_failed, cl_eval_incomplete, success/reject). The
bump is additive. New fields are present only when use_cl_primary == True:
  decision_signal, baseline_closed_loop_per_example,
  evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks,
  cl_tasks_gained, cl_required_gain, synthetic_sanity_check,
  evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model.

When preflight was skipped (--no-saturation-check), records
reason_synthetic: 'preflight_skipped' so downstream consumers can
distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

cl_required_gain and synthetic_sanity_check reuse the
CL_PRIMARY_GROWTH_SLOPE / CL_PRIMARY_GROWTH_FREE_THRESHOLD /
CL_PRIMARY_SYNTH_TOLERANCE constants from quality_gate.py so the
gate-decision payload can't drift from the actual gate logic.

Existing v4 consumers see byte-identical output for synthetic-mode
runs except the new decision_signal: 'synthetic' string.

* test(evolve_tool): integration tests for CL-aware deploy gate

10 tests covering the deploy-gate branch on saturation band:
  - weak_signal triggers evolved CL eval (force_run called post-GEPA)
  - healthy/no_headroom fall through to synthetic
  - --no-saturation-check records reason_synthetic in JSON
  - all v5 fields present + correct types
  - force_run failure writes aborted decision with diagnostics
  - evolved task abstention writes cl_eval_incomplete (not regression)
  - absolute_char_ceiling still enforced in CL-primary path

Mocks the synthetic dataset builder + closed-loop cache at the same
seams as test_evolve_tool_saturation_preflight.py; calls evolve()
directly (rather than via CliRunner) so each test can inspect
gate_decision.json at a pinned output_dir.

* test(evolve_tool): tighten Test 1 assertion, add uniform_failure test, pin evolved_FAILED.json

Code-review feedback on the CL-aware gate test suite:

1. Test 1's force_run assertion was substring-based on str(call_args),
   which silently misses regressions where force_run is called twice
   or with extra kwargs. Tightened to assert_called_once_with.

2. Added test_uniform_failure_band_falls_through_to_synthetic_gate
   pinning the spec edge-case (uniform_failure -> synthetic path).
   Without it, expanding use_cl_primary to include uniform_failure
   would silently change behavior without a test failing.

3. Test 9 (cl_eval_incomplete) now asserts evolved_FAILED.json is
   written, mirroring Test 8's assertion on the cl_eval_failed
   abort path. Production writes the file on both abort paths.
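
Item 1 is easy to reproduce with a bare mock. In the sketch below, the argument passed to `force_run` is a made-up placeholder. The point is only that a substring check on `str(call_args)` keeps passing when the mock is called twice, while `assert_called_once_with` does not.

```python
from unittest.mock import MagicMock

force_run = MagicMock(name="force_run")
force_run("evolved description")
force_run("evolved description")  # simulated regression: a second call

# Old style: call_args only reflects the most recent call, so this
# substring check cannot tell one call from two.
substring_check_passes = "evolved description" in str(force_run.call_args)

# Tightened style: fails unless there was exactly one call with these args.
try:
    force_run.assert_called_once_with("evolved description")
    strict_check_passes = True
except AssertionError:
    strict_check_passes = False

print(substring_check_passes, strict_check_passes)
```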

* test(evolve_tool): schema v5 regression tests

Pin the v4 → v5 additivity contract: every v4 field must still exist
in v5 output, plus decision_signal (always) and the CL-specific fields
(when use_cl_primary fired). Future schema bumps should add a
TestSchemaV{N}Regression class following this pattern.
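
The additivity contract boils down to a key-subset check. A sketch under assumptions: `schema_version` and `decision_signal` are field names from this thread, while the `decision` key and both example payloads are invented for illustration.

```python
def missing_v4_fields(v4_payload: dict, v5_payload: dict) -> set:
    """Fields present in the v4 payload but absent from the v5 payload.
    An additive schema bump must return an empty set."""
    return set(v4_payload) - set(v5_payload)


# Hypothetical payloads; only schema_version and decision_signal are
# field names taken from the commit messages.
V4_EXAMPLE = {"schema_version": 4, "decision": "reject"}
V5_EXAMPLE = {
    "schema_version": 5,
    "decision": "reject",
    "decision_signal": "synthetic",
}

print(missing_v4_fields(V4_EXAMPLE, V5_EXAMPLE))
```

A future TestSchemaV{N}Regression class could run the same check against real fixtures for each pair of adjacent versions.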

* fix(evolve_tool): schema v5 consistency across all gate_decision write sites

Final-review feedback caught two seam leaks (items 1 and 2); items 3 and 4 are the follow-up test coverage and a rename:

1. write_cost_ceiling_abort was hard-coding schema_version=4. If the
   cost ceiling trips during the CL-primary force_run call, the
   resulting gate_decision.json had v4 in a v5 directory. Made the
   schema_version a keyword arg (default 4 for skill-side callers
   that haven't bumped yet); tool-side passes 5.

2. The static_constraint_failure payload was bumped to v5 in Task 4
   but never had decision_signal added. Every other v5 path has it.
   Set to 'synthetic' since static-fail fires before any CL eval.

3. Extended TestSchemaV5Regression with abort-path coverage so the
   above issues couldn't have slipped through. Three new tests pin
   schema_version and decision_signal on cl_eval_failed,
   cl_eval_incomplete, and static_constraint_failure payloads.

4. Renamed test_accepts_at_pr_68_calibration_point to
   test_accepts_at_24char_baseline_calibration_point per the project
   convention against exposing internal PR numbers in code.