feat: deploy-gate CL-awareness #69
Merged
Pure function returning a ConstraintResult for the closed-loop-primary deploy decision. Used when saturation pre-flight reports weak_signal band. Required gain scales with description growth, mirroring the synthetic gate's free_threshold + slope shape; synthetic regression tolerance of 0.05 protects against catastrophic judge collapse. 11 unit tests cover the decision-rule math including the PR #68 calibration point (+2 gain on +121% growth -> required 2, just passes) and wallpaper protection (+1 gain on +400% growth -> required 4, fails).
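A minimal sketch of the decision rule described above, not the project's implementation. The constant names come from this PR, but their values (other than the 0.05 tolerance), the `ConstraintResult` fields, the function signature, and the exact rounding are assumptions chosen only so that the sketch reproduces the two calibration points quoted in the commit message.

```python
import math
from dataclasses import dataclass

# ASSUMPTION: illustrative values picked to reproduce the two calibration
# points from the commit message; the real constants live in quality_gate.py.
CL_PRIMARY_GROWTH_FREE_THRESHOLD = 0.20  # growth fraction that costs no extra gain
CL_PRIMARY_GROWTH_SLOPE = 1.0            # extra tasks required per +100% growth
CL_PRIMARY_SYNTH_TOLERANCE = 0.05        # stated in the commit message


@dataclass(frozen=True)
class ConstraintResult:
    """Hypothetical stand-in for the project's ConstraintResult type."""
    passed: bool
    name: str
    message: str


def required_gain(growth_frac: float) -> int:
    """Closed-loop tasks that must be gained for a given description growth."""
    billable = max(0.0, growth_frac - CL_PRIMARY_GROWTH_FREE_THRESHOLD)
    return max(1, math.ceil(CL_PRIMARY_GROWTH_SLOPE * billable))


def check_cl_primary_gate(
    cl_tasks_gained: int,
    growth_frac: float,
    baseline_synth: float,
    evolved_synth: float,
) -> ConstraintResult:
    """Pure function: no I/O, so the decision math is unit-testable."""
    need = required_gain(growth_frac)
    gain_ok = cl_tasks_gained >= need
    # Synthetic score may dip slightly, but not collapse.
    synth_ok = evolved_synth >= baseline_synth - CL_PRIMARY_SYNTH_TOLERANCE
    message = (
        f"gained {cl_tasks_gained}, required {need}; "
        f"synthetic {baseline_synth:.3f} -> {evolved_synth:.3f}"
    )
    return ConstraintResult(gain_ok and synth_ok, "cl_primary_gate", message)
```

Under these assumed constants, +121% growth requires a gain of 2 and +400% growth requires a gain of 4, matching the two cases the unit tests are said to cover.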
Today only sat_report.holdout_per_example survives past the preflight call site; subsequent CL-aware gate work needs the band classification and baseline CL per-task scores too. Bind four new locals next to the existing cache: band, cl_per_example, holdout_score, cl_score. All default to None on the --no-saturation-check path so the deploy gate can branch safely. No behavior change; existing tests pass unchanged.
When preflight reports weak_signal AND closed-loop is configured, run
a one-shot force_run on the evolved description and gate the deploy
decision on closed-loop signal via _check_cl_primary_gate.
Three abort paths are written to gate_decision.json with diagnostic
payloads (schema v5):
- cl_eval_failed: force_run raised an exception
- cl_eval_incomplete: one or more evolved CL tasks abstained
(runner errored — distinguished from genuine task failure via
the existing TaskResult.abstained field)
- cl_primary_gate reject: returned by the gate helper itself
_check_absolute_char_ceiling is preserved in the CL-primary path —
wallpaper protection is orthogonal to which signal we gate on. All
other bands (healthy / no_headroom / uniform_failure / no preflight)
fall through to the existing synthetic path unchanged.
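The branch and its two eval-side abort paths can be sketched roughly as follows. This is an illustration under assumptions: `TaskResult` is reduced to a hypothetical minimal shape (only the `abstained` field is named in this PR), `decide_cl_primary` and its return dictionaries are invented for the example, and the real code writes gate_decision.json rather than returning a dict.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TaskResult:
    """Hypothetical minimal shape; only `abstained` is named in the PR."""
    task_id: str
    passed: bool
    abstained: bool = False


def decide_cl_primary(
    band: Optional[str],
    closed_loop_configured: bool,
    force_run: Callable[[str], List[TaskResult]],
    evolved_description: str,
) -> dict:
    # Only weak_signal with closed-loop configured takes the CL-primary path;
    # every other band (and "no preflight", band=None) stays synthetic.
    use_cl_primary = band == "weak_signal" and closed_loop_configured
    if not use_cl_primary:
        return {"decision_signal": "synthetic"}

    try:
        results = force_run(evolved_description)
    except Exception as exc:  # abort path 1: the eval itself failed
        return {"decision": "aborted", "reason": "cl_eval_failed", "error": str(exc)}

    # Abort path 2: a runner error (abstention) is not a task failure,
    # so it must not be scored as a regression.
    errored = [r.task_id for r in results if r.abstained]
    if errored:
        return {
            "decision": "aborted",
            "reason": "cl_eval_incomplete",
            "evolved_closed_loop_errored_tasks": errored,
        }

    # Otherwise hand the pass count to the gate helper (not shown here).
    return {"decision_signal": "closed_loop", "cl_passed": sum(r.passed for r in results)}
```

Keeping abstention separate from failure means an infrastructure flake produces an explicit abort record instead of a misleading reject.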
Code review found two minor issues in the CL-primary branch added by ae1fea6:
1. cl_constraint: Optional[ConstraintResult] flows into a list[ConstraintResult] without type narrowing at the post-branch growth_constraints assignment. Added an assert so the type checker sees the correlation between the two 'if use_cl_primary:' blocks.
2. Both new abort paths wrote evolved_FAILED.json but skipped the 'Saved failed variant to {path}' console line that existing abort paths print. Operators triaging a flake need to know the file was saved and where; added the print to both new paths.
No behavior change for any test.
Schema bumps from v4 to v5 across all four gate_decision write sites (static-fail, cl_eval_failed, cl_eval_incomplete, success/reject). The bump is additive. decision_signal is always present. The remaining new fields are present only when use_cl_primary is true: baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model.
When preflight was skipped (--no-saturation-check), the payload records reason_synthetic: 'preflight_skipped' so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'
cl_required_gain and synthetic_sanity_check reuse the CL_PRIMARY_GROWTH_SLOPE / CL_PRIMARY_GROWTH_FREE_THRESHOLD / CL_PRIMARY_SYNTH_TOLERANCE constants from quality_gate.py so the gate-decision payload can't drift from the actual gate logic. Existing v4 consumers see byte-identical output for synthetic-mode runs except the new decision_signal: 'synthetic' string.
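To make the additive contract concrete, here is a hedged sketch of how a v5 payload could be layered over a v4 one. `build_gate_decision`, its parameters, and the integer `SCHEMA_VERSION` are assumptions for illustration; only the field names and the 'synthetic' / 'closed_loop' / 'preflight_skipped' values come from this PR.

```python
from typing import Optional

SCHEMA_VERSION = 5  # ASSUMPTION: serialized type (int vs. string) is not confirmed here


def build_gate_decision(
    v4_payload: dict,
    *,
    use_cl_primary: bool,
    cl_fields: Optional[dict] = None,
    preflight_skipped: bool = False,
) -> dict:
    """Hypothetical helper: extend a v4 payload without removing anything."""
    payload = dict(v4_payload)  # copy, so every v4 field survives untouched
    payload["schema_version"] = SCHEMA_VERSION
    # decision_signal is written on every path.
    payload["decision_signal"] = "closed_loop" if use_cl_primary else "synthetic"
    if use_cl_primary:
        # CL-specific fields appear only when the CL-primary path ran.
        payload.update(cl_fields or {})
    if preflight_skipped:
        payload["reason_synthetic"] = "preflight_skipped"
    return payload
```

A synthetic-mode run therefore differs from its v4 shape only by the version number and the decision_signal key.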
10 tests covering the deploy-gate branch on saturation band:
- weak_signal triggers evolved CL eval (force_run called post-GEPA)
- healthy/no_headroom fall through to synthetic
- --no-saturation-check records reason_synthetic in JSON
- all v5 fields present + correct types
- force_run failure writes aborted decision with diagnostics
- evolved task abstention writes cl_eval_incomplete (not regression)
- absolute_char_ceiling still enforced in CL-primary path
Mocks the synthetic dataset builder + closed-loop cache at the same seams as test_evolve_tool_saturation_preflight.py; calls evolve() directly (rather than via CliRunner) so each test can inspect gate_decision.json at a pinned output_dir.
…, pin evolved_FAILED.json
Code-review feedback on the CL-aware gate test suite:
1. Test 1's force_run assertion was substring-based on str(call_args), which silently misses regressions where force_run is called twice or with extra kwargs. Tightened to assert_called_once_with.
2. Added test_uniform_failure_band_falls_through_to_synthetic_gate pinning the spec edge-case (uniform_failure -> synthetic path). Without it, expanding use_cl_primary to include uniform_failure would silently change behavior without a test failing.
3. Test 9 (cl_eval_incomplete) now asserts evolved_FAILED.json is written, mirroring Test 8's assertion on the cl_eval_failed abort path. Production writes the file on both abort paths.
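The first review item can be demonstrated with the standard library's unittest.mock alone. The snippet below is a standalone illustration rather than the project's test; the mock name and the argument string are made up.

```python
from unittest.mock import MagicMock

force_run = MagicMock(name="force_run")

# Simulate the regression: the evolved description is evaluated twice.
force_run("evolved description")
force_run("evolved description")

# Weak check: call_args only reflects the most recent call, so a
# substring match on its string form still succeeds.
substring_check_passes = "evolved description" in str(force_run.call_args)

# Strict check: assert_called_once_with fails unless there was exactly
# one call and its arguments match exactly.
try:
    force_run.assert_called_once_with("evolved description")
    strict_check_passes = True
except AssertionError:
    strict_check_passes = False
```

Here the substring check passes while the strict check fails, which is the gap the review closed.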
Pin the v4 → v5 additivity contract: every v4 field must still exist
in v5 output, plus decision_signal (always) and the CL-specific fields
(when use_cl_primary fired). Future schema bumps should add a
TestSchemaV{N}Regression class following this pattern.
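One possible shape for such a regression class, sketched under assumptions: `V4_FIELDS` is a hypothetical subset (the real v4 field list is not shown in this PR), `schema_v5_violations` is an invented helper, and the sample payloads are fabricated for the example. Only the CL field names and the class name come from the PR.

```python
# ASSUMPTION: a small stand-in for the real v4 field list.
V4_FIELDS = {"schema_version", "decision", "reason"}

# CL-specific v5 fields as listed in this PR.
CL_FIELDS = {
    "baseline_closed_loop_per_example",
    "evolved_closed_loop_per_example",
    "evolved_closed_loop_errored_tasks",
    "cl_tasks_gained",
    "cl_required_gain",
    "synthetic_sanity_check",
    "evolved_cl_eval_cost_usd",
    "band_trigger_score",
    "validator_agent_model",
}


def schema_v5_violations(payload: dict, use_cl_primary: bool) -> list:
    """Return human-readable problems; an empty list means the contract holds."""
    present = set(payload)
    problems = [f"missing v4 field: {f}" for f in sorted(V4_FIELDS - present)]
    if "decision_signal" not in present:
        problems.append("missing decision_signal")
    if use_cl_primary:
        problems += [f"missing CL field: {f}" for f in sorted(CL_FIELDS - present)]
    return problems


class TestSchemaV5Regression:
    def test_synthetic_payload_is_additive(self):
        payload = {"schema_version": 5, "decision": "deploy", "reason": None,
                   "decision_signal": "synthetic"}
        assert schema_v5_violations(payload, use_cl_primary=False) == []

    def test_cl_primary_payload_has_cl_fields(self):
        payload = {"schema_version": 5, "decision": "deploy", "reason": None,
                   "decision_signal": "closed_loop", **{f: None for f in CL_FIELDS}}
        assert schema_v5_violations(payload, use_cl_primary=True) == []
```

Expressing the contract as set differences keeps a future TestSchemaV{N}Regression class down to swapping in new field sets.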
…e sites
Final-review feedback caught two seam leaks:
1. write_cost_ceiling_abort was hard-coding schema_version=4. If the cost ceiling trips during the CL-primary force_run call, the resulting gate_decision.json had v4 in a v5 directory. Made the schema_version a keyword arg (default 4 for skill-side callers that haven't bumped yet); tool-side passes 5.
2. The static_constraint_failure payload was bumped to v5 in Task 4 but never had decision_signal added. Every other v5 path has it. Set to 'synthetic' since static-fail fires before any CL eval.
3. Extended TestSchemaV5Regression with abort-path coverage so the above issues couldn't have slipped through. Three new tests pin schema_version and decision_signal on cl_eval_failed, cl_eval_incomplete, and static_constraint_failure payloads.
4. Renamed test_accepts_at_pr_68_calibration_point to test_accepts_at_24char_baseline_calibration_point per the project convention against exposing internal PR numbers in code.
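The first fix amounts to threading the version through as a keyword-only argument with a backward-compatible default. A rough sketch follows; only the function name, the `schema_version` keyword, and its default of 4 come from the commit message, while the other parameters and payload keys are hypothetical.

```python
import json
from pathlib import Path


def write_cost_ceiling_abort(
    output_dir: Path,
    spent_usd: float,      # hypothetical parameter
    ceiling_usd: float,    # hypothetical parameter
    *,
    schema_version: int = 4,  # default keeps not-yet-bumped callers on v4
) -> Path:
    """Write an aborted gate_decision.json and return its path."""
    payload = {
        "schema_version": schema_version,
        "decision": "aborted",
        "reason": "cost_ceiling",  # hypothetical reason string
        "spent_usd": spent_usd,
        "ceiling_usd": ceiling_usd,
    }
    path = Path(output_dir) / "gate_decision.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

A caller that has already moved to v5 would pass `schema_version=5` explicitly, so the abort record matches the rest of the files in its output directory.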
jramos added a commit that referenced this pull request on May 24, 2026:
* refactor(evolve_skill): preserve SaturationReport fields for deploy gate

Symmetric to evolve_tool's CL-aware gate work. Today only sat_report.holdout_per_example survives past the preflight call site; the CL-aware deploy gate work needs band + cl_per_example + the two trigger scores too. Bind four new locals; all default to None on the --no-saturation-check path so the deploy gate can branch safely. No behavior change; existing tests pass unchanged.

* feat(evolve_skill): branch deploy gate on saturation band

Symmetric to evolve_tool's CL-aware gate (already on main). When preflight reports weak_signal AND closed-loop is configured, run force_run(evolved_body) post-GEPA and gate on _check_cl_primary_gate.

Three abort paths writing schema-v5 gate_decision.json:
- cl_eval_failed: force_run raised
- cl_eval_incomplete: evolved CL task abstained (runner errored, distinguished from genuine task failure via TaskResult.abstained)
- cl_primary_gate reject: returned by the gate helper itself

_check_absolute_chars preserved in CL-primary path; all other bands fall through to today's synthetic path unchanged.

Key skill-specific differences from tool-side:
- force_run called with evolved_body (no YAML frontmatter) to match the cache key the preflight set up with baseline body.
- Abort path writes evolved_FAILED.md (matching existing skill reject convention), not evolved_FAILED.json.
- growth_pct uses evolved_full vs skill["raw"] (existing convention).

run_inputs literal hoisted to a local since the same dict now appears in 3 places (success + 2 abort paths).

* feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 (additive). New fields: decision_signal (always), baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model (only when use_cl_primary fired).

reason_synthetic: 'preflight_skipped' when --no-saturation-check was passed, so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

Existing v4 consumers see byte-identical output for synthetic-mode skill runs aside from the new decision_signal string. v4-specific skill fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) all preserved.

Lockstep with evolve_tool: both at schema v5 after this lands.

* test(evolve_skill): integration + schema regression for CL-aware gate

13 tests symmetric to tests/tools/test_evolve_tool_cl_aware_gate.py plus 3 skill-specific guards:
- force_run called with body, not full (frontmatter key bug guard)
- evolved_FAILED.md (not .json) on abort
- v4 skill payload fields (bap_*, eval_source, fitness_profile, proposer_mode, knee_point.band_roster) preserved in v5

Mocks via the existing test_evolve_skill_saturation_preflight pattern. Calls evolve() directly rather than via CliRunner (matches the pattern PR #69 settled on after its own CliRunner deviation).

* test(evolve_skill): pin CL-primary path in absolute-char-ceiling test

Code review flagged test 10 as misnaming/misasserting the path under test (claiming static gate at evolve_skill.py:1034 fires first with decision_signal='synthetic'). Empirical verification shows validate_static at line 1034 only runs size_limit/non_empty/skill_structure; the absolute_char_ceiling check lives at line 1271 inside the use_cl_primary branch and runs AFTER force_run. Rejection carries decision_signal='closed_loop'.

Keep the original (accurate) test name, expand the docstring to spell out the verified production flow, and add two assertions that pin the real CL-primary path: decision_signal == 'closed_loop' and force_run.assert_called_once_with(body).

Also fix a minor miscount in the module docstring frontmatter math (was '43 chars', actually 42).

* test: weakened systematic-debugging fixture for CL-aware-gate smoke

Lands in weak_signal band against evolution/validation/suites/systematic_debugging.jsonl at seed 42 with openai/gpt-5-mini validator. Mirrors tests/fixtures/tool_manifests/weakened_write_file_manifest.json from the tool-side work; checked in so the post-merge manual smoke for the CL-aware skill deploy gate is reproducible.

The fixture is a misdirection variant ("Python Bug Diagnostician") that looks plausible to the LLM judge (synthetic holdout ~0.95) while instructing the agent to write a diagnostic report instead of editing the file, so the planted-bug closed-loop suite scores in the 0.15-0.95 band rather than saturating.

Empirical probe results (seed 42, eval-model gpt-4.1-mini, closed-loop-agent-model gpt-5-mini, --eval-dataset-size 50):
- Probe 1: Band=weak_signal, Holdout=0.957, CL=0.800 (4/5)
- Probe 2: Band=weak_signal, Holdout=0.957, CL=0.200 (1/5)
- Probe 3: Band=weak_signal, Holdout=0.969, CL=0.600 (3/5)

CL pass-rate varies run-to-run because the 5-task suite has 0.2 granularity and gpt-5-mini's adherence to the "don't edit" framing is non-deterministic, but every probe landed in the weak_signal band, which is what the manual smoke needs to exercise the CL-primary gate path.

Earlier probes confirmed the binary-tier saturation trap from prior tool-side work:
- Passive weakenings (v1: stripped methodology, v3: one-sentence skill) both landed in no_headroom with gpt-5-mini (CL=1.0) and uniform_failure with gpt-5-nano (CL=0.0). The validator model capacity, not the skill content, drives the closed-loop signal on these planted bugs.
- Only active misdirection (this fixture) creates the synthetic / closed-loop disagreement the weak_signal band requires.
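The skill-side commits above stress that force_run must receive the body without YAML frontmatter so it matches the cache key set up at preflight. A minimal sketch of that split follows, assuming the conventional `---` delimited frontmatter block at the top of the file; `split_frontmatter` is a hypothetical helper, not the project's parser.

```python
from typing import Tuple


def split_frontmatter(raw: str) -> Tuple[str, str]:
    """Split a skill file into (frontmatter, body).

    ASSUMPTION: frontmatter is a leading block delimited by '---' lines.
    Files without such a block are returned as body only.
    """
    if not raw.startswith("---\n"):
        return "", raw
    # Search from index 3 so an empty frontmatter ('---\n---\n') is handled.
    end = raw.find("\n---\n", 3)
    if end == -1:
        return "", raw
    cut = end + len("\n---\n")
    return raw[:cut], raw[cut:]


raw_skill = "---\nname: systematic-debugging\n---\nAlways reproduce the bug first.\n"
frontmatter, body = split_frontmatter(raw_skill)
# Pass `body` to the closed-loop evaluation; using `raw_skill` would
# produce a different cache key than the one the preflight created.
```

Because the two parts concatenate back to the original text, growth can still be computed on the full file while the evaluation sees only the body.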
Summary

Tool-side deploy gate now reads the closed-loop signal when the saturation pre-flight classifies the run as weak_signal band. Closes the deploy-gap finding from the prior retro-validation: GEPA-accepted proposals (12 across 3 seeds, 12x mechanism win for Path E) were uniformly rejected because the synthetic holdout judge was saturated at 1.000 and the growth_quality_gate required structurally impossible synthetic improvement.

Behavior

- weak_signal band: runs closed_loop_cache.force_run(evolved_description) post-GEPA, gates the deploy decision on the new _check_cl_primary_gate helper (CL gain ≥ growth-scaled required_gain) plus the preserved _check_absolute_char_ceiling (wallpaper protection). The prior smoke's case (+2 tasks gained, +121% growth from a 24-char baseline) lands at required=2, gain=2 → just barely passes.
- Other bands (healthy, no_headroom, uniform_failure): today's synthetic-only gate path runs unchanged.
- --no-saturation-check: falls through to synthetic gate; gate_decision.json records reason_synthetic: "preflight_skipped" for diagnosis.

Failure modes (loud + diagnostic, not silent)

- force_run raises → writes gate_decision.json with decision: "aborted", reason: "cl_eval_failed", exception string, evolved description path. User keeps diagnostics after $5–20 GEPA spend.
- One or more evolved CL tasks abstain → writes decision: "aborted", reason: "cl_eval_incomplete" with the errored task IDs. Uses the existing TaskResult.abstained field; does NOT conflate infrastructure flakes with regression.

Schema v5

Additive over v4. decision_signal is written on every path. The remaining new fields surface only when the CL-primary path ran: baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model. Synthetic-mode runs see byte-identical v4 fields plus decision_signal: "synthetic".

Commit sequence

1. feat(quality_gate): add _check_cl_primary_gate helper (pure decision-rule kernel + 11 unit tests)
2. refactor(evolve_tool): preserve SaturationReport fields for deploy gate (plumbing, no behavior change)
3. feat(evolve_tool): branch deploy gate on saturation band (main behavior change)
4. fix(evolve_tool): narrow cl_constraint type, surface saved-variant path (review fixes)
5. feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields (payload extension)
6. test(evolve_tool): integration tests for CL-aware deploy gate (10 integration tests)
7. test(evolve_tool): tighten Test 1 assertion, add uniform_failure test, pin evolved_FAILED.json (review fixes)
8. test(evolve_tool): schema v5 regression tests (pin v4 → v5 additivity)

Test plan

- uv run pytest -q: 1114 passed locally (+24 from this PR: 11 unit + 11 integration + 2 schema regression)

Scope notes

- schema_version is "5" across all four gate_decision.json write sites in this file (success/reject, static-constraint-failure, cl_eval_failed, cl_eval_incomplete).
- --closed-loop-mode feedback vs trainset treated identically (the deploy gate's branch logic is mode-agnostic).