feat: deploy-gate CL-awareness by jramos · Pull Request #69 · jramos/agent-self-evolution

jramos · 2026-05-23T15:15:18Z

Summary

The tool-side deploy gate now reads the closed-loop (CL) signal when the saturation pre-flight classifies the run into the weak_signal band. This closes the deploy-gap finding from the prior retro-validation: GEPA-accepted proposals (12 across 3 seeds, a 12x mechanism win for Path E) were uniformly rejected because the synthetic holdout judge was saturated at 1.000, so the growth_quality_gate demanded a synthetic improvement that was structurally impossible.

Behavior

weak_signal band: runs closed_loop_cache.force_run(evolved_description) post-GEPA, gates the deploy decision on the new _check_cl_primary_gate helper (CL gain ≥ growth-scaled required_gain) plus the preserved _check_absolute_char_ceiling (wallpaper protection). The prior smoke's case (+2 tasks gained, +121% growth from a 24-char baseline) lands at required=2, gain=2 → just barely passes.
All other bands (healthy, no_headroom, uniform_failure): today's synthetic-only gate path runs unchanged.
--no-saturation-check: falls through to synthetic gate; gate_decision.json records reason_synthetic: "preflight_skipped" for diagnosis.
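
The band branch above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `select_decision_signal` and its parameters are hypothetical names. The band names, the `decision_signal` values, and `reason_synthetic: "preflight_skipped"` are taken from the description and commit messages.

```python
from typing import Optional


def select_decision_signal(band: Optional[str], closed_loop_configured: bool) -> dict:
    """Pick which signal gates the deploy decision (hypothetical helper).

    band is None when the saturation pre-flight was skipped
    (--no-saturation-check).
    """
    if band == "weak_signal" and closed_loop_configured:
        # CL-primary path: the gate is decided on closed-loop gain.
        return {"decision_signal": "closed_loop"}

    # healthy, no_headroom, uniform_failure, or no preflight at all.
    payload = {"decision_signal": "synthetic"}
    if band is None:
        # Lets consumers tell "preflight didn't run" from "no weak_signal".
        payload["reason_synthetic"] = "preflight_skipped"
    return payload
```

Keeping the decision in one place like this makes it harder for a later change (for example, adding uniform_failure to the CL-primary set) to slip in without a test noticing.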

Failure modes (loud + diagnostic, not silent)

force_run raises → writes gate_decision.json with decision: "aborted", reason: "cl_eval_failed", exception string, evolved description path. User keeps diagnostics after $5–20 GEPA spend.
Evolved CL task abstained (runner errored) → writes decision: "aborted", reason: "cl_eval_incomplete" with the errored task IDs. Uses the existing TaskResult.abstained field; does NOT conflate infrastructure flakes with regression.
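
A minimal sketch of the cl_eval_incomplete abort, assuming a simplified `TaskResult` with only the fields needed here. `write_incomplete_abort` is a hypothetical name, and the real payload carries more fields than shown.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskResult:
    # Stand-in for the project's TaskResult; only the fields used below.
    task_id: str
    passed: bool
    abstained: bool = False


def write_incomplete_abort(results: list[TaskResult], output_dir: Path) -> bool:
    """Write an aborted gate_decision.json if any evolved CL task abstained.

    Returns True when an abort payload was written, False otherwise.
    """
    errored = [r.task_id for r in results if r.abstained]
    if not errored:
        return False
    payload = {
        "schema_version": "5",
        "decision": "aborted",
        "reason": "cl_eval_incomplete",
        # An abstained task is an infrastructure flake, not a regression.
        "evolved_closed_loop_errored_tasks": errored,
    }
    (output_dir / "gate_decision.json").write_text(json.dumps(payload, indent=2))
    return True
```

A task that ran and failed (`passed=False`, `abstained=False`) does not trigger the abort; it counts against CL gain in the normal gate instead.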

Schema v5

Additive over v4. New fields surface only when the CL-primary path ran:
decision_signal, baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model. Synthetic-mode runs see byte-identical v4 fields plus decision_signal: "synthetic".

Commit sequence

feat(quality_gate): add _check_cl_primary_gate helper — pure decision-rule kernel + 11 unit tests
refactor(evolve_tool): preserve SaturationReport fields for deploy gate — plumbing, no behavior change
feat(evolve_tool): branch deploy gate on saturation band — main behavior change
fix(evolve_tool): narrow cl_constraint type, surface saved-variant path — review fixes
feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields — payload extension
test(evolve_tool): integration tests for CL-aware deploy gate — 10 integration tests
test(evolve_tool): tighten Test 1 assertion, add uniform_failure test, pin evolved_FAILED.json — review fixes
test(evolve_tool): schema v5 regression tests — pin v4 → v5 additivity

Test plan

uv run pytest -q — 1114 passed locally (+24 from this PR: 11 unit + 11 integration + 2 schema regression)
CI green across 4 Python versions
Optional manual smoke (~$3–5) confirming the prior smoke's case (weakened fixture + ambiguous suite, seed 42, mb=8) now deploys instead of being rejected — adds the empirical end-to-end evidence

Scope notes

Tool-side only. Skill-side equivalent is a deliberate follow-up.
No new CLI flag. Trigger is automatic on the existing band classification.
Schema bumps to "5" across all four gate_decision.json write sites in this file (success/reject, static-constraint-failure, cl_eval_failed, cl_eval_incomplete).
--closed-loop-mode feedback vs trainset treated identically (the deploy gate's branch logic is mode-agnostic).

Pure function returning a ConstraintResult for the closed-loop-primary deploy decision. Used when saturation pre-flight reports weak_signal band. Required gain scales with description growth, mirroring the synthetic gate's free_threshold + slope shape; synthetic regression tolerance of 0.05 protects against catastrophic judge collapse. 11 unit tests cover the decision-rule math including the PR #68 calibration point (+2 gain on +121% growth -> required 2, just passes) and wallpaper protection (+1 gain on +400% growth -> required 4, fails).
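
For intuition, the decision rule can be sketched like this. The constant names match those mentioned for quality_gate.py, but the slope and free-threshold values below are assumptions picked only so the two calibration points above come out right; the 0.05 tolerance is from this message. The `ConstraintResult` shape and the rounding rule are also assumptions.

```python
import math
from dataclasses import dataclass

# ASSUMED values (not read from quality_gate.py), chosen to reproduce
# "+121% growth -> required 2" and "+400% growth -> required 4".
CL_PRIMARY_GROWTH_FREE_THRESHOLD = 0.0  # growth fraction exempt from scaling
CL_PRIMARY_GROWTH_SLOPE = 1.0           # tasks required per 100% growth
CL_PRIMARY_SYNTH_TOLERANCE = 0.05       # stated in the commit message


@dataclass
class ConstraintResult:
    # Stand-in for the project's ConstraintResult.
    passed: bool
    name: str
    message: str


def required_gain(growth: float) -> int:
    """Growth-scaled CL gain; growth is a fraction (1.21 means +121%)."""
    scaled = max(0.0, growth - CL_PRIMARY_GROWTH_FREE_THRESHOLD) * CL_PRIMARY_GROWTH_SLOPE
    return max(1, math.ceil(scaled))


def check_cl_primary_gate(
    cl_gain: int, growth: float, synth_baseline: float, synth_evolved: float
) -> ConstraintResult:
    """Pure decision rule: no I/O, no globals mutated."""
    need = required_gain(growth)
    if synth_evolved < synth_baseline - CL_PRIMARY_SYNTH_TOLERANCE:
        return ConstraintResult(False, "cl_primary_gate", "synthetic score regressed past tolerance")
    if cl_gain < need:
        return ConstraintResult(False, "cl_primary_gate", f"gain {cl_gain} < required {need}")
    return ConstraintResult(True, "cl_primary_gate", f"gain {cl_gain} >= required {need}")
```

Under these assumed constants, `check_cl_primary_gate(2, 1.21, 1.0, 1.0)` passes and `check_cl_primary_gate(1, 4.0, 1.0, 1.0)` fails, matching the two cases described.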

Today only sat_report.holdout_per_example survives past the preflight call site; subsequent CL-aware gate work needs the band classification and baseline CL per-task scores too. Bind four new locals next to the existing cache: band, cl_per_example, holdout_score, cl_score. All default to None on the --no-saturation-check path so the deploy gate can branch safely. No behavior change; existing tests pass unchanged.

When preflight reports weak_signal AND closed-loop is configured, run a one-shot force_run on the evolved description and gate the deploy decision on closed-loop signal via _check_cl_primary_gate.

Three abort paths are written to gate_decision.json with diagnostic payloads (schema v5):
- cl_eval_failed: force_run raised an exception
- cl_eval_incomplete: one or more evolved CL tasks abstained (runner errored; distinguished from genuine task failure via the existing TaskResult.abstained field)
- cl_primary_gate reject: returned by the gate helper itself

_check_absolute_char_ceiling is preserved in the CL-primary path; wallpaper protection is orthogonal to which signal we gate on. All other bands (healthy / no_headroom / uniform_failure / no preflight) fall through to the existing synthetic path unchanged.

Code review found two minor issues in the CL-primary branch added by ae1fea6:
1. cl_constraint: Optional[ConstraintResult] flows into a list[ConstraintResult] without type narrowing at the post-branch growth_constraints assignment. Added an assert so the type checker sees the correlation between the two 'if use_cl_primary:' blocks.
2. Both new abort paths wrote evolved_FAILED.json but skipped the 'Saved failed variant to {path}' console line that existing abort paths print. Operators triaging a flake need to know the file was saved and where; added the print to both new paths.

No behavior change for any test.
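
The narrowing issue in item 1 looks roughly like the sketch below. `collect_growth_constraints` and the simplified `ConstraintResult` are hypothetical; only the pattern (two correlated `if use_cl_primary:` blocks plus an assert) comes from this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConstraintResult:
    # Simplified stand-in for the project's ConstraintResult.
    passed: bool


def collect_growth_constraints(
    use_cl_primary: bool, ceiling: ConstraintResult
) -> list[ConstraintResult]:
    cl_constraint: Optional[ConstraintResult] = None
    if use_cl_primary:
        # Placeholder for the real gate call.
        cl_constraint = ConstraintResult(passed=True)

    # ... intervening work between the two branches ...

    if use_cl_primary:
        # A type checker cannot correlate the two branches, so without this
        # assert the list below would contain Optional[ConstraintResult].
        assert cl_constraint is not None
        return [cl_constraint, ceiling]
    return [ceiling]
```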

Schema bumps from v4 to v5 across all four gate_decision write sites (static-fail, cl_eval_failed, cl_eval_incomplete, success/reject). The bump is additive. New fields are present only when use_cl_primary == True: decision_signal, baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model. When preflight was skipped (--no-saturation-check), records reason_synthetic: 'preflight_skipped' so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.' cl_required_gain and synthetic_sanity_check reuse the CL_PRIMARY_GROWTH_SLOPE / CL_PRIMARY_GROWTH_FREE_THRESHOLD / CL_PRIMARY_SYNTH_TOLERANCE constants from quality_gate.py so the gate-decision payload can't drift from the actual gate logic. Existing v4 consumers see byte-identical output for synthetic-mode runs except the new decision_signal: 'synthetic' string.

10 tests covering the deploy-gate branch on saturation band:
- weak_signal triggers evolved CL eval (force_run called post-GEPA)
- healthy/no_headroom fall through to synthetic
- --no-saturation-check records reason_synthetic in JSON
- all v5 fields present + correct types
- force_run failure writes aborted decision with diagnostics
- evolved task abstention writes cl_eval_incomplete (not regression)
- absolute_char_ceiling still enforced in CL-primary path

Mocks the synthetic dataset builder + closed-loop cache at the same seams as test_evolve_tool_saturation_preflight.py; calls evolve() directly (rather than via CliRunner) so each test can inspect gate_decision.json at a pinned output_dir.

…, pin evolved_FAILED.json

Code-review feedback on the CL-aware gate test suite:
1. Test 1's force_run assertion was substring-based on str(call_args), which silently misses regressions where force_run is called twice or with extra kwargs. Tightened to assert_called_once_with.
2. Added test_uniform_failure_band_falls_through_to_synthetic_gate pinning the spec edge-case (uniform_failure -> synthetic path). Without it, expanding use_cl_primary to include uniform_failure would silently change behavior without a test failing.
3. Test 9 (cl_eval_incomplete) now asserts evolved_FAILED.json is written, mirroring Test 8's assertion on the cl_eval_failed abort path. Production writes the file on both abort paths.
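
A small self-contained demonstration of why the substring check in item 1 was too weak. The mock and the argument string are illustrative, not the test's actual fixtures.

```python
from unittest.mock import MagicMock

cache = MagicMock()
cache.force_run("evolved description")
cache.force_run("evolved description")  # accidental second call (the regression)

# Substring check: call_args only reflects the most recent call, so a
# duplicate call goes unnoticed.
substring_check_passes = "evolved description" in str(cache.force_run.call_args)

# Strict check: fails because force_run was called twice.
try:
    cache.force_run.assert_called_once_with("evolved description")
    strict_check_passes = True
except AssertionError:
    strict_check_passes = False
```

Here the substring check passes while the strict one fails, which is the gap the tightened assertion closes.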

Pin the v4 → v5 additivity contract: every v4 field must still exist in v5 output, plus decision_signal (always) and the CL-specific fields (when use_cl_primary fired). Future schema bumps should add a TestSchemaV{N}Regression class following this pattern.
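
The additivity contract can be checked with a helper along these lines. `missing_v5_fields` is a hypothetical name, and `V4_FIELDS` is a small illustrative subset rather than the real v4 field list; the CL field names are the ones listed for schema v5.

```python
# Illustrative subset only; the real v4 payload has more fields.
V4_FIELDS = {"schema_version", "decision", "reason"}

CL_FIELDS = {
    "baseline_closed_loop_per_example",
    "evolved_closed_loop_per_example",
    "evolved_closed_loop_errored_tasks",
    "cl_tasks_gained",
    "cl_required_gain",
    "synthetic_sanity_check",
    "evolved_cl_eval_cost_usd",
    "band_trigger_score",
    "validator_agent_model",
}


def missing_v5_fields(payload: dict, use_cl_primary: bool) -> set[str]:
    """Return required v5 field names that are absent from payload."""
    required = set(V4_FIELDS) | {"decision_signal"}  # decision_signal: always
    if use_cl_primary:
        required |= CL_FIELDS  # CL fields: only when the CL-primary path ran
    return required - set(payload)
```

A regression test then asserts that `missing_v5_fields(payload, use_cl_primary)` is empty for each write site.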

…e sites

Final-review feedback caught two seam leaks:
1. write_cost_ceiling_abort was hard-coding schema_version=4. If the cost ceiling trips during the CL-primary force_run call, the resulting gate_decision.json had v4 in a v5 directory. Made the schema_version a keyword arg (default 4 for skill-side callers that haven't bumped yet); tool-side passes 5.
2. The static_constraint_failure payload was bumped to v5 in Task 4 but never had decision_signal added. Every other v5 path has it. Set to 'synthetic' since static-fail fires before any CL eval.
3. Extended TestSchemaV5Regression with abort-path coverage so the above issues couldn't have slipped through. Three new tests pin schema_version and decision_signal on cl_eval_failed, cl_eval_incomplete, and static_constraint_failure payloads.
4. Renamed test_accepts_at_pr_68_calibration_point to test_accepts_at_24char_baseline_calibration_point per the project convention against exposing internal PR numbers in code.
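
The fix in item 1 follows a common pattern: a keyword-only argument with a backward-compatible default. The sketch below uses a hypothetical `build_cost_ceiling_abort` that returns the payload instead of writing it; the real write_cost_ceiling_abort signature is not shown in this PR, and the reason string is a placeholder.

```python
def build_cost_ceiling_abort(*, schema_version: int = 4) -> dict:
    """Build a cost-ceiling abort payload (hypothetical helper).

    The default of 4 keeps callers that have not bumped yet unchanged;
    v5 callers opt in explicitly. An int is used here following the
    commit message; the on-disk representation is an assumption.
    """
    return {
        "schema_version": schema_version,
        "decision": "aborted",
        "reason": "cost_ceiling",  # placeholder value, not from the source
    }


legacy_payload = build_cost_ceiling_abort()
tool_side_payload = build_cost_ceiling_abort(schema_version=5)
```

Making the argument keyword-only means a caller cannot pass the version positionally by accident, so each v5 call site states the version in plain sight.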

)

* refactor(evolve_skill): preserve SaturationReport fields for deploy gate

Symmetric to evolve_tool's CL-aware gate work. Today only sat_report.holdout_per_example survives past the preflight call site; the CL-aware deploy gate work needs band + cl_per_example + the two trigger scores too. Bind four new locals; all default to None on the --no-saturation-check path so the deploy gate can branch safely. No behavior change; existing tests pass unchanged.

* feat(evolve_skill): branch deploy gate on saturation band

Symmetric to evolve_tool's CL-aware gate (already on main). When preflight reports weak_signal AND closed-loop is configured, run force_run(evolved_body) post-GEPA and gate on _check_cl_primary_gate.

Three abort paths writing schema-v5 gate_decision.json:
- cl_eval_failed: force_run raised
- cl_eval_incomplete: evolved CL task abstained (runner errored; distinguished from genuine task failure via TaskResult.abstained)
- cl_primary_gate reject: returned by the gate helper itself

_check_absolute_chars preserved in CL-primary path; all other bands fall through to today's synthetic path unchanged.

Key skill-specific differences from tool-side:
- force_run called with evolved_body (no YAML frontmatter) to match the cache key the preflight set up with baseline body.
- Abort path writes evolved_FAILED.md (matching existing skill reject convention), not evolved_FAILED.json.
- growth_pct uses evolved_full vs skill["raw"] (existing convention).

run_inputs literal hoisted to a local since the same dict now appears in 3 places (success + 2 abort paths).

* feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 (additive). New fields: decision_signal (always), baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model (only when use_cl_primary fired).

reason_synthetic: 'preflight_skipped' when --no-saturation-check was passed, so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

Existing v4 consumers see byte-identical output for synthetic-mode skill runs aside from the new decision_signal string. v4-specific skill fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) all preserved.

Lockstep with evolve_tool: both at schema v5 after this lands.

* test(evolve_skill): integration + schema regression for CL-aware gate

13 tests symmetric to tests/tools/test_evolve_tool_cl_aware_gate.py plus 3 skill-specific guards:
- force_run called with body, not full (frontmatter key bug guard)
- evolved_FAILED.md (not .json) on abort
- v4 skill payload fields (bap_*, eval_source, fitness_profile, proposer_mode, knee_point.band_roster) preserved in v5

Mocks via the existing test_evolve_skill_saturation_preflight pattern. Calls evolve() directly rather than via CliRunner (matches the pattern PR #69 settled on after its own CliRunner deviation).

* test(evolve_skill): pin CL-primary path in absolute-char-ceiling test

Code review flagged test 10 as misnaming/misasserting the path under test (claiming static gate at evolve_skill.py:1034 fires first with decision_signal='synthetic'). Empirical verification shows validate_static at line 1034 only runs size_limit/non_empty/skill_structure; the absolute_char_ceiling check lives at line 1271 inside the use_cl_primary branch and runs AFTER force_run. Rejection carries decision_signal='closed_loop'.

Keep the original (accurate) test name, expand the docstring to spell out the verified production flow, and add two assertions that pin the real CL-primary path: decision_signal == 'closed_loop' and force_run.assert_called_once_with(body).

Also fix a minor miscount in the module docstring frontmatter math (was '43 chars', actually 42).

* test: weakened systematic-debugging fixture for CL-aware-gate smoke

Lands in weak_signal band against evolution/validation/suites/systematic_debugging.jsonl at seed 42 with openai/gpt-5-mini validator. Mirrors tests/fixtures/tool_manifests/weakened_write_file_manifest.json from the tool-side work; checked in so the post-merge manual smoke for the CL-aware skill deploy gate is reproducible.

The fixture is a misdirection variant ("Python Bug Diagnostician") that looks plausible to the LLM judge (synthetic holdout ~0.95) while instructing the agent to write a diagnostic report instead of editing the file, so the planted-bug closed-loop suite scores in the 0.15-0.95 band rather than saturating.

Empirical probe results (seed 42, eval-model gpt-4.1-mini, closed-loop-agent-model gpt-5-mini, --eval-dataset-size 50):
Probe 1: Band=weak_signal, Holdout=0.957, CL=0.800 (4/5)
Probe 2: Band=weak_signal, Holdout=0.957, CL=0.200 (1/5)
Probe 3: Band=weak_signal, Holdout=0.969, CL=0.600 (3/5)

CL pass-rate varies run-to-run because the 5-task suite has 0.2 granularity and gpt-5-mini's adherence to the "don't edit" framing is non-deterministic, but every probe landed in the weak_signal band, which is what the manual smoke needs to exercise the CL-primary gate path.

Earlier probes confirmed the binary-tier saturation trap from prior tool-side work:
- Passive weakenings (v1: stripped methodology, v3: one-sentence skill) both landed in no_headroom with gpt-5-mini (CL=1.0) and uniform_failure with gpt-5-nano (CL=0.0). The validator model capacity, not the skill content, drives the closed-loop signal on these planted bugs.
- Only active misdirection (this fixture) creates the synthetic / closed-loop disagreement the weak_signal band requires.

jramos added 9 commits May 23, 2026 08:06

jramos changed the title ~~feat: deploy-gate CL-awareness (closes the PR #68 deploy-gap finding)~~ feat: deploy-gate CL-awareness May 23, 2026

jramos merged commit 2ea480e into main May 23, 2026
7 of 8 checks passed

jramos deleted the feat/deploy-gate-cl-awareness branch May 23, 2026 19:24

jramos mentioned this pull request May 23, 2026

feat: skill-side deploy-gate CL-awareness (lockstep with tool side) #70

Merged

3 tasks

jramos merged 9 commits into main from feat/deploy-gate-cl-awareness
