feat: deploy-gate CL-awareness #69

Merged
jramos merged 9 commits into main from feat/deploy-gate-cl-awareness
May 23, 2026

@jramos jramos commented May 23, 2026

Summary

The tool-side deploy gate now reads the closed-loop (CL) signal when the saturation pre-flight classifies the run in the weak_signal band. This closes the deploy-gap finding from the prior retro-validation: GEPA-accepted proposals (12 across 3 seeds, a 12x mechanism win for Path E) were uniformly rejected because the synthetic holdout judge was saturated at 1.000, so the growth_quality_gate demanded a synthetic improvement that was structurally impossible.

Behavior

  • weak_signal band: runs closed_loop_cache.force_run(evolved_description) post-GEPA, gates the deploy decision on the new _check_cl_primary_gate helper (CL gain ≥ growth-scaled required_gain) plus the preserved _check_absolute_char_ceiling (wallpaper protection). The prior smoke's case (+2 tasks gained, +121% growth from a 24-char baseline) lands at required=2, gain=2 → just barely passes.
  • All other bands (healthy, no_headroom, uniform_failure): today's synthetic-only gate path runs unchanged.
  • --no-saturation-check: falls through to synthetic gate; gate_decision.json records reason_synthetic: "preflight_skipped" for diagnosis.
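The band routing listed above can be sketched as a small pure function. This is a minimal illustration, not the code in evolve_tool: `select_gate_path` and its return shape are hypothetical, while the band names, the `decision_signal` values, and `reason_synthetic: "preflight_skipped"` are taken from this PR. It assumes `band` is `None` when `--no-saturation-check` is passed.

```python
from typing import Optional


def select_gate_path(band: Optional[str], closed_loop_configured: bool) -> dict:
    """Pick the signal the deploy gate decides on (hypothetical helper)."""
    if band == "weak_signal" and closed_loop_configured:
        # Saturated synthetic judge: decide on closed-loop task gains instead.
        return {"decision_signal": "closed_loop"}

    result = {"decision_signal": "synthetic"}
    if band is None:
        # Preflight never ran, so record why the synthetic path was taken.
        result["reason_synthetic"] = "preflight_skipped"
    return result
```

Keeping the routing separate from the evaluation call makes the "all other bands are unchanged" claim easy to pin with a table of band inputs.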

Failure modes (loud + diagnostic, not silent)

  • force_run raises → writes gate_decision.json with decision: "aborted", reason: "cl_eval_failed", exception string, evolved description path. User keeps diagnostics after $5–20 GEPA spend.
  • Evolved CL task abstained (runner errored) → writes decision: "aborted", reason: "cl_eval_incomplete" with the errored task IDs. Uses the existing TaskResult.abstained field; does NOT conflate infrastructure flakes with regression.
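A rough sketch of the two abort paths, under stated assumptions: `evaluate_or_abort`, `_write_abort`, the `exception` key, and the stand-in `TaskResult` dataclass are hypothetical, and the payload here is far smaller than the real one. Only the `decision` / `reason` values, the `abstained` field, and `evolved_closed_loop_errored_tasks` come from this PR.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class TaskResult:
    """Stand-in for the project's TaskResult; only the fields used here."""
    task_id: str
    passed: bool
    abstained: bool = False


def _write_abort(output_dir: Path, reason: str, **diagnostics) -> None:
    # Always leave a diagnostic file behind so the GEPA spend is not wasted.
    payload = {"schema_version": "5", "decision": "aborted", "reason": reason}
    payload.update(diagnostics)
    (output_dir / "gate_decision.json").write_text(json.dumps(payload, indent=2))


def evaluate_or_abort(
    force_run: Callable[[str], list],
    evolved_description: str,
    output_dir: Path,
) -> Optional[list]:
    """Return CL results, or None after writing an aborted gate decision."""
    try:
        results = force_run(evolved_description)
    except Exception as exc:  # any runner failure is an abort, not a reject
        _write_abort(output_dir, "cl_eval_failed", exception=str(exc))
        return None

    errored = [r.task_id for r in results if r.abstained]
    if errored:
        # Abstentions are infrastructure flakes, so do not score them as regressions.
        _write_abort(
            output_dir, "cl_eval_incomplete", evolved_closed_loop_errored_tasks=errored
        )
        return None
    return results
```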

Schema v5

Additive over v4. decision_signal is always present; the remaining new fields surface only when the CL-primary path ran:
baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model. Synthetic-mode runs see byte-identical v4 fields plus decision_signal: "synthetic".
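For downstream consumers, an additive bump means a reader can handle v4 and v5 files with one code path. The sketch below is an assumption about how such a reader might look: `summarize_gate_decision` is hypothetical, it assumes a missing `decision_signal` (a v4 file) can be treated as synthetic, and it assumes `cl_tasks_gained` and `cl_required_gain` are plain numbers.

```python
from typing import Any


def summarize_gate_decision(payload: dict) -> str:
    """Describe which signal decided a run, tolerating v4 payloads."""
    # v4 files predate decision_signal; assume they were synthetic-only.
    signal: Any = payload.get("decision_signal", "synthetic")
    if signal == "closed_loop":
        gained = payload["cl_tasks_gained"]
        required = payload["cl_required_gain"]
        return f"closed_loop: gained {gained}, required {required}"
    return "synthetic"
```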

Commit sequence

  1. feat(quality_gate): add _check_cl_primary_gate helper — pure decision-rule kernel + 11 unit tests
  2. refactor(evolve_tool): preserve SaturationReport fields for deploy gate — plumbing, no behavior change
  3. feat(evolve_tool): branch deploy gate on saturation band — main behavior change
  4. fix(evolve_tool): narrow cl_constraint type, surface saved-variant path — review fixes
  5. feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields — payload extension
  6. test(evolve_tool): integration tests for CL-aware deploy gate — 10 integration tests
  7. test(evolve_tool): tighten Test 1 assertion, add uniform_failure test, pin evolved_FAILED.json — review fixes
  8. test(evolve_tool): schema v5 regression tests — pin v4 → v5 additivity

Test plan

  • uv run pytest -q — 1114 passed locally (+24 from this PR: 11 unit + 11 integration + 2 schema regression)
  • CI green across 4 Python versions
  • Optional manual smoke (~$3–5) confirming the prior smoke's case (weakened fixture + ambiguous suite, seed 42, mb=8) now deploys instead of being rejected — adds the empirical end-to-end evidence

Scope notes

  • Tool-side only. Skill-side equivalent is a deliberate follow-up.
  • No new CLI flag. Trigger is automatic on the existing band classification.
  • Schema bumps to "5" across all four gate_decision.json write sites in this file (success/reject, static-constraint-failure, cl_eval_failed, cl_eval_incomplete).
  • --closed-loop-mode feedback vs trainset treated identically (the deploy gate's branch logic is mode-agnostic).

jramos added 9 commits May 23, 2026 08:06

feat(quality_gate): add _check_cl_primary_gate helper

Pure function returning a ConstraintResult for the closed-loop-primary
deploy decision. Used when saturation pre-flight reports weak_signal
band. Required gain scales with description growth, mirroring the
synthetic gate's free_threshold + slope shape; synthetic regression
tolerance of 0.05 protects against catastrophic judge collapse.

11 unit tests cover the decision-rule math including the PR #68
calibration point (+2 gain on +121% growth -> required 2, just passes)
and wallpaper protection (+1 gain on +400% growth -> required 4, fails).
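The decision rule can be illustrated with a free-threshold-plus-slope shape. This is a sketch under loud assumptions, not the helper itself: `FREE_THRESHOLD = 0.20` and `SLOPE = 1.0` are hypothetical values chosen only because they reproduce the two calibration points quoted above, and rounding up with a minimum of one task is likewise assumed. Only the 0.05 synthetic tolerance is stated in the commit message. Growth is expressed as a fraction (1.21 means +121%).

```python
import math

FREE_THRESHOLD = 0.20   # hypothetical: growth allowed before the requirement scales
SLOPE = 1.0             # hypothetical: extra tasks required per 100% of growth
SYNTH_TOLERANCE = 0.05  # from the commit message: tolerated synthetic regression


def required_gain(growth: float) -> int:
    """Closed-loop tasks that must be gained for a given description growth."""
    excess = max(0.0, growth - FREE_THRESHOLD)
    # Assumed: round up, and always require at least one gained task.
    return max(1, math.ceil(excess * SLOPE))


def cl_primary_gate_passes(
    tasks_gained: int, growth: float, baseline_synth: float, evolved_synth: float
) -> bool:
    """True when CL gain covers the growth and the synthetic score did not collapse."""
    synth_ok = evolved_synth >= baseline_synth - SYNTH_TOLERANCE
    return synth_ok and tasks_gained >= required_gain(growth)
```

With these placeholder constants, +121% growth requires 2 tasks and +400% growth requires 4, matching the accept and reject cases in the unit tests described above.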

refactor(evolve_tool): preserve SaturationReport fields for deploy gate

Today only sat_report.holdout_per_example survives past the preflight
call site; subsequent CL-aware gate work needs the band classification
and baseline CL per-task scores too. Bind four new locals next to the
existing cache: band, cl_per_example, holdout_score, cl_score. All
default to None on the --no-saturation-check path so the deploy gate
can branch safely.

No behavior change; existing tests pass unchanged.

feat(evolve_tool): branch deploy gate on saturation band

When preflight reports weak_signal AND closed-loop is configured, run
a one-shot force_run on the evolved description and gate the deploy
decision on closed-loop signal via _check_cl_primary_gate.

Three abort paths are written to gate_decision.json with diagnostic
payloads (schema v5):
  - cl_eval_failed: force_run raised an exception
  - cl_eval_incomplete: one or more evolved CL tasks abstained
    (runner errored — distinguished from genuine task failure via
    the existing TaskResult.abstained field)
  - cl_primary_gate reject: returned by the gate helper itself

_check_absolute_char_ceiling is preserved in the CL-primary path —
wallpaper protection is orthogonal to which signal we gate on. All
other bands (healthy / no_headroom / uniform_failure / no preflight)
fall through to the existing synthetic path unchanged.

fix(evolve_tool): narrow cl_constraint type, surface saved-variant path

Code review found two minor issues in the CL-primary branch added by
ae1fea6:

1. cl_constraint: Optional[ConstraintResult] flows into a
   list[ConstraintResult] without type narrowing at the post-branch
   growth_constraints assignment. Added an assert so the type checker
   sees the correlation between the two 'if use_cl_primary:' blocks.

2. Both new abort paths wrote evolved_FAILED.json but skipped the
   'Saved failed variant to {path}' console line that existing abort
   paths print. Operators triaging a flake need to know the file was
   saved and where; added the print to both new paths.

No behavior change for any test.
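The narrowing fix in item 1 follows a common pattern: a value is `Optional` only because it is set in one branch, and an `assert` tells the type checker that the second branch sees the non-`None` case. A minimal sketch, with a stand-in `ConstraintResult` and a hypothetical `collect_growth_constraints` wrapper:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConstraintResult:
    """Stand-in for the project's ConstraintResult."""
    name: str
    passed: bool


def collect_growth_constraints(
    use_cl_primary: bool,
    cl_constraint: Optional[ConstraintResult],
    synthetic_constraints: list,
) -> list:
    """Return the constraints the deploy decision is gated on."""
    if use_cl_primary:
        # The checker cannot see that cl_constraint was set whenever
        # use_cl_primary is True; the assert narrows Optional[...] to the value.
        assert cl_constraint is not None, "CL-primary path must set cl_constraint"
        return [cl_constraint]
    return list(synthetic_constraints)
```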

feat(evolve_tool): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 across all four gate_decision write sites
(static-fail, cl_eval_failed, cl_eval_incomplete, success/reject). The
bump is additive. decision_signal is written for synthetic-mode runs as
well; the remaining new fields are present only when
use_cl_primary == True:
  baseline_closed_loop_per_example, evolved_closed_loop_per_example,
  evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain,
  synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score,
  validator_agent_model.

When preflight was skipped (--no-saturation-check), records
reason_synthetic: 'preflight_skipped' so downstream consumers can
distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

cl_required_gain and synthetic_sanity_check reuse the
CL_PRIMARY_GROWTH_SLOPE / CL_PRIMARY_GROWTH_FREE_THRESHOLD /
CL_PRIMARY_SYNTH_TOLERANCE constants from quality_gate.py so the
gate-decision payload can't drift from the actual gate logic.

Existing v4 consumers see byte-identical output for synthetic-mode
runs except the new decision_signal: 'synthetic' string.

test(evolve_tool): integration tests for CL-aware deploy gate

10 tests covering the deploy-gate branch on saturation band:
  - weak_signal triggers evolved CL eval (force_run called post-GEPA)
  - healthy/no_headroom fall through to synthetic
  - --no-saturation-check records reason_synthetic in JSON
  - all v5 fields present + correct types
  - force_run failure writes aborted decision with diagnostics
  - evolved task abstention writes cl_eval_incomplete (not regression)
  - absolute_char_ceiling still enforced in CL-primary path

Mocks the synthetic dataset builder + closed-loop cache at the same
seams as test_evolve_tool_saturation_preflight.py; calls evolve()
directly (rather than via CliRunner) so each test can inspect
gate_decision.json at a pinned output_dir.
…, pin evolved_FAILED.json

Code-review feedback on the CL-aware gate test suite:

1. Test 1's force_run assertion was substring-based on str(call_args),
   which silently misses regressions where force_run is called twice
   or with extra kwargs. Tightened to assert_called_once_with.

2. Added test_uniform_failure_band_falls_through_to_synthetic_gate
   pinning the spec edge-case (uniform_failure -> synthetic path).
   Without it, expanding use_cl_primary to include uniform_failure
   would silently change behavior without a test failing.

3. Test 9 (cl_eval_incomplete) now asserts evolved_FAILED.json is
   written, mirroring Test 8's assertion on the cl_eval_failed
   abort path. Production writes the file on both abort paths.
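The difference described in item 1 is easy to reproduce with `unittest.mock`. In this sketch the mock stands in for `force_run` and the argument string is illustrative; it shows a substring check on `str(call_args)` passing even though the function was called twice, while `assert_called_once_with` fails.

```python
from unittest.mock import MagicMock

force_run = MagicMock()
force_run("evolved description")
force_run("evolved description")  # regression: an accidental second call

# Substring check: only inspects the most recent call, so it still passes.
substring_check_passes = "evolved description" in str(force_run.call_args)

# Strict check: fails because the mock was called twice.
try:
    force_run.assert_called_once_with("evolved description")
    strict_check_passes = True
except AssertionError:
    strict_check_passes = False
```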

test(evolve_tool): schema v5 regression tests

Pin the v4 → v5 additivity contract: every v4 field must still exist
in v5 output, plus decision_signal (always) and the CL-specific fields
(when use_cl_primary fired). Future schema bumps should add a
TestSchemaV{N}Regression class following this pattern.
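One way to express the additivity contract is a set-difference check over payload keys. The sketch below is illustrative only: `additivity_problems` is a hypothetical helper, and `V4_FIELDS` is a made-up three-field subset standing in for the real v4 field list, which is not shown in this PR. The CL field names are the ones listed in the PR description.

```python
# Illustrative subset only; the real v4 field list lives in the repository.
V4_FIELDS = {"schema_version", "decision", "reason"}

CL_FIELDS = {
    "baseline_closed_loop_per_example",
    "evolved_closed_loop_per_example",
    "evolved_closed_loop_errored_tasks",
    "cl_tasks_gained",
    "cl_required_gain",
    "synthetic_sanity_check",
    "evolved_cl_eval_cost_usd",
    "band_trigger_score",
    "validator_agent_model",
}


def additivity_problems(payload: dict, cl_primary: bool) -> list:
    """List every way a v5 payload breaks the v4 -> v5 additivity contract."""
    present = set(payload)
    problems = [f"missing v4 field: {name}" for name in sorted(V4_FIELDS - present)]
    if "decision_signal" not in present:
        problems.append("missing decision_signal")
    if cl_primary:
        problems += [f"missing CL field: {name}" for name in sorted(CL_FIELDS - present)]
    return problems
```

A regression test can then assert that the returned list is empty for each write site, which keeps the failure message specific about which field went missing.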
…e sites

Final-review feedback caught two seam leaks (items 1 and 2); items 3 and 4 are follow-up changes:

1. write_cost_ceiling_abort was hard-coding schema_version=4. If the
   cost ceiling tripped during the CL-primary force_run call, the
   resulting gate_decision.json carried v4 in a v5 directory. Made
   schema_version a keyword arg (default 4 for skill-side callers
   that haven't bumped yet); tool-side passes 5.

2. The static_constraint_failure payload was bumped to v5 in Task 4
   but never had decision_signal added. Every other v5 path has it.
   Set to 'synthetic' since static-fail fires before any CL eval.

3. Extended TestSchemaV5Regression with abort-path coverage so that
   issues like the two above cannot slip through again. Three new
   tests pin schema_version and decision_signal on cl_eval_failed,
   cl_eval_incomplete, and static_constraint_failure payloads.

4. Renamed test_accepts_at_pr_68_calibration_point to
   test_accepts_at_24char_baseline_calibration_point per the project
   convention against exposing internal PR numbers in code.
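The shape of the fix in item 1 is a keyword-only argument with a backward-compatible default. The function below is a hypothetical stand-in, not `write_cost_ceiling_abort` itself: its name, its parameters other than `schema_version`, and the `reason` value are assumptions made for illustration.

```python
def build_cost_ceiling_abort(
    spent_usd: float, ceiling_usd: float, *, schema_version: int = 4
) -> dict:
    """Build an abort payload; callers that have bumped pass schema_version=5."""
    return {
        "schema_version": schema_version,
        "decision": "aborted",
        "reason": "cost_ceiling",  # assumed reason string
        "spent_usd": spent_usd,
        "ceiling_usd": ceiling_usd,
    }


# Skill-side caller that has not bumped yet keeps the default.
skill_payload = build_cost_ceiling_abort(21.0, 20.0)
# Tool-side caller opts in to v5 explicitly.
tool_payload = build_cost_ceiling_abort(21.0, 20.0, schema_version=5)
```

Making the argument keyword-only keeps existing positional call sites working while forcing new callers to name the version they emit.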
@jramos jramos changed the title from "feat: deploy-gate CL-awareness (closes the PR #68 deploy-gap finding)" to "feat: deploy-gate CL-awareness" May 23, 2026
@jramos jramos merged commit 2ea480e into main May 23, 2026
7 of 8 checks passed
@jramos jramos deleted the feat/deploy-gate-cl-awareness branch May 23, 2026 19:24
jramos added a commit that referenced this pull request May 24, 2026
)

* refactor(evolve_skill): preserve SaturationReport fields for deploy gate

Symmetric to evolve_tool's CL-aware gate work. Today only
sat_report.holdout_per_example survives past the preflight call site;
the CL-aware deploy gate work needs band + cl_per_example + the two
trigger scores too. Bind four new locals; all default to None on the
--no-saturation-check path so the deploy gate can branch safely.

No behavior change; existing tests pass unchanged.

* feat(evolve_skill): branch deploy gate on saturation band

Symmetric to evolve_tool's CL-aware gate (already on main). When
preflight reports weak_signal AND closed-loop is configured, run
force_run(evolved_body) post-GEPA and gate on _check_cl_primary_gate.

Three abort paths writing schema-v5 gate_decision.json:
  - cl_eval_failed: force_run raised
  - cl_eval_incomplete: evolved CL task abstained (runner errored —
    distinguished from genuine task failure via TaskResult.abstained)
  - cl_primary_gate reject: returned by the gate helper itself

_check_absolute_chars preserved in CL-primary path; all other bands
fall through to today's synthetic path unchanged.

Key skill-specific differences from tool-side:
- force_run called with evolved_body (no YAML frontmatter) to match
  the cache key the preflight set up with baseline body.
- Abort path writes evolved_FAILED.md (matching existing skill reject
  convention), not evolved_FAILED.json.
- growth_pct uses evolved_full vs skill["raw"] (existing convention).

run_inputs literal hoisted to a local since the same dict now appears
in 3 places (success + 2 abort paths).
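The body-versus-full distinction matters because a cache keyed on the baseline body will not be hit by a string that still carries frontmatter. The sketch below shows one way to separate the two; `split_frontmatter` is a hypothetical helper that assumes frontmatter is delimited by `---` lines at the very start of the file, and it is not necessarily how evolve_skill derives the body.

```python
def split_frontmatter(raw: str) -> tuple:
    """Split a skill file into (frontmatter, body); frontmatter may be empty."""
    if raw.startswith("---\n"):
        # Search from index 3 so an empty frontmatter block ("---\n---\n") is found too.
        end = raw.find("\n---\n", 3)
        if end != -1:
            cut = end + len("\n---\n")
            return raw[:cut], raw[cut:]
    return "", raw


raw_skill = "---\nname: demo\n---\nUse the debugger first.\n"
frontmatter, body = split_frontmatter(raw_skill)
```

Here `body` is the string that would be passed to the closed-loop evaluation, while growth would still be measured on the full text per the existing convention noted above.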

* feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 (additive). New fields:
  decision_signal (always), baseline_closed_loop_per_example,
  evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks,
  cl_tasks_gained, cl_required_gain, synthetic_sanity_check,
  evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model
  (only when use_cl_primary fired).

reason_synthetic: 'preflight_skipped' when --no-saturation-check was
passed, so downstream consumers can distinguish 'preflight saw no
weak_signal' from 'preflight didn't run.'

Existing v4 consumers see byte-identical output for synthetic-mode
skill runs aside from the new decision_signal string. v4-specific
skill fields (bap_max_growth, bap_safety_margin, eval_source,
fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score)
all preserved.

Lockstep with evolve_tool: both at schema v5 after this lands.

* test(evolve_skill): integration + schema regression for CL-aware gate

13 tests symmetric to tests/tools/test_evolve_tool_cl_aware_gate.py
plus 3 skill-specific guards:
  - force_run called with body, not full (frontmatter key bug guard)
  - evolved_FAILED.md (not .json) on abort
  - v4 skill payload fields (bap_*, eval_source, fitness_profile,
    proposer_mode, knee_point.band_roster) preserved in v5

Mocks via the existing test_evolve_skill_saturation_preflight pattern.
Calls evolve() directly rather than via CliRunner (matches the pattern
PR #69 settled on after its own CliRunner deviation).

* test(evolve_skill): pin CL-primary path in absolute-char-ceiling test

Code review flagged test 10 as misnaming/misasserting the path under
test (claiming static gate at evolve_skill.py:1034 fires first with
decision_signal='synthetic'). Empirical verification shows
validate_static at line 1034 only runs size_limit/non_empty/skill_structure
— the absolute_char_ceiling check lives at line 1271 inside the
use_cl_primary branch and runs AFTER force_run. Rejection carries
decision_signal='closed_loop'.

Keep the original (accurate) test name, expand the docstring to spell
out the verified production flow, and add two assertions that pin the
real CL-primary path: decision_signal == 'closed_loop' and
force_run.assert_called_once_with(body).

Also fix a minor miscount in the module docstring frontmatter math
(was '43 chars', actually 42).

* test: weakened systematic-debugging fixture for CL-aware-gate smoke

Lands in weak_signal band against evolution/validation/suites/systematic_debugging.jsonl
at seed 42 with openai/gpt-5-mini validator. Mirrors
tests/fixtures/tool_manifests/weakened_write_file_manifest.json from
the tool-side work — checked in so the post-merge manual smoke for the
CL-aware skill deploy gate is reproducible.

The fixture is a misdirection variant ("Python Bug Diagnostician") that
looks plausible to the LLM judge (synthetic holdout ~0.95) while
instructing the agent to write a diagnostic report instead of editing
the file, so the planted-bug closed-loop suite scores in the
0.15-0.95 band rather than saturating.

Empirical probe results (seed 42, eval-model gpt-4.1-mini,
closed-loop-agent-model gpt-5-mini, --eval-dataset-size 50):
  Probe 1: Band=weak_signal, Holdout=0.957, CL=0.800 (4/5)
  Probe 2: Band=weak_signal, Holdout=0.957, CL=0.200 (1/5)
  Probe 3: Band=weak_signal, Holdout=0.969, CL=0.600 (3/5)

CL pass-rate varies run-to-run because the 5-task suite has 0.2
granularity and gpt-5-mini's adherence to the "don't edit" framing is
non-deterministic, but every probe landed in the weak_signal band,
which is what the manual smoke needs to exercise the CL-primary gate
path.

Earlier probes confirmed the binary-tier saturation trap from prior
tool-side work:
  - Passive weakenings (v1: stripped methodology, v3: one-sentence
    skill) both landed in no_headroom with gpt-5-mini (CL=1.0) and
    uniform_failure with gpt-5-nano (CL=0.0). The validator model
    capacity, not the skill content, drives the closed-loop signal on
    these planted bugs.
  - Only active misdirection (this fixture) creates the synthetic /
    closed-loop disagreement the weak_signal band requires.