cleanup: extract run_inputs + CL decision fields helpers; surface CL gains in summary panel #71

Merged
jramos merged 5 commits into main from cleanup/run-inputs-and-summary-panel
May 24, 2026

Conversation

@jramos jramos commented May 24, 2026

Summary

Closes three follow-ups from the deploy-gate CL-awareness arc:

1. Extract build_run_inputs to evolution/core/run_inputs.py. The run_inputs dict literal had drifted across five call sites in evolve_skill.py and evolve_tool.py. A single helper now builds it; the tool-only kwargs (fitness_profile, enable_confusable_bucket) are included conditionally, which preserves the skill-side 10-key shape and the tool-side 12-key shape. As a structural side effect, this fixes two pre-existing inconsistencies:

  • Tool deploy-path was missing eval_source — silently divergent from the skill side.
  • Tool cost-ceiling fallback was missing enable_confusable_bucket — a latent bug, since tests/tools/test_evolve_tool_validation_flow.py::TestGateDecisionSchemaOnDeploy asserts the field is present. Anyone who hit the cost ceiling on a deploy path was writing a gate_decision.json that would have failed schema validation.
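A minimal sketch of the conditional-inclusion idea, not the actual helper: the function name, the `common` parameter, the sentinel, and all example values below are assumptions for illustration. Only the key names eval_source, quality_gate_preset, fitness_profile, and enable_confusable_bucket come from this PR.

```python
from typing import Any

# Sentinel so "not passed" is distinguishable from an explicit None or False.
_UNSET = object()


def build_run_inputs_sketch(
    *,
    common: dict[str, Any],
    fitness_profile: Any = _UNSET,
    enable_confusable_bucket: Any = _UNSET,
) -> dict[str, Any]:
    """Return a run_inputs dict; tool-only keys appear only when supplied."""
    run_inputs = dict(common)  # copy so the caller's dict is never mutated
    if fitness_profile is not _UNSET:
        run_inputs["fitness_profile"] = fitness_profile
    if enable_confusable_bucket is not _UNSET:
        run_inputs["enable_confusable_bucket"] = enable_confusable_bucket
    return run_inputs


# Placeholder values; the real helper takes more shared keys than shown here.
common = {"eval_source": "synthetic", "quality_gate_preset": "default"}
skill_inputs = build_run_inputs_sketch(common=common)
tool_inputs = build_run_inputs_sketch(
    common=common, fitness_profile="balanced", enable_confusable_bucket=True
)
```

Because every call site goes through one function, a tool-side caller cannot drop eval_source or enable_confusable_bucket by hand-editing a literal, which is the class of drift described above.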

2. Extract append_cl_decision_fields to evolution/core/quality_gate.py. The 9-field mutation block that augments decision_payload when use_cl_primary=True was byte-identical between the two evolve modules. Sits next to _check_cl_primary_gate and the CL_PRIMARY_* constants, matching the established pattern. Removes the now-unused math import and the three CL_PRIMARY_* constant imports from both evolve modules.
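As a rough illustration of an in-place mutation helper of this kind, the sketch below writes a subset of the CL fields. The signature, the constant name and value, the "cl_primary" string, and the required-gain formula are all assumptions; only the field names cl_tasks_gained, cl_required_gain, evolved_closed_loop_errored_tasks, and decision_signal, and the use of math.ceil, appear in this PR.

```python
import math
from typing import Any

# Hypothetical stand-in for one of the CL_PRIMARY_* constants.
CL_PRIMARY_MIN_GAIN_FRACTION = 0.25


def append_cl_decision_fields_sketch(
    decision_payload: dict[str, Any],
    *,
    baseline_cl_passed: int,
    evolved_cl_passed: int,
    total_cl_tasks: int,
) -> None:
    """Mutate decision_payload in place with a subset of the CL fields."""
    # Assumed rule: required gain is a fraction of the task count, rounded up.
    required = math.ceil(CL_PRIMARY_MIN_GAIN_FRACTION * total_cl_tasks)
    decision_payload["decision_signal"] = "cl_primary"  # assumed value
    decision_payload["cl_tasks_gained"] = evolved_cl_passed - baseline_cl_passed
    decision_payload["cl_required_gain"] = required
    # The success path has no errored tasks by construction.
    decision_payload["evolved_closed_loop_errored_tasks"] = []


payload: dict[str, Any] = {"deploy": True}
append_cl_decision_fields_sketch(
    payload, baseline_cl_passed=3, evolved_cl_passed=7, total_cl_tasks=10
)
```

Keeping the constant and the rounding in one module means both evolve modules produce the same payload without importing math or the constants themselves.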

3. Surface CL gains in the summary panel when CL-primary deployed. Before this PR, the run-end summary printed ⚠ Evolution did not improve … even when the CL-aware gate had just deployed an artifact whose closed-loop signal gained tasks on a saturated synthetic baseline — the exact scenario the new gate is designed to ship. The panel now consults decision_payload["decision_signal"] and renders:

  • CL-primary deploy: ✓ Evolution improved {tool description|skill} (CL gained +N tasks) in green
  • CL-primary reject: ⚠ Evolution rejected: CL gain N < required M in yellow
  • Synthetic path: existing behavior preserved (regression-tested)

The Evolution Results table gets a new Closed-loop (behavioral) row when use_cl_primary, and the row-color logic switches from raw synthetic delta to the gate's actual verdict (growth_pass) under CL-primary.
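The branching described above could be sketched as a pure function that returns a color and a message. This is an illustration under assumptions: the function name, the "cl_primary" signal value, and the presence of growth_pass as a key in decision_payload are not confirmed by this PR. The synthetic path returns None here to stand for "fall through to the existing behavior".

```python
from typing import Any, Optional


def cl_summary_line_sketch(
    decision_payload: dict[str, Any], artifact_label: str
) -> Optional[tuple[str, str]]:
    """Return (color, message) under CL-primary, or None on the synthetic path."""
    if decision_payload.get("decision_signal") != "cl_primary":  # assumed value
        return None  # synthetic path: caller keeps its existing rendering
    gained = decision_payload["cl_tasks_gained"]
    required = decision_payload["cl_required_gain"]
    if decision_payload["growth_pass"]:  # assumed to be stored in the payload
        return (
            "green",
            f"✓ Evolution improved {artifact_label} (CL gained +{gained} tasks)",
        )
    return "yellow", f"⚠ Evolution rejected: CL gain {gained} < required {required}"


deployed = cl_summary_line_sketch(
    {
        "decision_signal": "cl_primary",
        "cl_tasks_gained": 3,
        "cl_required_gain": 2,
        "growth_pass": True,
    },
    "skill",
)
```

Returning data rather than printing keeps the decision testable without capturing console output; the caller can hand the color and message to whatever renderer the panel uses.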

Two extractions deliberately deferred (different scaffold, not clean wins):

  • build_cl_constraint() setup helper — well-isolated 6-line block, low maintenance cost today
  • write_cl_abort_decision() factory — the abort-path dicts (cl_eval_failed, cl_eval_incomplete) are flat-dict literals with different field sets, not in-place mutations

No behavior change to the gate itself.

Test plan

  • env -i HOME="$HOME" PATH="$PATH" OPENAI_API_KEY="sk-fake-test-key" uv run pytest -q: full suite green (1144 passed locally)
  • tests/core/test_run_inputs.py — 4 new helper tests
  • tests/core/test_append_cl_decision_fields.py — 4 new helper tests including constant-drift regression and synth-tolerance boundary
  • tests/skills/test_evolve_skill_validation_flow.py::TestRunInputs — schema contract preserved (skill 10-key)
  • tests/tools/test_evolve_tool_validation_flow.py::TestGateDecisionSchemaOnDeploy — schema contract preserved (enable_confusable_bucket is True)
  • tests/{tools,skills}/test_evolve_*_cl_aware_gate.py — 35 passes confirm byte-for-byte JSON output preserved across both extractions; +6 new console-output tests cover summary-panel CL-awareness on deploy, reject, and the synthetic-path regression
  • CI green across all Python versions

jramos added 5 commits May 23, 2026 21:56
Five sites built the same `run_inputs` dict by hand (3 skill, 2 tool).
Two latent inconsistencies fall out of normalizing through one helper:

- evolve_tool's deploy-path literal was missing `eval_source`, which
  the skill side has always included. The helper restores it.
- evolve_tool's cost-ceiling fallback was missing
  `enable_confusable_bucket`, which the deploy-path schema test
  asserts is present. Whenever the cost ceiling tripped on the
  deploy path, the fallback fired and silently dropped the field;
  routing through the helper closes that gap.

The helper's `quality_gate_preset`/`fitness_profile`/`enable_confusable_bucket`
kwargs make both the shape contract and the optional-on-skill / required-on-tool
asymmetry explicit at every call site.

Address review feedback on 47cabe8:
- Drop historical narrative from helper and test module docstrings;
  keep the load-bearing "what does this produce" sentence.
- Downgrade test_enable_confusable_bucket_round_trips docstring to
  honestly describe what the test covers (helper round-trip only,
  not the cost-ceiling call site).
- Remove redundant len() assertions in shape tests (the set equality
  already pins the count) and rename test_skill_side_has_nine_keys.
- Remove stale "hoist run_inputs to a local" comment; the helper
  call makes the intent obvious.
…_gate

The same 9-field CL-decision mutation block was duplicated byte-for-byte in
evolve_tool.py and evolve_skill.py. Move it to a shared helper so the
deploy-path callsites collapse to one call each and the CL_PRIMARY_* /
math.ceil details live in one place.

Scope is the success-path block only. The cl_eval_failed / cl_eval_incomplete
abort callsites also write CL-related fields, but with a different shape
(flat dict literal, no synthetic_sanity_check, no cl_tasks_gained /
cl_required_gain) — forcing them through this helper would either inflate
the abort payload or require ad-hoc kwargs that obscure the contract. Left
as a follow-up.

evolved_cl_errored_task_ids defaults to () so the deploy-path caller stays
empty by construction; abort-path adoption can pass the populated list
without diverging the signature.

Address review feedback on 5edd73d:
- Drop evolved_cl_errored_task_ids kwarg and its Sequence import.
  The deferred abort paths have a different scaffold and won't adopt
  this helper; the kwarg + its forward-looking docstring were YAGNI.
- Hard-code evolved_closed_loop_errored_tasks = [] in the helper body
  (matches deploy-path semantics: success path has no errors by
  construction).
- Trim docstring to one line.
- Rename test_errored_task_ids_defaults_to_empty_list to
  test_errored_tasks_is_empty_list (no longer about a default).

The panel rendered "did not improve" on a CL-primary deploy because the
synthetic delta is often near zero (or negative) when the closed-loop
gate is the one driving the decision. Now under use_cl_primary the
deploy line announces the CL task gain and the reject line names the
shortfall vs the required gain; the Evolution Results table picks up a
"Closed-loop (behavioral)" row and colors all rows by the gate verdict
rather than the irrelevant synthetic delta.
@jramos jramos merged commit 8d87a3c into main May 24, 2026
4 checks passed
@jramos jramos deleted the cleanup/run-inputs-and-summary-panel branch May 24, 2026 13:47