cleanup: extract run_inputs + CL decision fields helpers; surface CL gains in summary panel by jramos · Pull Request #71 · jramos/agent-self-evolution

jramos · 2026-05-24T04:38:30Z

Summary

Closes three follow-ups from the deploy-gate CL-awareness arc:

1. Extract build_run_inputs to evolution/core/run_inputs.py. The run_inputs dict literal had drifted across five call sites in evolve_skill.py and evolve_tool.py. Single helper, conditional inclusion of tool-only kwargs (fitness_profile, enable_confusable_bucket) preserves the skill-side 10-key shape and the tool-side 12-key shape. As a structural side effect, this fixes two pre-existing inconsistencies:

Tool deploy-path was missing eval_source — silently divergent from the skill side.
Tool cost-ceiling fallback was missing enable_confusable_bucket — a latent bug, since tests/tools/test_evolve_tool_validation_flow.py::TestGateDecisionSchemaOnDeploy asserts the field is present. Anyone who hit the cost ceiling on a deploy path was writing a gate_decision.json that would have failed schema validation.

2. Extract append_cl_decision_fields to evolution/core/quality_gate.py. The 9-field mutation block that augments decision_payload when use_cl_primary=True was byte-identical between the two evolve modules. Sits next to _check_cl_primary_gate and the CL_PRIMARY_* constants, matching the established pattern. Removes the now-unused math import and the three CL_PRIMARY_* constant imports from both evolve modules.

3. Surface CL gains in the summary panel when CL-primary deployed. Before this PR, the run-end summary printed ⚠ Evolution did not improve … even when the CL-aware gate had just deployed an artifact whose closed-loop signal gained tasks on a saturated synthetic baseline — the exact scenario the new gate is designed to ship. The panel now consults decision_payload["decision_signal"] and renders:

CL-primary deploy: ✓ Evolution improved {tool description|skill} (CL gained +N tasks) in green
CL-primary reject: ⚠ Evolution rejected: CL gain N < required M in yellow
Synthetic path: existing behavior preserved (regression-tested)

The Evolution Results table gets a new Closed-loop (behavioral) row when use_cl_primary, and the row-color logic switches from raw synthetic delta to the gate's actual verdict (growth_pass) under CL-primary.
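The branching described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `cl_summary_line`, the marker value `"closed_loop"`, and the assumption that `growth_pass`, `cl_tasks_gained`, and `cl_required_gain` are all readable from `decision_payload` are hypothetical. Only the message wording and colors come from the description.

```python
from typing import Any, Optional, Tuple

# Hypothetical marker value; the real decision_signal strings are not shown in this PR.
CL_PRIMARY_SIGNAL = "closed_loop"


def cl_summary_line(
    decision_payload: dict[str, Any], artifact_label: str
) -> Optional[Tuple[str, str]]:
    """Return (message, color) for a CL-primary decision, or None otherwise."""
    if decision_payload.get("decision_signal") != CL_PRIMARY_SIGNAL:
        # Synthetic path: the caller keeps its existing rendering.
        return None
    gained = decision_payload["cl_tasks_gained"]
    required = decision_payload["cl_required_gain"]
    # Assumption: the gate verdict is stored under "growth_pass".
    if decision_payload["growth_pass"]:
        message = f"✓ Evolution improved {artifact_label} (CL gained +{gained} tasks)"
        return (message, "green")
    return (f"⚠ Evolution rejected: CL gain {gained} < required {required}", "yellow")


deploy = {
    "decision_signal": "closed_loop",
    "growth_pass": True,
    "cl_tasks_gained": 2,
    "cl_required_gain": 1,
}
print(cl_summary_line(deploy, "skill"))
```

Returning `None` on the synthetic path keeps the pre-existing behavior out of the sketch entirely, which mirrors the "existing behavior preserved" line in the list above.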

Two extractions deliberately deferred (different scaffold, not clean wins):

build_cl_constraint() setup helper — well-isolated 6-line block, low maintenance cost today
write_cl_abort_decision() factory — the abort-path dicts (cl_eval_failed, cl_eval_incomplete) are flat-dict literals with different field sets, not in-place mutations

No behavior change to the gate itself.

Test plan

env -i HOME="$HOME" PATH="$PATH" OPENAI_API_KEY="sk-fake-test-key" uv run pytest -q: full suite green (1144 passed locally)
tests/core/test_run_inputs.py — 4 new helper tests
tests/core/test_append_cl_decision_fields.py — 4 new helper tests including constant-drift regression and synth-tolerance boundary
tests/skills/test_evolve_skill_validation_flow.py::TestRunInputs — schema contract preserved (skill 10-key)
tests/tools/test_evolve_tool_validation_flow.py::TestGateDecisionSchemaOnDeploy — schema contract preserved (enable_confusable_bucket is True)
tests/{tools,skills}/test_evolve_*_cl_aware_gate.py — 35 passes confirm byte-for-byte JSON output preserved across both extractions; +6 new console-output tests cover summary-panel CL-awareness on deploy, reject, and the synthetic-path regression
CI green across all Python versions

Five sites built the same `run_inputs` dict by hand (3 skill, 2 tool). Two latent inconsistencies fall out of normalizing through one helper:

- evolve_tool's deploy-path literal was missing `eval_source`, which the skill side has always included. The helper restores it.
- evolve_tool's cost-ceiling fallback was missing `enable_confusable_bucket`, which the deploy-path schema test asserts is present. Whenever the cost ceiling tripped on the deploy path, the fallback fired and silently dropped the field; routing through the helper closes that gap.

The helper's `quality_gate_preset`/`fitness_profile`/`enable_confusable_bucket` kwargs make both the shape contract and the optional-on-skill / required-on-tool asymmetry explicit at every call site.
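A minimal sketch of the conditional-inclusion idea, under stated assumptions: the real signature of `build_run_inputs` is not shown here, so the `**shared` catch-all, the `None` sentinels, and every value below (`"synthetic"`, `"default"`, `"balanced"`, `iterations`) are placeholders. Only the four key names are taken from the description.

```python
from typing import Any, Optional


def build_run_inputs(
    *,
    eval_source: str,
    quality_gate_preset: str,
    fitness_profile: Optional[str] = None,
    enable_confusable_bucket: Optional[bool] = None,
    **shared: Any,  # placeholder for the remaining keys both sides share
) -> dict[str, Any]:
    """Return the run_inputs dict; tool-only keys appear only when supplied."""
    run_inputs: dict[str, Any] = {
        **shared,
        "eval_source": eval_source,
        "quality_gate_preset": quality_gate_preset,
    }
    # Tool-only kwargs are omitted entirely (not set to None) when absent,
    # so the skill-side key set is unaffected by their existence.
    if fitness_profile is not None:
        run_inputs["fitness_profile"] = fitness_profile
    if enable_confusable_bucket is not None:
        run_inputs["enable_confusable_bucket"] = enable_confusable_bucket
    return run_inputs


skill_inputs = build_run_inputs(
    eval_source="synthetic", quality_gate_preset="default", iterations=5
)
tool_inputs = build_run_inputs(
    eval_source="synthetic",
    quality_gate_preset="default",
    fitness_profile="balanced",
    enable_confusable_bucket=True,
    iterations=5,
)
# The tool side carries exactly two extra keys.
print(sorted(set(tool_inputs) - set(skill_inputs)))
```

Because every call site goes through the same function, a key such as `eval_source` cannot be dropped on one path without a `TypeError` at the call.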

Address review feedback on 47cabe8:

- Drop historical narrative from helper and test module docstrings; keep the load-bearing "what does this produce" sentence.
- Downgrade test_enable_confusable_bucket_round_trips docstring to honestly describe what the test covers (helper round-trip only, not the cost-ceiling call site).
- Remove redundant len() assertions in shape tests (the set equality already pins the count) and rename test_skill_side_has_nine_keys.
- Remove stale "hoist run_inputs to a local" comment; the helper call makes the intent obvious.
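The point about redundant `len()` assertions can be seen in a small standalone example. The dict contents here are placeholders, not the real `run_inputs` shape: because dict keys are unique, equality between the key set and the expected set already fixes the number of keys.

```python
run_inputs = {"eval_source": "synthetic", "quality_gate_preset": "default"}
expected_keys = {"eval_source", "quality_gate_preset"}

# Dict keys are unique, so equal key sets imply equal counts.
assert set(run_inputs) == expected_keys
assert len(run_inputs) == len(expected_keys)  # redundant: follows from the line above

# An extra key already breaks the set equality, without any len() check.
drifted = {**run_inputs, "fitness_profile": "balanced"}
assert set(drifted) != expected_keys
```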

…_gate

The same 9-field CL-decision mutation block was duplicated byte-for-byte in evolve_tool.py and evolve_skill.py. Move it to a shared helper so the deploy-path callsites collapse to one call each and the CL_PRIMARY_* / math.ceil details live in one place.

Scope is the success-path block only. The cl_eval_failed / cl_eval_incomplete abort callsites also write CL-related fields, but with a different shape (flat dict literal, no synthetic_sanity_check, no cl_tasks_gained / cl_required_gain). Forcing them through this helper would either inflate the abort payload or require ad-hoc kwargs that obscure the contract. Left as a follow-up.

evolved_cl_errored_task_ids defaults to () so the deploy-path caller stays empty by construction; abort-path adoption can pass the populated list without diverging the signature.

Address review feedback on 5edd73d:

- Drop evolved_cl_errored_task_ids kwarg and its Sequence import. The deferred abort paths have a different scaffold and won't adopt this helper; the kwarg + its forward-looking docstring were YAGNI.
- Hard-code evolved_closed_loop_errored_tasks = [] in the helper body (matches deploy-path semantics: success path has no errors by construction).
- Trim docstring to one line.
- Rename test_errored_task_ids_defaults_to_empty_list to test_errored_tasks_is_empty_list (no longer about a default).
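For orientation, an in-place helper of this kind might look like the sketch below. It is heavily simplified and largely assumed: it writes only three of the nine fields, the parameter names are invented, and both the constant `HYPOTHETICAL_CL_PRIMARY_MIN_GAIN_FRACTION` and the required-gain formula are stand-ins, since the real CL_PRIMARY_* constants and the `math.ceil` expression are not shown. Only the three field names and the hard-coded empty list come from the text.

```python
import math
from typing import Any

# Hypothetical stand-in for one of the CL_PRIMARY_* constants (name and value assumed).
HYPOTHETICAL_CL_PRIMARY_MIN_GAIN_FRACTION = 0.1


def append_cl_decision_fields(
    decision_payload: dict[str, Any],
    *,
    baseline_cl_passed: int,
    evolved_cl_passed: int,
    cl_task_count: int,
) -> None:
    """Add CL-primary fields to decision_payload in place (reduced field set)."""
    # Assumed formula: required gain is a fraction of the task count, rounded up.
    required = math.ceil(HYPOTHETICAL_CL_PRIMARY_MIN_GAIN_FRACTION * cl_task_count)
    decision_payload["cl_tasks_gained"] = evolved_cl_passed - baseline_cl_passed
    decision_payload["cl_required_gain"] = required
    # Success path has no errored tasks by construction, so this is hard-coded.
    decision_payload["evolved_closed_loop_errored_tasks"] = []


payload: dict[str, Any] = {"decision": "deploy"}  # placeholder starting payload
append_cl_decision_fields(
    payload, baseline_cl_passed=3, evolved_cl_passed=5, cl_task_count=12
)
print(payload)
```

Mutating the payload in place, rather than returning a new dict, matches the description of the original block as a mutation of `decision_payload`, which is also why the flat-literal abort paths do not fit this helper.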

The panel rendered "did not improve" on a CL-primary deploy because the synthetic delta is often near zero (or negative) when the closed-loop gate is the one driving the decision. Now under use_cl_primary the deploy line announces the CL task gain and the reject line names the shortfall vs the required gain; the Evolution Results table picks up a "Closed-loop (behavioral)" row and colors all rows by the gate verdict rather than the irrelevant synthetic delta.

jramos added 5 commits May 23, 2026 21:56

jramos merged commit 8d87a3c into main May 24, 2026
4 checks passed

jramos deleted the cleanup/run-inputs-and-summary-panel branch May 24, 2026 13:47
