cleanup: extract run_inputs + CL decision fields helpers; surface CL gains in summary panel#71
Merged
Merged
Conversation
Five sites built the same `run_inputs` dict by hand (3 skill, 2 tool). Two latent inconsistencies fall out of normalizing through one helper: - evolve_tool's deploy-path literal was missing `eval_source`, which the skill side has always included. The helper restores it. - evolve_tool's cost-ceiling fallback was missing `enable_confusable_bucket`, which the deploy-path schema test asserts is present. Whenever the cost ceiling tripped on the deploy path, the fallback fired and silently dropped the field; routing through the helper closes that gap. The helper's `quality_gate_preset`/`fitness_profile`/`enable_confusable_bucket` kwargs make both the shape contract and the optional-on-skill / required-on-tool asymmetry explicit at every call site.
Address review feedback on 47cabe8: - Drop historical narrative from helper and test module docstrings; keep the load-bearing "what does this produce" sentence. - Downgrade test_enable_confusable_bucket_round_trips docstring to honestly describe what the test covers (helper round-trip only, not the cost-ceiling call site). - Remove redundant len() assertions in shape tests (the set equality already pins the count) and rename test_skill_side_has_nine_keys. - Remove stale "hoist run_inputs to a local" comment; the helper call makes the intent obvious.
…_gate The same 9-field CL-decision mutation block was duplicated byte-for-byte in evolve_tool.py and evolve_skill.py. Move it to a shared helper so the deploy-path callsites collapse to one call each and the CL_PRIMARY_* / math.ceil details live in one place. Scope is the success-path block only. The cl_eval_failed / cl_eval_incomplete abort callsites also write CL-related fields, but with a different shape (flat dict literal, no synthetic_sanity_check, no cl_tasks_gained / cl_required_gain) — forcing them through this helper would either inflate the abort payload or require ad-hoc kwargs that obscure the contract. Left as a follow-up. evolved_cl_errored_task_ids defaults to () so the deploy-path caller stays empty by construction; abort-path adoption can pass the populated list without diverging the signature.
Address review feedback on 5edd73d: - Drop evolved_cl_errored_task_ids kwarg and its Sequence import. The deferred abort paths have a different scaffold and won't adopt this helper; the kwarg + its forward-looking docstring were YAGNI. - Hard-code evolved_closed_loop_errored_tasks = [] in the helper body (matches deploy-path semantics: success path has no errors by construction). - Trim docstring to one line. - Rename test_errored_task_ids_defaults_to_empty_list to test_errored_tasks_is_empty_list (no longer about a default).
The panel rendered "did not improve" on a CL-primary deploy because the synthetic delta is often near zero (or negative) when the closed-loop gate is the one driving the decision. Now under use_cl_primary the deploy line announces the CL task gain and the reject line names the shortfall vs the required gain; the Evolution Results table picks up a "Closed-loop (behavioral)" row and colors all rows by the gate verdict rather than the irrelevant synthetic delta.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes three follow-ups from the deploy-gate CL-awareness arc:
1. Extract
build_run_inputstoevolution/core/run_inputs.py. Therun_inputsdict literal had drifted across five call sites inevolve_skill.pyandevolve_tool.py. Single helper, conditional inclusion of tool-only kwargs (fitness_profile,enable_confusable_bucket) preserves the skill-side 10-key shape and the tool-side 12-key shape. As a structural side effect, this fixes two pre-existing inconsistencies:eval_source— silently divergent from the skill side.enable_confusable_bucket— a latent bug, sincetests/tools/test_evolve_tool_validation_flow.py::TestGateDecisionSchemaOnDeployasserts the field is present. Anyone who hit the cost ceiling on a deploy path was writing agate_decision.jsonthat would have failed schema validation.2. Extract
append_cl_decision_fieldstoevolution/core/quality_gate.py. The 9-field mutation block that augmentsdecision_payloadwhenuse_cl_primary=Truewas byte-identical between the two evolve modules. Sits next to_check_cl_primary_gateand theCL_PRIMARY_*constants, matching the established pattern. Removes the now-unusedmathimport and the threeCL_PRIMARY_*constant imports from both evolve modules.3. Surface CL gains in the summary panel when CL-primary deployed. Before this PR, the run-end summary printed
⚠ Evolution did not improve …even when the CL-aware gate had just deployed an artifact whose closed-loop signal gained tasks on a saturated synthetic baseline — the exact scenario the new gate is designed to ship. The panel now consultsdecision_payload["decision_signal"]and renders:✓ Evolution improved {tool description|skill} (CL gained +N tasks)in green⚠ Evolution rejected: CL gain N < required Min yellowThe Evolution Results table gets a new
Closed-loop (behavioral)row whenuse_cl_primary, and the row-color logic switches from raw synthetic delta to the gate's actual verdict (growth_pass) under CL-primary.Two extractions deliberately deferred (different scaffold, not clean wins):
build_cl_constraint()setup helper — well-isolated 6-line block, low maintenance cost todaywrite_cl_abort_decision()factory — the abort-path dicts (cl_eval_failed,cl_eval_incomplete) are flat-dict literals with different field sets, not in-place mutationsNo behavior change to the gate itself.
Test plan
env -i HOME="\$HOME" PATH="\$PATH" OPENAI_API_KEY="sk-fake-test-key" uv run pytest -q— full suite green (1144 passed locally)tests/core/test_run_inputs.py— 4 new helper teststests/core/test_append_cl_decision_fields.py— 4 new helper tests including constant-drift regression and synth-tolerance boundarytests/skills/test_evolve_skill_validation_flow.py::TestRunInputs— schema contract preserved (skill 10-key)tests/tools/test_evolve_tool_validation_flow.py::TestGateDecisionSchemaOnDeploy— schema contract preserved (enable_confusable_bucket is True)tests/{tools,skills}/test_evolve_*_cl_aware_gate.py— 35 passes confirm byte-for-byte JSON output preserved across both extractions; +6 new console-output tests cover summary-panel CL-awareness on deploy, reject, and the synthetic-path regression