feat: closed-loop validation follow-ups (agent model override + advanced bug suite)#63
Merged
Conversation
…unner Adds an optional ``model`` kwarg to HermesAgentRunner. When set, the subprocess invocation becomes ``hermes -m MODEL -z ...`` so the agent runs against a deliberately weaker model than the user's daily-driver Hermes default, without mutating ~/.hermes/config.yaml. Surfaced by the manual validation run on the systematic-debugging suite: a capable reasoning model solved all 5 textbook planted bugs on the deliberately-weak baseline skill, leaving the closed-loop verdict at 5/5 across the board and hiding the behavioral signal the validator is supposed to surface. Swapping the model via ``hermes config set`` collapsed the structured ``model:`` dict (provider/base_url/api_key) into a bare string and broke the user's custom-endpoint config, so per-invocation override via the CLI is the cleaner seam. Plumbed through the existing cache helpers on both paths: evolve_skill: --closed-loop-agent-model → _maybe_build_closed_loop_cache_skill evolve_tool: --closed-loop-agent-model → _maybe_build_closed_loop_cache The flag is unset by default — existing behavior is bit-for-bit preserved when callers don't opt in. ``-m MODEL`` is inserted before ``-z`` so hermes parses it as a global flag, not as part of the prompt.
The textbook 5-bug suite at systematic_debugging.jsonl saturates at 5/5
against capable reasoning models — every bug is the kind of thing the
agent solves on first try regardless of what the skill text says. That
zeroes out the behavioral signal closed-loop validation is supposed to
surface.
New suite at systematic_debugging_advanced.jsonl ships 5 bugs that
require structured debugging because the obvious patch only fixes the
failing test, leaving the spec's edge cases broken. Each exercises a
different cognitive failure mode:
- debug_generator_exhaustion: function iterates input twice; works
on lists, breaks on single-pass generators
- debug_shared_mutable_return: cached default returns a live ref;
caller mutation leaks into the cache
- debug_float_precision_equality: == on a computed sum; fix requires
math.isclose with documented tolerance
- debug_binary_search_boundary: bisect_right code masquerading as
bisect_left; middle insertion works, leftmost-of-equals fails
- debug_class_vs_instance_attribute: class-level mutable default;
instances share state instead of being independent
Sanity tests prove every planted bug is real (baseline test fails) and
every documented fix passes, plus a regression guard that task_ids
don't collide with the basic suite.
Smoke harness grows --suite {basic|advanced} and --agent-model flags so
both wiring sanity and headroom validation are driven by the same
script. Docs updated with the two-knob guidance (suite + agent model)
for recovering discrimination when the default Hermes model saturates.
…ion finding Quality-of-life knob plumbed through both evolve_skill and evolve_tool: override HermesAgentRunner's per-task wall-clock budget when the chosen agent model is slow. Reasoning models other than the smallest typically take 200-300s per debugging task; the default 120s causes them to abstain (timeout) silently rather than producing a verdict. Manual validation finding: both planted-bug suites (basic + advanced) saturate at 5/5 against capable reasoning models (gpt-5.4-mini on this setup), and the only confirmed-compatible weaker reasoning model (o3-mini) is slow enough to abstain most tasks at the default timeout. The framework's wiring is correct end-to-end, but Python-debugging-of- textbook-bugs may not discriminate skill-text variants on capable models regardless of the suite or model swap. Documented in docs/workflows.md alongside the suggestion to evaluate on surfaces where methodology matters more than recognition (multi-file refactoring, ambiguous-spec tasks, iterative hypothesis-testing scenarios). Smoke harness grows --task-timeout-seconds to match.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Direct follow-ups to the manual validation finding from #62: a capable reasoning model solved every textbook planted bug regardless of skill text, leaving the closed-loop verdict at 5/5 across the board and hiding the behavioral signal. Three coordinated changes to recover discrimination, plus an honest accounting of what they did and didn't achieve.
Commits:
--closed-loop-agent-modeloverride onHermesAgentRunner— when set, the validator's subprocess invocation becomeshermes -m MODEL -z ..., running the agent against a deliberately weaker model than the user's daily-driver Hermes default without mutating~/.hermes/config.yaml. Plumbed through bothevolve_skillandevolve_toolcache helpers + CLI flags. Default unset → existing behavior bit-for-bit preserved.systematic_debugging_advanced.jsonl— 5 harder planted bugs designed to discriminate skill-text variants on capable agents. Each exercises a different cognitive failure mode (iteration, state, precision, algorithm, OOP semantics) and breaks edge cases the obvious patch misses. Sanity tests verify every planted bug is real and every documented fix passes, plus a regression guard against task_id collisions with the basic suite.--closed-loop-task-timeout-secondsknob — bump the per-task wall-clock budget for slow reasoning models (o1/o3-family typically need 200-300s/task). Surfaced by validation: even when the override flag works mechanically, the default 120s budget abstains most tasks before they finish.Plus: the manual smoke harness at
tests/manual/skill_closed_loop_smoke.pygrows--suite {basic|advanced},--agent-model MODEL, and--task-timeout-seconds Nflags.docs/workflows.mddocuments all three knobs along with the empirical caveat below.Test coverage: 1029 passing (was 1010; +19 new — 8 advanced-suite sanity + 11 plumbing/CLI tests across both paths). All mock-only in CI.
Manual validation outcome
Honest accounting of what was tested with real
hermes -zsubprocesses against the user's actual setup (gpt-5.4-mini default):--agent-model o3-miniWhat this PR ships ≠ what it validates. The
--closed-loop-agent-modeland--closed-loop-task-timeout-secondsflags are plumbed correctly and the advanced suite is a genuinely harder regression check. But on this user's setup (capable reasoning model default + only o3-mini as a compatible weaker model + 120s default timeout), the closed-loop signal saturates either at 5/5 (capable model) or at abstention (slow model). No "improvement caught" or "regression caught" outcome was produced; only saturation.The honest interpretation. For this user, Python-debugging-of-textbook-bugs may not be a discriminating evaluation surface for the closed-loop signal regardless of which knob is turned. The flags are useful for future setups (a user with a faster weaker model would get real signal, and the timeout knob lets o3-mini-class models actually finish). The framework's central hypothesis (behavioral wins break judge ties on saturated baselines) remains untested end-to-end and likely needs a different evaluation domain to demonstrate — multi-file refactoring, ambiguous specs with edge cases, iterative hypothesis-testing tasks where methodology matters more than recognition.
Test plan
uv run pytest tests/ -q --ignore=tests/manual— 1029 passed--agent-model o3-mini— flag works; timeout exposes need for--closed-loop-task-timeout-secondsevolve_skillagainst a setup that produces non-saturated baseline — requires a different evaluation surface, out of scope for this PR