Merged
2 changes: 1 addition & 1 deletion PLAN.md
@@ -372,7 +372,7 @@ The agent can self-invoke optimization:

1. **Benchmark gating (TBLite/YC-Bench) was not wired as a built-in.** The plan made "TBLite within 2%" the next-phase gate. The framework went framework-agnostic (any agent that emits `SKILL.md`), so the Hermes-specific benchmark suites no longer apply uniformly. The built-in deploy gate is a paired-bootstrap CI on a held-out split of the skill's own eval set (`evolution/core/stats.py` + `_check_growth_with_quality_gate`). For users who want a benchmark-style gate, `--benchmark-cmd "<shell command>"` runs an arbitrary user-provided command after the built-in gate passes; nonzero exit flips the decision to reject with `reason="benchmark_failed"` and a `benchmark` block in `gate_decision.json`. This restores the plan author's "benchmarks as gates" vision in a framework-agnostic way: TBLite users wire `--benchmark-cmd "tblite-runner --threshold 0.02"`, no built-in coupling.
2. **Eval datasets are larger than the planned 15–30 examples.** `eval_dataset_size=150` is the current default, sized so the bootstrap CI half-width can detect ±2% effects. Rationale and the supporting study live in `reports/calibration_findings.md`.
3. **Source D (skill-specific auto-evaluation) is now built for tool descriptions; skill-side auto-eval still deferred.** `evolution/validation/closed_loop` drives a real Hermes Agent via `hermes -z` against a JSONL task suite, scores actual tool-selection behavior with both expected and forbidden tools, compares baseline vs. evolved with a two-condition decision rule (no aggregate regression + no per-task loss unless wins offset 2:1). v1 ships a 5-task suite for `patch`. Skill-side equivalents (planted-bug for `systematic-debugging`, known-paper recall for `arxiv`, planted-issue for `github-code-review`) remain deferred — same harness shape, needs different `ArtifactInstaller` impls.
3. **Source D (skill-specific auto-evaluation) is built for tool descriptions and for `systematic-debugging`; the remaining skill-side suites are still deferred.** `evolution/validation/closed_loop` drives a real Hermes Agent via `hermes -z` against a JSONL task suite, scores actual tool-selection behavior with both expected and forbidden tools, and compares baseline vs. evolved with a two-condition decision rule (no aggregate regression + no per-task loss unless wins offset 2:1). v1 ships a 5-task suite for `patch` (tool-side). Skill-side now ships `systematic-debugging` via a new `SkillFileInstaller` plus a `test_command` verdict mode (exit code of a planted test replaces tool-call membership scoring); same harness, different artifact + scoring shape. The other skill-side equivalents (known-paper recall for `arxiv`, planted-issue for `github-code-review`) remain deferred: same harness, different verdict mechanisms (rubric-based for `arxiv`, planted-issue identification for `github-code-review`).
4. **The selection and gating layer is much thicker than the plan anticipated.** Knee-point Pareto selection, a budget-aware reflection-prompt proposer with growth/balanced/compression modes, paired-bootstrap CI, non-inferiority gate option, and the `gate_decision.json` v4 audit schema were added on top of raw GEPA. Rationale: raw "ship the best valset candidate" overfits at small N. See `docs/framework_advantages.md`.
5. **CLI shipped as `python -m evolution.skills.evolve_skill`, not a `hermes evolve skill` subcommand.** The framework rebranded out of Hermes-only branding so it could target any agent framework whose skills are `SKILL.md` files.
6. **`--run-tests` (target-repo pytest as a hard gate) was specified but never wired and has been removed.** Skill text doesn't touch target-repo Python, so coupling the deploy decision to the target repo's test suite would gate on unrelated signal. The CLI flag set a config field that no code read. The unused `EvolutionConfig.run_pytest`, `run_tblite`, `tblite_regression_threshold` fields and `ConstraintValidator.run_test_suite()` method were removed alongside the flag.
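
The `--benchmark-cmd` behavior described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the framework's code: the helper name `apply_benchmark_gate`, the `decision` key, and the fields inside the `benchmark` block (`cmd`, `exit_code`) are assumptions. Only `reason="benchmark_failed"` and the existence of a `benchmark` block come from the text above.

```python
import subprocess
import sys
from typing import Optional


def apply_benchmark_gate(decision: dict, benchmark_cmd: Optional[str]) -> dict:
    """Run a user-provided shell command after the built-in gate has passed.

    Hypothetical helper: the dict layout is an assumption of this sketch,
    not the real gate_decision.json schema.
    """
    # The benchmark only runs when the built-in gate already accepted.
    if benchmark_cmd is None or decision.get("decision") != "accept":
        return decision

    completed = subprocess.run(benchmark_cmd, shell=True)

    # Copy so the caller's decision object is not mutated.
    result = dict(decision)
    result["benchmark"] = {"cmd": benchmark_cmd, "exit_code": completed.returncode}
    if completed.returncode != 0:
        # A nonzero exit flips an accepted candidate to a rejection.
        result["decision"] = "reject"
        result["reason"] = "benchmark_failed"
    return result


# Usage with a stand-in command: the current interpreter exiting nonzero.
failing_cmd = f'"{sys.executable}" -c "raise SystemExit(3)"'
outcome = apply_benchmark_gate({"decision": "accept"}, failing_cmd)
```

Because the command is opaque to the gate, any benchmark runner that signals failure through its exit status can be plugged in without the framework knowing about it.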
11 changes: 11 additions & 0 deletions docs/workflows.md
@@ -479,6 +479,17 @@ Per-task scores are deterministic over candidate text (cache is keyed by `sha256

`--closed-loop-mode both` does both: trainset behavioral examples for acceptance, plus the `[CLOSED_LOOP]` feedback block on non-behavioral examples for reflection.

### Skill-path equivalent

`evolve_skill` exposes the same `--closed-loop-*` flags. Two differences from the tool path:

- **Verdict mechanism is `test_command`, not tool-call membership.** Skill-side suites set `"test_command": "python test_solution.py"` on each task; the validator runs that command in `fixture_dir` after the agent run and passes the task if and only if the exit code is zero. `expected_tools` / `forbidden_tools` aren't meaningful when the verdict is "did the agent debug correctly?". The decision rule (two-condition: aggregate no-regression + per-task wins offset losses 2:1) is unchanged.
- **`SkillFileInstaller` instead of in-place tool description splice.** The user's actual skill may live in a read-only plugin cache, so the installer copies the baseline skill directory into a writable workdir at construction and mutates the copy. `HermesAgentRunner._prime_sandbox` reads `TaskRunContext.skills_src` and copies that workdir's `skills/` into each per-task sandbox so `hermes -z` discovers the candidate. The user's source skill is never touched.
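
A minimal sketch of the `test_command` verdict from the first bullet, assuming the command is split with `shlex` and executed with `fixture_dir` as the working directory. The function name `run_test_command`, the `timeout_s` parameter, and the choice to count a timeout or a missing executable as a failure are assumptions of this sketch, not the validator's actual behavior.

```python
import shlex
import subprocess
from pathlib import Path


def run_test_command(
    test_command: str, fixture_dir: Path, timeout_s: float = 120.0
) -> bool:
    """Return True iff `test_command` exits with code zero inside `fixture_dir`."""
    try:
        completed = subprocess.run(
            shlex.split(test_command),
            cwd=fixture_dir,
            capture_output=True,  # keep test output out of the caller's stdout
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Assumption: a hang or an unlaunchable command counts as a failed task.
        return False
    return completed.returncode == 0
```

With this shape the verdict depends only on what the agent left behind in the fixture directory, which is why tool-call membership no longer enters the score.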

Default `--closed-loop-mode` is `feedback` (not `trainset`) on the skill side. Skill bodies mutate heavily, so the `gate_mode="always"` that trainset needs would fire the validator on every novel candidate — N tasks × 2 phases per fire. Opt into `trainset` / `both` explicitly when the cost is acceptable.

Reference suite: `evolution/validation/suites/systematic_debugging.jsonl` (5 planted-bug tasks). Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py`.
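
The two-condition decision rule that this section says is unchanged can be sketched as below. This is one reading of the description, under stated assumptions: per-task scores are floats, "aggregate no-regression" means the evolved mean is at least the baseline mean, and "wins offset losses 2:1" means at least two improved tasks for every regressed task. The function name `accept_evolved` is hypothetical.

```python
from statistics import fmean


def accept_evolved(baseline: dict[str, float], evolved: dict[str, float]) -> bool:
    """Two-condition rule over per-task scores keyed by task id (a sketch)."""
    if not baseline or baseline.keys() != evolved.keys():
        raise ValueError("baseline and evolved must cover the same non-empty task set")

    # Condition 1: the aggregate score must not regress.
    no_aggregate_regression = fmean(evolved.values()) >= fmean(baseline.values())

    # Condition 2: every per-task loss must be offset by two per-task wins.
    wins = sum(1 for task_id in baseline if evolved[task_id] > baseline[task_id])
    losses = sum(1 for task_id in baseline if evolved[task_id] < baseline[task_id])
    losses_offset = wins >= 2 * losses  # trivially true when there are no losses

    return no_aggregate_regression and losses_offset
```

With fractional scores the two conditions are independent: two marginal wins can offset one loss by count while the mean still drops, and the first condition catches that case.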

## Failure-mode summary

| Trigger | Outcome | Where to look |
25 changes: 13 additions & 12 deletions evolution/core/behavioral_example.py
@@ -7,10 +7,13 @@
and get accepted.

The example carries a ``closed_loop_task_id`` marker the metric routes on,
plus a placeholder ``task`` value (the suite's ``user_message``) that
``ToolModule.forward`` skips past on the behavioral branch. ``task`` and
``closed_loop_task_id`` are both marked as input keys so DSPy passes them
to ``forward()`` via ``program(**example.inputs())``.
plus a placeholder task-input value (the suite's ``user_message``) that the
module's ``forward()`` skips past on the behavioral branch. Both fields are
marked as input keys so DSPy passes them via ``program(**example.inputs())``.

``task_field`` parameterizes the input field name to match the host module's
forward signature: ``ToolModule.forward(task=...)`` uses ``"task"`` (the
default); ``SkillModule.forward(task_input=...)`` passes ``"task_input"``.
"""

from __future__ import annotations
@@ -20,17 +23,15 @@
from evolution.validation.task import TaskSuite


def build_behavioral_examples(suite: TaskSuite) -> list[dspy.Example]:
"""One example per task in ``suite``, stable order by ``task_id``.

The placeholder ``task`` value carries the original ``user_message`` for
debuggability; it isn't consumed by the behavioral metric branch.
"""
def build_behavioral_examples(
suite: TaskSuite, *, task_field: str = "task"
) -> list[dspy.Example]:
"""One example per task in ``suite``, stable order by ``task_id``."""
examples = [
dspy.Example(
task=task.user_message,
**{task_field: task.user_message},
closed_loop_task_id=task.task_id,
).with_inputs("task", "closed_loop_task_id")
).with_inputs(task_field, "closed_loop_task_id")
for task in sorted(suite.tasks, key=lambda t: t.task_id)
]
return examples
62 changes: 49 additions & 13 deletions evolution/core/closed_loop_feedback.py
@@ -27,7 +27,7 @@
import tempfile
import threading
from pathlib import Path
from typing import Literal, Optional
from typing import Callable, Literal, Optional

from evolution.validation.report import TaskResult, ValidationReport
from evolution.validation.task import TaskSuite
@@ -42,6 +42,15 @@

GateMode = Literal["sampled", "always"]

ArtifactWriter = Callable[[str, Path], None]
"""Write candidate text to a path in the format the installer consumes.

The cache calls this with ``(baseline_or_candidate_text, target_path)``
before each validator run. The default writes a single-tool MCP manifest
JSON (the tool-side shape); skill-side passes a writer that drops raw
text directly into the path.
"""

logger = logging.getLogger(__name__)


@@ -67,12 +76,14 @@ def __init__(
*,
validator: ClosedLoopValidator,
suite: TaskSuite,
tool_name: str,
baseline_description: str,
artifact_name: str,
baseline_artifact_text: str,
saturation_threshold: float = 0.95,
min_iters: int = 3,
window_size: int = 8,
gate_mode: GateMode = "sampled",
artifact_writer: Optional[ArtifactWriter] = None,
artifact_suffix: str = ".json",
) -> None:
if not (0.0 <= saturation_threshold <= 1.0):
raise ValueError(
@@ -88,19 +99,23 @@ def __init__(
)
self._validator = validator
self._suite = suite
self._tool_name = tool_name
self._artifact_name = artifact_name
self.saturation_threshold = saturation_threshold
self.min_iters = min_iters
self.window_size = window_size
self.gate_mode = gate_mode

self._tmp_dir = Path(tempfile.mkdtemp(prefix="cl_feedback_"))
self._baseline_path = self._tmp_dir / "baseline.json"
self._evolved_path = self._tmp_dir / "evolved.json"
self._baseline_path.write_text(
_manifest_json(tool_name, baseline_description)
self._artifact_writer: ArtifactWriter = (
artifact_writer
if artifact_writer is not None
else _make_default_tool_writer(artifact_name)
)

self._tmp_dir = Path(tempfile.mkdtemp(prefix="cl_feedback_"))
self._baseline_path = self._tmp_dir / f"baseline{artifact_suffix}"
self._evolved_path = self._tmp_dir / f"evolved{artifact_suffix}"
self._artifact_writer(baseline_artifact_text, self._baseline_path)

self._cache: dict[str, ValidationReport] = {}
self._judge_history: list[float] = []
self._iters_since_last_run = self.min_iters # allow first fire
@@ -147,11 +162,9 @@ def get_or_run(self, candidate_text: str) -> Optional[ValidationReport]:
if not self.should_run():
return None
try:
self._evolved_path.write_text(
_manifest_json(self._tool_name, candidate_text)
)
self._artifact_writer(candidate_text, self._evolved_path)
inputs = ValidationInputs(
tool_name=self._tool_name,
tool_name=self._artifact_name,
suite=self._suite,
baseline_artifact=self._baseline_path,
evolved_artifact=self._evolved_path,
@@ -293,3 +306,26 @@ def _manifest_json(tool_name: str, description: str) -> str:
},
indent=2,
)


def _make_default_tool_writer(tool_name: str) -> ArtifactWriter:
"""Default ``artifact_writer`` for tool-side closed-loop.

Writes a single-tool MCP manifest JSON — the shape
``HermesToolDescriptionInstaller._extract_description`` consumes when
``artifact_source.suffix == ".json"``.
"""

def write(candidate_text: str, path: Path) -> None:
path.write_text(_manifest_json(tool_name, candidate_text))

return write


def write_text_artifact(candidate_text: str, path: Path) -> None:
"""``artifact_writer`` for skill-side closed-loop: drop raw text into the path.

The skill installer reads the whole file as the candidate SKILL.md
body, so no envelope is needed.
"""
path.write_text(candidate_text)
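
The writer injection in this file can be illustrated with a self-contained sketch. It does not import the real module: `make_json_writer` is a stand-in for the tool-side default with a hypothetical JSON layout (the actual manifest shape is not fully visible in this diff), `stage_artifacts` is a hypothetical helper mirroring what the constructor and `get_or_run` do with the two paths, and the `.md` suffix for the skill side is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

ArtifactWriter = Callable[[str, Path], None]


def make_json_writer(name: str) -> ArtifactWriter:
    """Stand-in for the tool-side default; this JSON layout is hypothetical."""

    def write(candidate_text: str, path: Path) -> None:
        payload = {"name": name, "description": candidate_text}
        path.write_text(json.dumps(payload, indent=2))

    return write


def write_text_artifact(candidate_text: str, path: Path) -> None:
    """Skill-side writer: the raw text is the whole artifact."""
    path.write_text(candidate_text)


def stage_artifacts(
    baseline_text: str,
    candidate_text: str,
    writer: ArtifactWriter,
    suffix: str,
) -> tuple[Path, Path]:
    """Write baseline and candidate through the same writer (hypothetical helper)."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="cl_feedback_"))
    baseline_path = tmp_dir / f"baseline{suffix}"
    evolved_path = tmp_dir / f"evolved{suffix}"
    writer(baseline_text, baseline_path)
    writer(candidate_text, evolved_path)
    return baseline_path, evolved_path


# Tool side: JSON envelope. Skill side: raw text (suffix assumed to be ".md").
tool_paths = stage_artifacts("old desc", "new desc", make_json_writer("patch"), ".json")
skill_paths = stage_artifacts("# old", "# new", write_text_artifact, ".md")
```

Passing the writer as a plain callable keeps the cache unaware of artifact formats; only the writer and the installer need to agree on what is on disk.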