jramos · May 17, 2026
diff --git a/PLAN.md b/PLAN.md
@@ -372,7 +372,7 @@ The agent can self-invoke optimization:
 
 1. **Benchmark gating (TBLite/YC-Bench) was not wired as a built-in.** The plan made "TBLite within 2%" the next-phase gate. The framework went framework-agnostic (any agent that emits `SKILL.md`), so the Hermes-specific benchmark suites no longer apply uniformly. The built-in deploy gate is a paired-bootstrap CI on a held-out split of the skill's own eval set (`evolution/core/stats.py` + `_check_growth_with_quality_gate`). For users who want a benchmark-style gate, `--benchmark-cmd "<shell command>"` runs an arbitrary user-provided command after the built-in gate passes; nonzero exit flips the decision to reject with `reason="benchmark_failed"` and a `benchmark` block in `gate_decision.json`. This restores the plan author's "benchmarks as gates" vision in a framework-agnostic way: TBLite users wire `--benchmark-cmd "tblite-runner --threshold 0.02"`, no built-in coupling.
 2. **Eval datasets are larger than the planned 15–30 examples.** `eval_dataset_size=150` is the current default, sized so the bootstrap CI half-width can detect ±2% effects. Rationale and the supporting study live in `reports/calibration_findings.md`.
-3. **Source D (skill-specific auto-evaluation) is now built for tool descriptions; skill-side auto-eval still deferred.** `evolution/validation/closed_loop` drives a real Hermes Agent via `hermes -z` against a JSONL task suite, scores actual tool-selection behavior with both expected and forbidden tools, compares baseline vs. evolved with a two-condition decision rule (no aggregate regression + no per-task loss unless wins offset 2:1). v1 ships a 5-task suite for `patch`. Skill-side equivalents (planted-bug for `systematic-debugging`, known-paper recall for `arxiv`, planted-issue for `github-code-review`) remain deferred — same harness shape, needs different `ArtifactInstaller` impls.
+3. **Source D (skill-specific auto-evaluation) is built for tool descriptions and for `systematic-debugging`; the remaining skill-side suites are still deferred.** `evolution/validation/closed_loop` drives a real Hermes Agent via `hermes -z` against a JSONL task suite, scores actual tool-selection behavior with both expected and forbidden tools, and compares baseline vs. evolved with a two-condition decision rule (no aggregate regression + no per-task loss unless wins offset 2:1). v1 ships a 5-task suite for `patch` (tool-side). The skill side now ships `systematic-debugging` via a new `SkillFileInstaller` plus a `test_command` verdict mode (the exit code of a planted test replaces tool-call membership scoring): same harness, different artifact and scoring shape. The other skill-side equivalents (known-paper recall for `arxiv`, planted-issue for `github-code-review`) are still deferred: same harness, different verdict mechanisms (rubric-based for `arxiv`, planted-issue identification for `github-code-review`).
 4. **The selection and gating layer is much thicker than the plan anticipated.** Knee-point Pareto selection, a budget-aware reflection-prompt proposer with growth/balanced/compression modes, paired-bootstrap CI, non-inferiority gate option, and the `gate_decision.json` v4 audit schema were added on top of raw GEPA. Rationale: raw "ship the best valset candidate" overfits at small N. See `docs/framework_advantages.md`.
 5. **CLI shipped as `python -m evolution.skills.evolve_skill`, not a `hermes evolve skill` subcommand.** The framework rebranded out of Hermes-only branding so it could target any agent framework whose skills are `SKILL.md` files.
 6. **`--run-tests` (target-repo pytest as a hard gate) was specified but never wired and has been removed.** Skill text doesn't touch target-repo Python, so coupling the deploy decision to the target repo's test suite would gate on unrelated signal. The CLI flag set a config field that no code read. The unused `EvolutionConfig.run_pytest`, `run_tblite`, `tblite_regression_threshold` fields and `ConstraintValidator.run_test_suite()` method were removed alongside the flag.

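Reviewer note: the two-condition decision rule quoted in item 3 can be read as the sketch below. This is an interpretation, not the code in `evolution/validation/closed_loop`: `TaskScores` and `accept_candidate` are hypothetical names, per-task scores are assumed to be floats, and "wins offset 2:1" is assumed to mean at least two per-task wins for every per-task loss.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScores:
    """Hypothetical per-task record; not the repository's TaskResult."""

    task_id: str
    baseline: float
    evolved: float


def accept_candidate(scores: list[TaskScores], offset_ratio: float = 2.0) -> bool:
    """Return True only when both conditions of the rule hold."""
    if not scores:
        raise ValueError("need at least one task to compare")

    # Condition 1: no aggregate regression across the suite.
    baseline_total = sum(s.baseline for s in scores)
    evolved_total = sum(s.evolved for s in scores)
    if evolved_total < baseline_total:
        return False

    # Condition 2: per-task losses are tolerated only when wins offset
    # them at the assumed 2:1 ratio. With zero losses this always holds.
    wins = sum(1 for s in scores if s.evolved > s.baseline)
    losses = sum(1 for s in scores if s.evolved < s.baseline)
    return wins >= offset_ratio * losses
```

Under this reading, a candidate that trades one task for another is rejected even though the aggregate is unchanged, and two small wins cannot hide one large loss because the aggregate check fails first.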
diff --git a/docs/workflows.md b/docs/workflows.md
@@ -479,6 +479,17 @@ Per-task scores are deterministic over candidate text (cache is keyed by `sha256
 
 `--closed-loop-mode both` does both: trainset behavioral examples for acceptance, plus the `[CLOSED_LOOP]` feedback block on non-behavioral examples for reflection.
 
+### Skill-path equivalent
+
+`evolve_skill` exposes the same `--closed-loop-*` flags. Two differences from the tool path:
+
+- **Verdict mechanism is `test_command`, not tool-call membership.** Skill-side suites set `"test_command": "python test_solution.py"` on each task; the validator runs that command in `fixture_dir` after the agent finishes and passes the task iff the exit code is zero. `expected_tools` / `forbidden_tools` aren't meaningful for verdicts of the form "did the agent debug correctly". The decision rule (two-condition: aggregate no-regression + per-task wins offset losses 2:1) is unchanged.
+- **`SkillFileInstaller` instead of in-place tool description splice.** The user's actual skill may live in a read-only plugin cache, so the installer copies the baseline skill directory into a writable workdir at construction and mutates the copy. `HermesAgentRunner._prime_sandbox` reads `TaskRunContext.skills_src` and copies that workdir's `skills/` into each per-task sandbox so `hermes -z` discovers the candidate. The user's source skill is never touched.
+
+The default `--closed-loop-mode` on the skill side is `feedback` (not `trainset`). Skill bodies mutate heavily, so the `gate_mode="always"` that `trainset` needs would fire the validator on every novel candidate, at a cost of N tasks × 2 phases per fire. Opt into `trainset` / `both` explicitly when the cost is acceptable.
+
+Reference suite: `evolution/validation/suites/systematic_debugging.jsonl` (5 planted-bug tasks). Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py`.
+
 ## Failure-mode summary
 
 | Trigger | Outcome | Where to look |

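Reviewer note: the `test_command` verdict described in the new docs section amounts to "run the command in `fixture_dir`, pass iff exit code is zero". A minimal sketch of that check follows. The helper name `run_test_command`, the timeout, and the choice to split the command with `shlex` instead of going through a shell are assumptions for illustration; the validator in this PR may handle these differently.

```python
from __future__ import annotations

import shlex
import subprocess
from pathlib import Path


def run_test_command(
    test_command: str, fixture_dir: Path, timeout_s: float = 120.0
) -> bool:
    """Run the task's test command in its fixture dir; pass iff exit code is 0."""
    try:
        completed = subprocess.run(
            shlex.split(test_command),  # assumption: no shell features needed
            cwd=fixture_dir,
            capture_output=True,
            timeout=timeout_s,
            check=False,  # a nonzero exit is a verdict, not an error
        )
    except (subprocess.TimeoutExpired, OSError):
        # Assumption: a hung or unlaunchable test counts as a failed task.
        return False
    return completed.returncode == 0
```

Treating timeouts and launch failures as a failed verdict keeps the result binary, which is what the unchanged two-condition decision rule consumes.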
diff --git a/evolution/core/behavioral_example.py b/evolution/core/behavioral_example.py
@@ -7,10 +7,13 @@
 and get accepted.
 
 The example carries a ``closed_loop_task_id`` marker the metric routes on,
-plus a placeholder ``task`` value (the suite's ``user_message``) that
-``ToolModule.forward`` skips past on the behavioral branch. ``task`` and
-``closed_loop_task_id`` are both marked as input keys so DSPy passes them
-to ``forward()`` via ``program(**example.inputs())``.
+plus a placeholder task-input value (the suite's ``user_message``) that the
+module's ``forward()`` skips past on the behavioral branch. Both fields are
+marked as input keys so DSPy passes them via ``program(**example.inputs())``.
+
+``task_field`` parameterizes the input field name to match the host module's
+forward signature: ``ToolModule.forward(task=...)`` uses ``"task"`` (the
+default); for ``SkillModule.forward(task_input=...)``, pass ``"task_input"``.
 """
 
 from __future__ import annotations
@@ -20,17 +23,15 @@
 from evolution.validation.task import TaskSuite
 
 
-def build_behavioral_examples(suite: TaskSuite) -> list[dspy.Example]:
-    """One example per task in ``suite``, stable order by ``task_id``.
-
-    The placeholder ``task`` value carries the original ``user_message`` for
-    debuggability; it isn't consumed by the behavioral metric branch.
-    """
+def build_behavioral_examples(
+    suite: TaskSuite, *, task_field: str = "task"
+) -> list[dspy.Example]:
+    """One example per task in ``suite``, stable order by ``task_id``."""
     examples = [
         dspy.Example(
-            task=task.user_message,
+            **{task_field: task.user_message},
             closed_loop_task_id=task.task_id,
-        ).with_inputs("task", "closed_loop_task_id")
+        ).with_inputs(task_field, "closed_loop_task_id")
         for task in sorted(suite.tasks, key=lambda t: t.task_id)
     ]
     return examples
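
Reviewer note: the `task_field` change relies on building keyword arguments under a dynamic name so the same examples fit either `forward()` signature. The dependency-free sketch below illustrates that mechanism with plain dicts standing in for `dspy.Example`. `Task`, `build_example_kwargs`, `tool_forward`, and `skill_forward` are hypothetical stand-ins, not code from this repository.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a suite task."""

    task_id: str
    user_message: str


def build_example_kwargs(tasks: list[Task], *, task_field: str = "task") -> list[dict]:
    """One kwargs dict per task, stable order by task_id."""
    return [
        {task_field: t.user_message, "closed_loop_task_id": t.task_id}
        for t in sorted(tasks, key=lambda t: t.task_id)
    ]


def tool_forward(task: str, closed_loop_task_id: str | None = None) -> str:
    """Stand-in for a tool-side forward(task=...)."""
    return f"tool:{closed_loop_task_id}"


def skill_forward(task_input: str, closed_loop_task_id: str | None = None) -> str:
    """Stand-in for a skill-side forward(task_input=...)."""
    return f"skill:{closed_loop_task_id}"


tasks = [Task("t2", "fix the bug"), Task("t1", "find the bug")]
tool_calls = [tool_forward(**kw) for kw in build_example_kwargs(tasks)]
skill_calls = [
    skill_forward(**kw) for kw in build_example_kwargs(tasks, task_field="task_input")
]
```

Passing the default `"task"` kwargs to `skill_forward` would raise a `TypeError` for the unexpected keyword, which is the mismatch the new parameter avoids.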
diff --git a/evolution/core/closed_loop_feedback.py b/evolution/core/closed_loop_feedback.py
@@ -27,7 +27,7 @@
 import tempfile
 import threading
 from pathlib import Path
-from typing import Literal, Optional
+from typing import Callable, Literal, Optional
 
 from evolution.validation.report import TaskResult, ValidationReport
 from evolution.validation.task import TaskSuite
@@ -42,6 +42,15 @@
 
 GateMode = Literal["sampled", "always"]
 
+ArtifactWriter = Callable[[str, Path], None]
+"""Write candidate text to a path in the format the installer consumes.
+
+The cache calls this with ``(text, target_path)``: once for the baseline
+at construction, then for the candidate before each validator run. The
+default writes a single-tool MCP manifest JSON (the tool-side shape);
+skill-side passes a writer that drops raw text directly into the path.
+"""
+
 logger = logging.getLogger(__name__)
 
 
@@ -67,12 +76,14 @@ def __init__(
         *,
         validator: ClosedLoopValidator,
         suite: TaskSuite,
-        tool_name: str,
-        baseline_description: str,
+        artifact_name: str,
+        baseline_artifact_text: str,
         saturation_threshold: float = 0.95,
         min_iters: int = 3,
         window_size: int = 8,
         gate_mode: GateMode = "sampled",
+        artifact_writer: Optional[ArtifactWriter] = None,
+        artifact_suffix: str = ".json",
     ) -> None:
         if not (0.0 <= saturation_threshold <= 1.0):
             raise ValueError(
@@ -88,19 +99,23 @@ def __init__(
             )
         self._validator = validator
         self._suite = suite
-        self._tool_name = tool_name
+        self._artifact_name = artifact_name
         self.saturation_threshold = saturation_threshold
         self.min_iters = min_iters
         self.window_size = window_size
         self.gate_mode = gate_mode
 
-        self._tmp_dir = Path(tempfile.mkdtemp(prefix="cl_feedback_"))
-        self._baseline_path = self._tmp_dir / "baseline.json"
-        self._evolved_path = self._tmp_dir / "evolved.json"
-        self._baseline_path.write_text(
-            _manifest_json(tool_name, baseline_description)
+        self._artifact_writer: ArtifactWriter = (
+            artifact_writer
+            if artifact_writer is not None
+            else _make_default_tool_writer(artifact_name)
         )
 
+        self._tmp_dir = Path(tempfile.mkdtemp(prefix="cl_feedback_"))
+        self._baseline_path = self._tmp_dir / f"baseline{artifact_suffix}"
+        self._evolved_path = self._tmp_dir / f"evolved{artifact_suffix}"
+        self._artifact_writer(baseline_artifact_text, self._baseline_path)
+
         self._cache: dict[str, ValidationReport] = {}
         self._judge_history: list[float] = []
         self._iters_since_last_run = self.min_iters  # allow first fire
@@ -147,11 +162,9 @@ def get_or_run(self, candidate_text: str) -> Optional[ValidationReport]:
             if not self.should_run():
                 return None
             try:
-                self._evolved_path.write_text(
-                    _manifest_json(self._tool_name, candidate_text)
-                )
+                self._artifact_writer(candidate_text, self._evolved_path)
                 inputs = ValidationInputs(
-                    tool_name=self._tool_name,
+                    tool_name=self._artifact_name,
                     suite=self._suite,
                     baseline_artifact=self._baseline_path,
                     evolved_artifact=self._evolved_path,
@@ -293,3 +306,26 @@ def _manifest_json(tool_name: str, description: str) -> str:
         },
         indent=2,
     )
+
+
+def _make_default_tool_writer(tool_name: str) -> ArtifactWriter:
+    """Default ``artifact_writer`` for tool-side closed-loop.
+
+    Writes a single-tool MCP manifest JSON — the shape
+    ``HermesToolDescriptionInstaller._extract_description`` consumes when
+    ``artifact_source.suffix == ".json"``.
+    """
+
+    def write(candidate_text: str, path: Path) -> None:
+        path.write_text(_manifest_json(tool_name, candidate_text))
+
+    return write
+
+
+def write_text_artifact(candidate_text: str, path: Path) -> None:
+    """``artifact_writer`` for skill-side closed-loop: drop raw text into the path.
+
+    The skill installer reads the whole file as the candidate SKILL.md
+    body, so no envelope is needed.
+    """
+    path.write_text(candidate_text)
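
Reviewer note: the `ArtifactWriter` seam is a plain `Callable[[str, Path], None]`, so the cache can stage baseline and candidate files without knowing the artifact format. The sketch below shows how two writers plug into the same staging step. `make_json_envelope_writer` and `stage_pair` are hypothetical, and the JSON envelope is a placeholder: the real manifest layout produced by `_manifest_json` is only partly visible in this diff.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable

ArtifactWriter = Callable[[str, Path], None]


def write_text_artifact(candidate_text: str, path: Path) -> None:
    """Skill-side writer: the file body is the candidate text itself."""
    path.write_text(candidate_text)


def make_json_envelope_writer(name: str) -> ArtifactWriter:
    """Hypothetical tool-side writer; the envelope shape is illustrative only."""

    def write(candidate_text: str, path: Path) -> None:
        payload = {"name": name, "description": candidate_text}
        path.write_text(json.dumps(payload, indent=2))

    return write


def stage_pair(
    tmp_dir: Path,
    writer: ArtifactWriter,
    baseline_text: str,
    candidate_text: str,
    suffix: str,
) -> tuple[Path, Path]:
    """Write baseline and evolved artifacts side by side, as the cache does."""
    baseline_path = tmp_dir / f"baseline{suffix}"
    evolved_path = tmp_dir / f"evolved{suffix}"
    writer(baseline_text, baseline_path)
    writer(candidate_text, evolved_path)
    return baseline_path, evolved_path
```

A skill-side caller would pair `write_text_artifact` with a non-JSON suffix such as `.md`, while the tool side keeps the `.json` default.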