jramos · May 26, 2026
diff --git a/README.md b/README.md
@@ -19,9 +19,11 @@ flowchart LR
     A[Read current<br/>skill/prompt/tool] --> B[Generate<br/>eval dataset]
     B --> C[GEPA<br/>Optimizer]
     C --> D[Candidate<br/>variants]
-    D --> E[Evaluate]
-    E -. Execution traces .-> C
-    E --> F["Constraint gates<br/>(tests, size limits,<br/>benchmarks)"]
+    D --> E1[Synthetic<br/>holdout]
+    D --> E2[Closed-loop<br/>behavioral suite]
+    E1 -. Execution traces .-> C
+    E1 --> F["Dual-signal deploy gate<br/>(synthetic + closed-loop;<br/>CL-primary on synth-tie)"]
+    E2 --> F
     F --> G[Best<br/>variant]
     G --> H[PR against<br/>source repo]
 ```
@@ -32,10 +34,11 @@ GEPA reads execution traces to understand *why* things fail (not just that they
 
 GEPA was designed against benchmarks with hundreds of validation examples per task. Skill evolution typically has 20-60 examples, which is small enough that picking the highest-scoring candidate often picks one that won by chance — there's a real risk of shipping a "winner" that just got lucky on the eval set.
 
-This framework adds two checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:
+This framework adds three checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:
 
 - **Held-out deploy check** — before a candidate ships, it's compared against the baseline on examples it never saw during optimization. Several rules available, including a lenient one that's appropriate for compression-style refactors.
 - **Three-dimensional scoring** — instead of pass/fail, the LLM judge rates each output on correctness, whether it followed the right procedure, and how concise it is. GEPA's reflection step uses these as feedback to guide the next mutation.
+- **Closed-loop behavioral validation** — alongside the synthetic holdout, a validator agent runs every candidate through a small behavioral task suite. The deploy gate consults both signals: when the synthetic signal is flat within tolerance (±0.05) but the behavioral signal demonstrably improves, the candidate ships via the closed-loop path. The full procedure is documented in [`reports/phase2_validation_report.pdf`](reports/phase2_validation_report.pdf).
 
 If you have hundreds of validation examples and a programmatic correctness metric (exact match, unit-test pass), raw GEPA is the right tool. The framework's extra layers earn their keep when validation is small and the metric is LLM-judged. See [docs/framework_advantages.md](docs/framework_advantages.md) for the deeper argument.
 
@@ -326,8 +329,8 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json
 
 | Phase | Target | Engine | Status |
 |-------|--------|--------|--------|
-| **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ Implemented |
-| **Phase 2** | Tool descriptions | DSPy + GEPA | ✅ Implemented |
+| **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) |
+| **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) |
 | **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned |
 | **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned |
 | **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned |
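
Reviewer note: the dual-signal rule added to the README above can be sketched as follows. This is a minimal illustration, not the code in this PR. `GateDecision`, `dual_signal_gate`, and the label strings are hypothetical names, and both deltas are assumed to be evolved-minus-baseline scores. The only value taken from the README text is the ±0.05 tolerance. The README does not say what happens when the synthetic signal improves but the closed-loop signal regresses; the sketch assumes the synthetic path still ships.

```python
from dataclasses import dataclass

SYNTH_TOLERANCE = 0.05  # the ±0.05 "flat within tolerance" band from the README text


@dataclass(frozen=True)
class GateDecision:
    deploy: bool
    decision_signal: str  # hypothetical labels: "synthetic", "closed_loop", "none"


def dual_signal_gate(synth_delta: float, cl_delta: float,
                     tolerance: float = SYNTH_TOLERANCE) -> GateDecision:
    """Hypothetical gate over evolved-minus-baseline deltas."""
    # Clear synthetic improvement: ship on the synthetic signal (assumption).
    if synth_delta > tolerance:
        return GateDecision(True, "synthetic")
    # Synthetic tie: the closed-loop signal becomes primary.
    if abs(synth_delta) <= tolerance and cl_delta > 0:
        return GateDecision(True, "closed_loop")
    # Synthetic regression, or a tie with no behavioral gain: do not ship.
    return GateDecision(False, "none")


if __name__ == "__main__":
    print(dual_signal_gate(0.02, 0.20))
```

Returning the deciding signal alongside the boolean mirrors the `decision_signal` field mentioned in the architecture diagram, so a reader of `gate_decision.json` could tell which path a candidate shipped through.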

diff --git a/docs/architecture.md b/docs/architecture.md
@@ -23,9 +23,11 @@ flowchart LR
     F --> G[Static<br/>constraints]
     G --> H{pass?}
     H -- no --> I[Write evolved_FAILED.md<br/>+ gate_decision.json]
-    H -- yes --> J[Holdout eval<br/>dspy.Evaluate × 1 evolved<br/>baseline reused from SAT]
+    H -- yes --> J[Synthetic holdout<br/>dspy.Evaluate × 1 evolved<br/>baseline reused from SAT]
+    H -- yes --> CL[Closed-loop behavioral suite<br/>validator agent on JSONL tasks]
     J --> K[Paired bootstrap<br/>per-example deltas]
-    K --> L[Growth-with-quality<br/>gate]
+    K --> L["Dual-signal deploy gate<br/>synth + CL; decision_signal field<br/>CL-primary on synth-tie"]
+    CL --> L
     L --> M{deploy?}
     M -- no --> I
     M -- yes --> N[Write evolved_skill.md<br/>+ metrics.json + gate_decision.json]
@@ -195,6 +197,7 @@ sequenceDiagram
     participant Val as ConstraintValidator
     participant Eval as dspy.Evaluate
     participant Boot as paired_bootstrap
+    participant CLV as ClosedLoopValidator
 
     CLI->>Disc: find_skill("obsidian")
     Disc-->>CLI: Path to SKILL.md
@@ -219,8 +222,12 @@ sequenceDiagram
     Eval-->>CLI: avg_evolved, evolved_per_example
     CLI->>Boot: paired_bootstrap(baseline_per_ex, evolved_per_ex)
     Boot-->>CLI: {mean, lower_bound, upper_bound, ...}
-    CLI->>Val: validate_growth_with_quality(evolved, baseline, bootstrap)
-    Val-->>CLI: [growth_quality_gate, absolute_char_ceiling]
+    opt closed-loop suite configured
+        CLI->>CLV: validate(baseline, evolved, suite.jsonl)
+        CLV-->>CLI: per-task pass/fail + aggregate deltas
+    end
+    CLI->>Val: validate_growth_with_quality(evolved, baseline, bootstrap, cl_report)
+    Val-->>CLI: [growth_quality_gate, cl_aware_gate, decision_signal]
     CLI->>CLI: write gate_decision.json + evolved_skill.md
 ```
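
Reviewer note: for readers unfamiliar with the `paired_bootstrap` step in the sequence diagram, the idea can be sketched as below. This is an assumption-laden illustration rather than the repository's implementation. `paired_bootstrap_sketch`, its parameters, and the percentile-interval method are hypothetical choices; only the input shape (per-example baseline and evolved scores) and the `mean`, `lower_bound`, and `upper_bound` keys come from the diagram.

```python
import random
from statistics import mean


def paired_bootstrap_sketch(baseline, evolved, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over per-example deltas (evolved minus baseline)."""
    if len(baseline) != len(evolved) or not baseline:
        raise ValueError("score lists must be non-empty and of equal length")
    # Pairing: each delta compares the two variants on the same example.
    deltas = [e - b for b, e in zip(baseline, evolved)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return {
        "mean": mean(deltas),
        "lower_bound": means[lo_idx],
        "upper_bound": means[hi_idx],
    }


if __name__ == "__main__":
    print(paired_bootstrap_sketch([0.5, 0.6, 0.4], [0.6, 0.8, 0.7]))
```

With 20-60 holdout examples the interval will usually be wide, which is one plausible reason for pairing the synthetic signal with the closed-loop suite instead of relying on the bootstrap alone.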