Merged
15 changes: 9 additions & 6 deletions README.md
@@ -19,9 +19,11 @@ flowchart LR
A[Read current<br/>skill/prompt/tool] --> B[Generate<br/>eval dataset]
B --> C[GEPA<br/>Optimizer]
C --> D[Candidate<br/>variants]
-D --> E[Evaluate]
-E -. Execution traces .-> C
-E --> F["Constraint gates<br/>(tests, size limits,<br/>benchmarks)"]
+D --> E1[Synthetic<br/>holdout]
+D --> E2[Closed-loop<br/>behavioral suite]
+E1 -. Execution traces .-> C
+E1 --> F["Dual-signal deploy gate<br/>(synthetic + closed-loop;<br/>CL-primary on synth-tie)"]
+E2 --> F
F --> G[Best<br/>variant]
G --> H[PR against<br/>source repo]
```
@@ -32,10 +34,11 @@ GEPA reads execution traces to understand *why* things fail (not just that they

GEPA was designed against benchmarks with hundreds of validation examples per task. Skill evolution typically has 20-60 examples, which is small enough that picking the highest-scoring candidate often picks one that won by chance — there's a real risk of shipping a "winner" that just got lucky on the eval set.

-This framework adds two checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:
+This framework adds three checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:

- **Held-out deploy check** — before a candidate ships, it's compared against the baseline on examples it never saw during optimization. Several rules available, including a lenient one that's appropriate for compression-style refactors.
- **Three-dimensional scoring** — instead of pass/fail, the LLM judge rates each output on correctness, whether it followed the right procedure, and how concise it is. GEPA's reflection step uses these as feedback to guide the next mutation.
+- **Closed-loop behavioral validation** — alongside the synthetic holdout, every candidate is exercised on a small behavioral task suite executed by a validator agent. The deploy gate consults both signals; when the synthetic signal is flat-within-tolerance (±0.05) but the behavioral signal demonstrably improves, the candidate ships via the closed-loop path. Documented end-to-end in [`reports/phase2_validation_report.pdf`](reports/phase2_validation_report.pdf).

If you have hundreds of validation examples and a programmatic correctness metric (exact match, unit-test pass), raw GEPA is the right tool. The framework's extra layers earn their keep when validation is small and the metric is LLM-judged. See [docs/framework_advantages.md](docs/framework_advantages.md) for the deeper argument.
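
The dual-signal rule described in the new README bullet could be sketched roughly as below. This is an illustrative sketch, not the repository's gate: `dual_signal_gate`, `GateDecision`, and the `decision_signal` values `"synthetic"`, `"closed_loop"`, and `"none"` are hypothetical names, and it assumes each signal has already been reduced to a single evolved-minus-baseline delta and that "demonstrably improves" means a positive closed-loop delta. Only the ±0.05 tolerance and the closed-loop path on a synthetic tie come from the README text.

```python
from dataclasses import dataclass

# The +/-0.05 band quoted in the README bullet.
SYNTH_TIE_TOLERANCE = 0.05


@dataclass(frozen=True)
class GateDecision:
    deploy: bool
    # Hypothetical labels for which signal drove the decision.
    decision_signal: str


def dual_signal_gate(
    synth_delta: float,
    cl_delta: float,
    tolerance: float = SYNTH_TIE_TOLERANCE,
) -> GateDecision:
    """Hypothetical dual-signal rule; an assumption, not the project's code.

    synth_delta: evolved minus baseline score on the synthetic holdout.
    cl_delta:    evolved minus baseline pass rate on the closed-loop suite.
    """
    # Assumed: a synthetic gain beyond the tolerance band is a clear win.
    if synth_delta > tolerance:
        return GateDecision(True, "synthetic")
    # Synthetic tie: the closed-loop signal becomes primary.
    if abs(synth_delta) <= tolerance and cl_delta > 0:
        return GateDecision(True, "closed_loop")
    # Synthetic regression, or a tie with no behavioral gain.
    return GateDecision(False, "none")
```

For example, `dual_signal_gate(0.02, 0.25)` deploys via the closed-loop path, while `dual_signal_gate(-0.20, 0.5)` does not deploy because the synthetic regression falls outside the tie band. The sketch does not cover a synthetic win paired with a behavioral regression; the README text does not say how that case is handled.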

@@ -326,8 +329,8 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json

| Phase | Target | Engine | Status |
|-------|--------|--------|--------|
-| **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ Implemented |
-| **Phase 2** | Tool descriptions | DSPy + GEPA | ✅ Implemented |
+| **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) |
+| **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) |
| **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned |
| **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned |
| **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned |
15 changes: 11 additions & 4 deletions docs/architecture.md
@@ -23,9 +23,11 @@ flowchart LR
F --> G[Static<br/>constraints]
G --> H{pass?}
H -- no --> I[Write evolved_FAILED.md<br/>+ gate_decision.json]
-H -- yes --> J[Holdout eval<br/>dspy.Evaluate × 1 evolved<br/>baseline reused from SAT]
+H -- yes --> J[Synthetic holdout<br/>dspy.Evaluate × 1 evolved<br/>baseline reused from SAT]
+H -- yes --> CL[Closed-loop behavioral suite<br/>validator agent on JSONL tasks]
J --> K[Paired bootstrap<br/>per-example deltas]
-K --> L[Growth-with-quality<br/>gate]
+K --> L[Dual-signal deploy gate<br/>synth + CL; decision_signal field<br/>CL-primary on synth-tie]
+CL --> L
L --> M{deploy?}
M -- no --> I
M -- yes --> N[Write evolved_skill.md<br/>+ metrics.json + gate_decision.json]
@@ -195,6 +197,7 @@ sequenceDiagram
participant Val as ConstraintValidator
participant Eval as dspy.Evaluate
participant Boot as paired_bootstrap
+participant CLV as ClosedLoopValidator

CLI->>Disc: find_skill("obsidian")
Disc-->>CLI: Path to SKILL.md
@@ -219,8 +222,12 @@ sequenceDiagram
Eval-->>CLI: avg_evolved, evolved_per_example
CLI->>Boot: paired_bootstrap(baseline_per_ex, evolved_per_ex)
Boot-->>CLI: {mean, lower_bound, upper_bound, ...}
-CLI->>Val: validate_growth_with_quality(evolved, baseline, bootstrap)
-Val-->>CLI: [growth_quality_gate, absolute_char_ceiling]
+opt closed-loop suite configured
+CLI->>CLV: validate(baseline, evolved, suite.jsonl)
+CLV-->>CLI: per-task pass/fail + aggregate deltas
+end
+CLI->>Val: validate_growth_with_quality(evolved, baseline, bootstrap, cl_report)
+Val-->>CLI: [growth_quality_gate, cl_aware_gate, decision_signal]
CLI->>CLI: write gate_decision.json + evolved_skill.md
```
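
The `paired_bootstrap(baseline_per_ex, evolved_per_ex)` step in the sequence diagram returns `{mean, lower_bound, upper_bound, ...}`. A minimal sketch of what such a step could compute is given below, assuming a plain percentile bootstrap over per-example deltas. The diff does not show the real implementation, so `n_resamples`, `confidence`, and `seed` are hypothetical parameters, and the function here is a stand-in that only borrows the name and the three result keys.

```python
import random
from statistics import mean


def paired_bootstrap(baseline, evolved, n_resamples=2000, confidence=0.90, seed=0):
    """Percentile bootstrap of the mean per-example delta (illustrative stand-in).

    baseline, evolved: per-example scores in the same example order.
    """
    if len(baseline) != len(evolved) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")

    # Pairing: each example contributes one evolved-minus-baseline delta.
    deltas = [e - b for b, e in zip(baseline, evolved)]
    rng = random.Random(seed)
    n = len(deltas)

    # Resample the deltas with replacement and record each resample's mean.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )

    # Read the interval off the sorted resample means.
    alpha = (1.0 - confidence) / 2.0
    lower_idx = int(alpha * (n_resamples - 1))
    upper_idx = int((1.0 - alpha) * (n_resamples - 1))
    return {
        "mean": mean(deltas),
        "lower_bound": resampled_means[lower_idx],
        "upper_bound": resampled_means[upper_idx],
    }
```

Resampling deltas rather than the two score lists separately keeps each example's baseline and evolved scores together, which is what makes the comparison paired. With 20 to 60 holdout examples the resulting interval is typically wide, so a gate built on `lower_bound` is more conservative than one built on `mean` alone.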
