Commit b9cedca

jonathanpeterwu, StackMemory Bot (CLI), and gitbutler-client authored
chore: root reorg + GEPA skill optimization (#12)
* feat(conductor): GitButler virtual branch mode for workspace management

* fix(conductor): state filter + labels flatten for issue dispatch

* fix(linear): flatten labels in getIssues response

* feat(cross-search): multi-database frame search across projects (STA-480)

  - Add CrossProjectSearch engine with FTS5/BM25 ranking across N databases
  - Project registry (~/.stackmemory/projects.json) with CRUD + auto-discovery
  - Read-only SQLite connections for safety, LIKE fallback for non-FTS databases
  - 4 MCP tools: sm_cross_search, sm_cross_discover, sm_cross_register, sm_cross_list
  - CLI: `stackmemory search --all-projects "query"` for cross-project search
  - 17 tests: registry CRUD, multi-db FTS5 search, ranking, LIKE fallback, graceful skip

* feat(shared-state): add canonical instance coordination

* feat: add deterministic harness smoke tooling

* docs: add design principles architecture note

* chore: update gepa baselines and clean GitButler hooks

* fix(conductor): harden lane mode cleanup

* chore: reorganize root for clarity

  Consolidate duplicate docs, relocate wandering files, and tighten .gitignore for agent scratch dirs.

  - Move SPEC.md, RELEASE_NOTES.md, tomorrow.md, vision.md to docs/ (replacing stale docs/ copies with the up-to-date root versions)
  - Move mcp_review_config.json to config/
  - Untrack .lint-fix-log.json (ephemeral lint artifact)
  - Delete stale .tsbuildinfo-* and .lint-errors.log
  - Ignore agent scratch dirs (.ralph/, .swarm/, .bjarne/, .entire/, .opencode/, .git.backup/) and local trees (archive/, site/, voyager/, plugins/)
  - Update README.md Vision link to docs/vision.md

* fix(test): mock canonicalStateStore in session tests

  Session tests mocked fs/promises but not the canonical-store module. The canonicalStateStore singleton inherited the mocked fs, causing pathExists to return true while readFile returned undefined, crashing JSON.parse. Mock the entire canonical-store module with stubs for upsertSession, appendEvent, and endSession.

* chore: handoff checkpoint on chore/root-reorg

* feat(gepa): phase-level prompt optimization with auto-targeting

  Split conductor prompt-template.md into 5 phase files (system, understand, implement, validate, deliver). GEPA now auto-targets the worst-performing phase from outcomes.jsonl instead of mutating the entire template as a monolith.

  - Phase-aware prompt building in orchestrator with DSPy bridge
  - Assertion-based retry injects phase-specific error guidance
  - promptVersions hash map in AgentOutcomeEntry for attribution
  - Stop hook fires GEPA session accumulator (auto-optimize at threshold)
  - after-run.sh triggers GEPA + DSPy (every 50 runs) automatically
  - Gold sets mined from 71 outcomes across 4 phases
  - eval-phases.js harness validates mutations before applying
  - npm run gepa:eval / gepa:mine scripts

* chore: handoff checkpoint on chore/root-reorg

* feat(gepa): skill .md optimization with audit hook

  Add GEPA support for optimizing Claude Code slash command .md files:

  - skill-audit.js hook logs Skill tool calls to skill-audit.jsonl
  - 5 skill targets in config (start, stop, learn, next, summary)
  - skill-tasks.jsonl with 8 eval tasks for skill quality
  - skill-stats and run-skills CLI commands
  - getSkillAuditContext() feeds usage data into mutation prompts

* chore(gepa): update baseline generations with current CLAUDE.md

* fix(gepa): judge CLI fallback + filter phase variants for skill targets

* feat(gepa): elitism, crossover, ASI feedback, eval cache, expanded evals

  - Add API key validation at startup (fail fast before burning budget)
  - Fix callJudge() to log errors, use config timeout (120s vs 30s)
  - Add ASI feedback field to judge schema (CoT + actionable suggestions)
  - Persist judge feedback to results/feedback-{gen}.json
  - Inject ASI feedback into mutation prompts via getRecentFeedback()
  - Add extractCodeBlocks() for regex judge (focus on code, not prose)
  - Add 10 new regex criterion patterns (shows_branch, concise_output, etc)
  - Support custom regex from eval task definitions
  - Add elitism tiebreaker (prefer baseline/incumbent on score ties)
  - Add crossover operator (recombine sections from two parent variants)
  - Add eval response cache (record/replay for deterministic baselines)
  - Expand skill eval tasks from 8 to 30 with adversarial cases
  - Add held-out eval partition (train/test split for Goodhart detection)
  - Increase population 4→8, add crossoverCount=2, judge timeout 120s

---------

Co-authored-by: StackMemory Bot (CLI) <bot@stackmemory.ai>
Co-authored-by: GitButler <gitbutler@gitbutler.com>
1 parent 58807d9 commit b9cedca

20 files changed

Lines changed: 2115 additions & 222 deletions

.claude/settings.json

Lines changed: 10 additions & 0 deletions
@@ -52,6 +52,16 @@
             "command": "entire hooks claude-code stop"
           }
         ]
+      },
+      {
+        "matcher": "",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "node scripts/gepa/hooks/gepa-session-hook.js",
+            "async": true
+          }
+        ]
       }
     ],
     "PreToolUse": [

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ scripts/gepa/results/scores.jsonl
 scripts/gepa/state.json
 scripts/gepa/results/
 scripts/gepa/generations/
+scripts/gepa/cache/
 
 # Agent tool working dirs (untracked, per-tool scratch)
 .ralph/
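
The ignored `scripts/gepa/cache/` directory lines up with the "eval response cache (record/replay for deterministic baselines)" item in the commit message. The cache code itself is not part of the rendered diff, so the following is only a minimal sketch of the record/replay idea. The names `makeEvalCache`, `variantId`, and `taskId` are hypothetical, the store is an in-memory `Map` rather than files on disk, and `compute` is synchronous for brevity where a real model call would be awaited.

```javascript
// Hypothetical sketch of a record/replay eval cache (not the repository's code).
// Responses live in an in-memory Map; a real cache would persist them to disk.
function makeEvalCache(store = new Map()) {
  // The key covers everything assumed to determine the response.
  const keyOf = (variantId, taskId) => JSON.stringify([variantId, taskId]);

  return {
    // Return the recorded response if present; otherwise compute and record it.
    get(variantId, taskId, compute) {
      const key = keyOf(variantId, taskId);
      if (store.has(key)) {
        return { value: store.get(key), replayed: true };
      }
      const value = compute();
      store.set(key, value);
      return { value, replayed: false };
    },
  };
}

// Usage: the second lookup replays the recorded response without recomputing.
const cache = makeEvalCache();
let computeCalls = 0;
const compute = () => {
  computeCalls += 1;
  return 'model response';
};

const first = cache.get('baseline', 'task-1', compute);
const second = cache.get('baseline', 'task-1', compute);
console.log(first.replayed, second.replayed, computeCalls); // false true 1
```

Replaying recorded responses keeps the baseline score stable between runs, so a score change can be attributed to the mutated prompt and not to sampling noise.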

CLAUDE.md

Lines changed: 6 additions & 0 deletions
@@ -258,12 +258,18 @@ For AUTOMATE and STANDARD tiers: make only the requested changes. Don't refactor
 - Prioritizes: unfinished work > flagged issues > queued tasks > continuations
 - Trigger: session start, "what's next", "whats next", between tasks
 
+**`/learn`** — Run at session end to capture learnings:
+- Reviews session work, then audits memory, CLAUDE.md, skills, scripts, and wiki
+- Proposes creates/updates/deletes with confirmation before applying
+- Trigger: end of session, after significant work, "what should I update"
+
 **When to use which:**
 - Starting a session or between tasks → `/next` (pick what to work on)
 - Session producing wrong results → `/recover` (diagnose + fix now)
 - Routine maintenance, nothing broken → `/update-docs` (proactive gardening)
 - After publishing a new version → `/update-docs` (catch version/path drift)
 - After conductor failures → `/recover last` (learn from agent traces)
+- End of session → `/learn` (capture what changed, update artifacts)
 
 ## Workflow
 

docs/prds/substrate-enterprise-brain.md

Lines changed: 632 additions & 0 deletions
Large diffs are not rendered by default.

package.json

Lines changed: 3 additions & 0 deletions
@@ -141,6 +141,9 @@
     "sync:start": "node scripts/background-sync-manager.js",
     "sync:setup": "./scripts/setup-background-sync.sh",
     "eval:cord": "npx tsx scripts/evals/cord-vs-flat-eval.ts",
+    "gepa:eval": "node scripts/gepa/eval-phases.js",
+    "gepa:eval:json": "node scripts/gepa/eval-phases.js --json",
+    "gepa:mine": "node scripts/gepa/gold/mine-traces.js",
     "prepare": "echo 'Prepare step completed'",
     "verify:dist": "node scripts/verify-dist.cjs",
     "test:smoke-db": "bash scripts/smoke-init-db.sh",

scripts/conductor/after-run.sh

Lines changed: 23 additions & 3 deletions
@@ -1,22 +1,42 @@
 #!/usr/bin/env bash
 # Conductor after_run hook
-# Captures context from the agent run and tags it with the issue identifier
-# Called after each agent attempt (success or failure)
+# 1. Captures context from the agent run
+# 2. Triggers GEPA session hook (accumulates toward auto-optimization)
+# 3. Triggers DSPy optimization every 50 runs
 #
 # Environment: SYMPHONY_WORKSPACE_DIR, SYMPHONY_ISSUE_ID, SYMPHONY_ISSUE_IDENTIFIER
 set -euo pipefail
 
 WORKSPACE="${SYMPHONY_WORKSPACE_DIR:-$(pwd)}"
 ISSUE_ID="${SYMPHONY_ISSUE_IDENTIFIER:-${SYMPHONY_ISSUE_ID:-unknown}}"
 ATTEMPT="${SYMPHONY_ATTEMPT:-1}"
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
 
 cd "$WORKSPACE"
 
-# Capture context from this run, tagged with issue ID and attempt number
+# 1. Capture context from this run, tagged with issue ID and attempt number
 stackmemory conductor capture \
   --issue "$ISSUE_ID" \
   --workspace "$WORKSPACE" \
   --attempt "$ATTEMPT" \
   2>/dev/null || true
 
 echo "[conductor] Context captured for $ISSUE_ID (attempt $ATTEMPT)"
+
+# 2. Trigger GEPA session hook (accumulates sessions, auto-optimizes at threshold)
+GEPA_HOOK="$PROJECT_ROOT/scripts/gepa/hooks/gepa-session-hook.js"
+if [ -f "$GEPA_HOOK" ]; then
+  node "$GEPA_HOOK" 2>/dev/null &
+fi
+
+# 3. Trigger DSPy optimization every 50 agent runs
+OUTCOMES_PATH="$HOME/.stackmemory/conductor/outcomes.jsonl"
+DSPY_OPTIMIZE="$PROJECT_ROOT/scripts/dspy/optimize.py"
+if [ -f "$OUTCOMES_PATH" ] && [ -f "$DSPY_OPTIMIZE" ]; then
+  OUTCOMES_COUNT=$(wc -l < "$OUTCOMES_PATH" 2>/dev/null || echo 0)
+  if [ $((OUTCOMES_COUNT % 50)) -eq 0 ] && [ "$OUTCOMES_COUNT" -gt 0 ]; then
+    echo "[conductor] Triggering DSPy optimization (${OUTCOMES_COUNT} runs)"
+    nohup python3 "$DSPY_OPTIMIZE" --quiet >/dev/null 2>&1 &
+  fi
+fi
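
Step 3 of the script fires only when the line count of `outcomes.jsonl` is an exact positive multiple of 50 at the moment the hook runs. The sketch below restates that check in JavaScript, next to a variant that fires when a multiple was crossed since the previous observation. The variant is an assumption for comparison, not something the script does. All function names are hypothetical, and `countOutcomes` counts non-empty lines, whereas `wc -l` counts newline characters.

```javascript
// Count recorded outcomes as non-empty lines of JSONL text.
function countOutcomes(jsonlText) {
  return jsonlText.split('\n').filter((line) => line.trim() !== '').length;
}

// Mirrors the shell check: fire only when the count is a positive multiple.
function isExactMultiple(count, interval = 50) {
  return count > 0 && count % interval === 0;
}

// Alternative (an assumption, not in the script): fire whenever a multiple
// of `interval` was crossed since the previously observed count.
function crossedInterval(previousCount, count, interval = 50) {
  return Math.floor(count / interval) > Math.floor(previousCount / interval);
}

const sample = '{"issue":"A"}\n{"issue":"B"}\n';
console.log(countOutcomes(sample)); // 2
console.log(isExactMultiple(50), isExactMultiple(51)); // true false
console.log(crossedInterval(49, 51)); // true
```

The exact-multiple check needs no extra state, which suits a fire-and-forget hook. It may skip a trigger if the count moves from 49 to 51 between two hook invocations, and it may fire twice if two invocations both see 50. The crossing variant avoids both cases but has to store the previously observed count somewhere.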

scripts/gepa/config.json

Lines changed: 57 additions & 3 deletions
@@ -32,11 +32,43 @@
       "file": "CLAUDE.md",
       "evals": ["stackmemory-tasks.jsonl"],
       "description": "StackMemory project prompt"
+    },
+    {
+      "name": "skill:start",
+      "file": "~/.claude/commands/start.md",
+      "evals": ["skill-tasks.jsonl"],
+      "description": "Session boot skill"
+    },
+    {
+      "name": "skill:stop",
+      "file": "~/.claude/commands/stop.md",
+      "evals": ["skill-tasks.jsonl"],
+      "description": "Session close skill"
+    },
+    {
+      "name": "skill:learn",
+      "file": "~/.claude/commands/learn.md",
+      "evals": ["skill-tasks.jsonl"],
+      "description": "Session review + artifact update skill"
+    },
+    {
+      "name": "skill:next",
+      "file": "~/.claude/commands/next.md",
+      "evals": ["skill-tasks.jsonl"],
+      "description": "Next action recommendation skill"
+    },
+    {
+      "name": "skill:summary",
+      "file": "~/.claude/commands/summary.md",
+      "evals": ["skill-tasks.jsonl"],
+      "description": "Session summary skill"
     }
   ],
 
   "evolution": {
-    "populationSize": 4,
+    "populationSize": 8,
+    "crossoverCount": 2,
+    "elitism": true,
     "generations": 10,
     "selectionRate": 0.5,
     "selfReview": true,
@@ -58,8 +90,9 @@
 
   "evals": {
     "directory": "./evals",
-    "minSamplesPerVariant": 8,
+    "minSamplesPerVariant": 25,
     "timeout": 120000,
+    "heldOutPartition": true,
     "metrics": [
       "task_completion",
       "code_quality",
@@ -73,7 +106,8 @@
   "judge": {
     "model": "claude-haiku-4-5-20251001",
     "maxOutputTokens": 2000,
-    "timeoutMs": 30000
+    "timeoutMs": 120000,
+    "feedbackEnabled": true
   },
 
   "mutation": {
@@ -144,6 +178,26 @@
       "evals": {
         "files": ["conductor-tasks.jsonl"]
       }
+    },
+    "skills": {
+      "target": {
+        "file": "~/.claude/commands/start.md",
+        "scope": "user",
+        "backup": true
+      },
+      "evolution": {
+        "mutationStrategies": [
+          "simplify",
+          "add_examples",
+          "rephrase",
+          "add_constraints",
+          "reduce_overengineering",
+          "add_self_check"
+        ]
+      },
+      "evals": {
+        "files": ["skill-tasks.jsonl"]
+      }
     }
   }
 }
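
The new `elitism` and `crossoverCount` settings correspond to two commit-message items: an elitism tiebreaker that prefers the baseline/incumbent on score ties, and a crossover operator that recombines sections from two parent variants. The optimizer code is not shown in this diff, so the sketch below is only one plausible reading. The `incumbent` flag, the function names, and the choice to alternate level-2 markdown sections between parents are all assumptions.

```javascript
// Keep the top fraction of variants; on equal scores the incumbent sorts first.
function selectSurvivors(population, selectionRate) {
  const keep = Math.max(1, Math.floor(population.length * selectionRate));
  return [...population]
    .sort((a, b) => {
      if (b.score !== a.score) return b.score - a.score;
      return (b.incumbent ? 1 : 0) - (a.incumbent ? 1 : 0);
    })
    .slice(0, keep);
}

// Split markdown before each line that starts a level-2 heading.
function splitSections(markdown) {
  return markdown.split(/\n(?=## )/);
}

// Build a child by taking sections alternately from parent A and parent B.
// If one parent has fewer sections, fall back to the other parent's section.
function crossover(parentA, parentB) {
  const a = splitSections(parentA);
  const b = splitSections(parentB);
  const child = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const primary = i % 2 === 0 ? a : b;
    const fallback = i % 2 === 0 ? b : a;
    child.push(primary[i] ?? fallback[i]);
  }
  return child.join('\n');
}

const population = [
  { id: 'v1', score: 0.8 },
  { id: 'baseline', score: 0.8, incumbent: true },
  { id: 'v2', score: 0.6 },
  { id: 'v3', score: 0.9 },
];
const survivors = selectSurvivors(population, 0.5).map((v) => v.id);
console.log(survivors); // [ 'v3', 'baseline' ]

const child = crossover(
  '## Goal\nA goal\n## Steps\nA steps',
  '## Goal\nB goal\n## Steps\nB steps'
);
console.log(child.split('\n')); // [ '## Goal', 'A goal', '## Steps', 'B steps' ]
```

Preferring the incumbent on ties means a mutated variant replaces the current prompt only when it scores strictly higher, which limits churn from judge noise.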

scripts/gepa/eval-phases.js

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
+#!/usr/bin/env node
+/**
+ * Phase-level eval harness for GEPA.
+ *
+ * Evaluates conductor prompt phase files against gold sets.
+ * Scores each phase independently. Used by GEPA auto-optimization
+ * to validate mutations before applying.
+ *
+ * Usage:
+ *   node eval-phases.js                    # eval all phases
+ *   node eval-phases.js --phase validate   # eval single phase
+ *   node eval-phases.js --json             # JSON output for CI
+ */
+
+import fs from 'fs';
+import path from 'path';
+import { fileURLToPath } from 'url';
+import { homedir } from 'os';
+
+const __dirname = path.dirname(fileURLToPath(import.meta.url));
+const GOLD_DIR = path.join(__dirname, 'gold');
+const PROMPTS_DIR = path.join(
+  homedir(),
+  '.stackmemory',
+  'conductor',
+  'prompts'
+);
+
+const PHASES = ['understand', 'implement', 'validate', 'deliver'];
+
+// Parse args
+const phaseIdx = process.argv.indexOf('--phase');
+const targetPhase = phaseIdx !== -1 ? process.argv[phaseIdx + 1] : null;
+const jsonOutput = process.argv.includes('--json');
+
+/**
+ * Load gold set for a phase
+ */
+function loadGoldSet(phase) {
+  const goldPath = path.join(GOLD_DIR, `${phase}.jsonl`);
+  if (!fs.existsSync(goldPath)) return [];
+  return fs
+    .readFileSync(goldPath, 'utf-8')
+    .split('\n')
+    .filter(Boolean)
+    .map((l) => JSON.parse(l));
+}
+
+/**
+ * Score a phase prompt against its gold set using heuristic evaluation.
+ * This is a fast, offline eval (no LLM calls) based on outcome patterns.
+ *
+ * For LLM-judge evaluation, use the full GEPA optimize.js eval pipeline.
+ */
+function evalPhase(phase) {
+  const goldSet = loadGoldSet(phase);
+  if (goldSet.length === 0) {
+    return { phase, score: 0, total: 0, passed: 0, skipped: true };
+  }
+
+  const promptPath = path.join(PROMPTS_DIR, `${phase}.md`);
+  if (!fs.existsSync(promptPath)) {
+    return { phase, score: 0, total: goldSet.length, passed: 0, missing: true };
+  }
+
+  const prompt = fs.readFileSync(promptPath, 'utf-8');
+  let passed = 0;
+  const failures = [];
+
+  for (const entry of goldSet) {
+    const expected = entry.expected;
+    if (!expected) continue;
+
+    // Heuristic: check if the prompt addresses the failure patterns
+    let entryPassed = true;
+
+    switch (phase) {
+      case 'understand': {
+        // Check if prompt guides complexity assessment
+        if (expected.complexity === 'careful' && !prompt.includes('plan')) {
+          entryPassed = false;
+        }
+        break;
+      }
+
+      case 'implement': {
+        // Check if prompt constrains scope
+        if (!expected.scopeKept && !prompt.includes('scope')) {
+          entryPassed = false;
+        }
+        // Check ESM import guidance
+        if (
+          entry.errorTail &&
+          /import|ESM/i.test(entry.errorTail) &&
+          !prompt.includes('.js')
+        ) {
+          entryPassed = false;
+        }
+        break;
+      }
+
+      case 'validate': {
+        // Check if prompt covers the specific failure type
+        if (expected.retryStrategy === 'fix_lint' && !prompt.includes('lint')) {
+          entryPassed = false;
+        }
+        if (expected.retryStrategy === 'fix_test' && !prompt.includes('test')) {
+          entryPassed = false;
+        }
+        if (
+          expected.retryStrategy === 'fix_build' &&
+          !prompt.includes('build')
+        ) {
+          entryPassed = false;
+        }
+        // Check --no-verify prevention
+        if (!prompt.includes('no-verify') && !prompt.includes('--no-verify')) {
+          entryPassed = false;
+        }
+        break;
+      }
+
+      case 'deliver': {
+        // Check commit format guidance
+        if (!prompt.includes('type(scope)') && !prompt.includes('commit')) {
+          entryPassed = false;
+        }
+        break;
+      }
+    }
+
+    if (entryPassed) {
+      passed++;
+    } else {
+      failures.push({
+        issue: entry.issue,
+        outcome: entry.outcome,
+        reason: `Prompt missing guidance for: ${JSON.stringify(expected)}`,
+      });
+    }
+  }
+
+  return {
+    phase,
+    score: goldSet.length > 0 ? passed / goldSet.length : 0,
+    total: goldSet.length,
+    passed,
+    failures: failures.slice(0, 5), // top 5 failures
+  };
+}
+
+// Main
+const phases = targetPhase ? [targetPhase] : PHASES;
+const results = phases.map(evalPhase);
+
+if (jsonOutput) {
+  console.log(JSON.stringify(results, null, 2));
+} else {
+  console.log('GEPA Phase Evaluation');
+  console.log('═'.repeat(50));
+
+  let totalScore = 0;
+  let totalPhases = 0;
+
+  for (const r of results) {
+    if (r.skipped) {
+      console.log(` ${r.phase.padEnd(12)} — no gold set`);
+      continue;
+    }
+    if (r.missing) {
+      console.log(` ${r.phase.padEnd(12)} — prompt file missing`);
+      continue;
+    }
+
+    const pct = (r.score * 100).toFixed(1);
+    const bar = '█'.repeat(Math.round(r.score * 20)).padEnd(20, '░');
+    const status = r.score >= 0.7 ? '✓' : r.score >= 0.4 ? '~' : '✗';
+    console.log(
+      ` ${status} ${r.phase.padEnd(12)} ${bar} ${pct}% (${r.passed}/${r.total})`
+    );
+
+    if (r.failures && r.failures.length > 0) {
+      for (const f of r.failures.slice(0, 3)) {
+        console.log(` └ ${f.issue}: ${f.reason.slice(0, 80)}`);
+      }
+    }
+
+    totalScore += r.score;
+    totalPhases++;
+  }
+
+  if (totalPhases > 0) {
+    const avg = ((totalScore / totalPhases) * 100).toFixed(1);
+    console.log('─'.repeat(50));
+    console.log(` Average: ${avg}%`);
+  }
+}
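
The `--json` mode prints an array of per-phase results with `phase`, `score`, `total`, `passed`, and either `failures` or a `skipped`/`missing` flag. The sketch below shows how a consumer might use that array to pick the weakest phase and to list phases under the harness's 0.7 pass mark. The commit message says GEPA auto-targets from `outcomes.jsonl`, so feeding eval results into targeting is an assumption here, and `pickWorstPhase` and `failingPhases` are hypothetical names. The sample scores are made up for illustration.

```javascript
// Ignore phases that could not be scored (no gold set or no prompt file).
function scoredOnly(results) {
  return results.filter((r) => !r.skipped && !r.missing && r.total > 0);
}

// Return the name of the lowest-scoring phase, or null if none were scored.
function pickWorstPhase(results) {
  const scored = scoredOnly(results);
  if (scored.length === 0) return null;
  return scored.reduce((worst, r) => (r.score < worst.score ? r : worst)).phase;
}

// List phases below a threshold; 0.7 matches the harness's pass mark.
function failingPhases(results, threshold = 0.7) {
  return scoredOnly(results)
    .filter((r) => r.score < threshold)
    .map((r) => r.phase);
}

// Illustrative data in the shape that evalPhase returns.
const sampleResults = [
  { phase: 'understand', score: 0.9, total: 10, passed: 9, failures: [] },
  { phase: 'implement', score: 0.5, total: 20, passed: 10, failures: [] },
  { phase: 'validate', score: 0.25, total: 8, passed: 2, failures: [] },
  { phase: 'deliver', score: 0, total: 0, passed: 0, skipped: true },
];

console.log(pickWorstPhase(sampleResults)); // validate
console.log(failingPhases(sampleResults)); // [ 'implement', 'validate' ]
```

When reading these scores, note that `evalPhase` skips gold entries that have no `expected` field but still divides by `goldSet.length`, so such entries lower a phase's score without being checked.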
