feat(recce): stdio MCP transport + Option E eval improvements #18
- Switch .mcp.json from SSE to stdio transport via run-mcp-stdio.sh wrapper (venv auto-detection, no external server to manage)
- Remove IMPACT_RULE from SessionStart hook; guidance now embedded in impact_analysis tool description + response (Option E)
- Simplify session-start.sh: remove SSE server lifecycle management, replace with MCP_READY readiness check
- Add agent constraint: prohibit Python/curl bypass of MCP tools

Eval validation (ch3-phantom-filter, n=3):
bare+Option E v2 delta: +2.7 (vs clean-profile +3.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
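For readers unfamiliar with the two transports, the sketch below contrasts the shape of an SSE entry and a stdio entry in `.mcp.json`. It is only an illustration, not the file from this PR: the server name `recce`, the SSE URL, and the wrapper invocation via `bash` are assumptions.

```python
import json


def sse_entry(url):
    """Entry for a server that is already running and reachable over HTTP/SSE."""
    return {"type": "sse", "url": url}


def stdio_entry(command, args=()):
    """Entry for a server the client spawns itself and talks to over stdin/stdout."""
    return {"command": command, "args": list(args)}


# Hypothetical values: the real server name, URL, and script path may differ.
before = {"mcpServers": {"recce": sse_entry("http://localhost:8000/sse")}}
after = {"mcpServers": {"recce": stdio_entry("bash", ["run-mcp-stdio.sh"])}}

print(json.dumps(after, indent=2))
```

With the stdio form there is no URL or port in the config; the client owns the server process for the lifetime of the session.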
Speed up iterative eval by allowing:
- --skip-setup: reuse pre-applied patch + dbt state (no re-setup)
- --skip-teardown: keep state for subsequent parallel runs
- --model: choose model (e.g., sonnet) for faster/cheaper runs

Enables: setup once → parallel claude sessions → teardown once.
Reduces n=3 batch time from ~30-60min to ~10-15min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
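The "setup once → parallel sessions → teardown once" flow could be driven roughly as below. This is a dry-run sketch under assumptions: the case name is assumed to be the first positional argument of `run-case.sh`, and `setup`, `teardown`, and `launch` are hypothetical callables standing in for the real steps.

```python
from concurrent.futures import ThreadPoolExecutor


def run_args(case, model, script="./run-case.sh"):
    # Assumption: the case name is the first positional argument.
    # The three flags are the ones added by this commit.
    return [script, case, "--skip-setup", "--skip-teardown", "--model", model]


def run_batch(case, model, n, setup, teardown, launch):
    """Run setup once, launch n sessions in parallel, then tear down once."""
    setup()
    try:
        with ThreadPoolExecutor(max_workers=n) as pool:
            results = list(pool.map(lambda _: launch(run_args(case, model)), range(n)))
    finally:
        # Teardown runs even if one of the sessions raises.
        teardown()
    return results


# Dry run: record the lifecycle calls and echo the command lines instead of executing them.
log = []
results = run_batch(
    "ch3-phantom-filter", "sonnet", 3,
    setup=lambda: log.append("setup"),
    teardown=lambda: log.append("teardown"),
    launch=lambda args: " ".join(args),
)
print(log)
print(results[0])
```

In a real batch, `launch` would call something like `subprocess.run(args, check=True)`; keeping it injectable makes the ordering easy to check without touching dbt state.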
Eval Results: Full Suite (7 scenarios, Sonnet, bare mode)
Batch ID:
Results
Total: 13/14 runs successful, $5.07 total cost
Key Findings
Configuration
Companion PR
…a patterns

Agent was hallucinating data quality issues (e.g., "310 completed orders with $0") because the prompt asked "any data quality issues?" without defining what constitutes an actual pipeline bug vs inherent data patterns.

Changes:
- Prompt now explicitly asks for PIPELINE BUGS, not general data patterns
- Lists known expected patterns (placed orders, partial payments, etc.)
- Scoring keywords expanded: added "corrupted", "data loss", "regression"
- Judge criteria updated: evaluate bug vs pattern distinction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
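As an illustration of how keyword-based scoring can separate a bug report from a remark about expected data patterns, here is a minimal sketch. It is not the eval's actual scorer: the case-insensitive substring match and the function name `keyword_hits` are assumptions, and only the three keywords added in this commit are listed.

```python
# Only the keywords added by this commit; the full scoring list is not shown here.
KEYWORDS = ("corrupted", "data loss", "regression")


def keyword_hits(response, keywords=KEYWORDS):
    """Return the keywords found in the response (case-insensitive substring match)."""
    text = response.lower()
    return [kw for kw in keywords if kw in text]


bug_report = "The new filter causes DATA LOSS in orders, a regression from the base branch."
pattern_note = "There are 310 completed orders with $0, which is expected for this dataset."

print(keyword_hits(bug_report))
print(keyword_hits(pattern_note))
```

Substring matching is crude (it ignores negation such as "no regression"), which is presumably why the judge criteria were updated alongside the keyword list.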
Eval Results v3: Full Suite with Proper Isolation
Batch ID:
Results
Total cost: $7.14 | 14 runs | 0 failures
Key Findings
Fixes in This Batch
Summary
- .mcp.json now uses stdio instead of SSE. Claude Code spawns the MCP server as a child process: no external server lifecycle to manage, no DuckDB lock conflicts, no port coordination.
- Removed IMPACT_RULE from SessionStart hook. Guidance now lives in the impact_analysis tool description + response metadata (recce PR #1233). Eval-validated: bare-mode plugin delta +2.7 (comparable to hook-based +3.0).
- Added --skip-setup, --skip-teardown, --model flags for run-case.sh. Enables parallel runs + model selection, reducing batch time from ~30-60min to ~10-15min.

Why stdio over SSE?
dbt run
curl the SSE endpoint directly
Eval validation data
ch3-phantom-filter (n=3, bare mode + Option E v2):
Companion change: recce#1233 (tool description + selector narrowing in mcp_server.py)
Test plan
- --skip-setup + --model sonnet: single run completes correctly
- /recce-review works with stdio transport

🤖 Generated with Claude Code