feat(recce): stdio MCP transport + Option E eval improvements by iamcxa · Pull Request #18 · DataRecce/recce-claude-plugin

iamcxa · 2026-03-24T08:49:46Z

Summary

SSE → stdio MCP transport: Plugin .mcp.json now uses stdio instead of SSE. Claude Code spawns the MCP server as a child process — no external server lifecycle to manage, no DuckDB lock conflicts, no port coordination.
Option E — tool-embedded guidance: Removed IMPACT_RULE from SessionStart hook. Guidance now lives in the impact_analysis tool description + response metadata (recce PR #1233). Eval-validated: bare-mode plugin delta +2.7 (comparable to hook-based +3.0).
Agent MCP bypass prohibition: Reviewer agent now explicitly forbidden from using Python/curl to interact with SSE endpoints directly.
Eval acceleration: --skip-setup, --skip-teardown, --model flags for run-case.sh. Enables parallel runs + model selection, reducing batch time from ~30-60min to ~10-15min.

Why stdio over SSE?

	SSE (before)	stdio (after)
Server lifecycle	Plugin hook starts/stops SSE server	Claude Code spawns on demand
DuckDB locks	SSE holds persistent read connection → blocks `dbt run`	No persistent connection — spawned per session
Port conflicts	Hardcoded 8081, multi-project collisions	No ports — stdin/stdout
Agent bypass risk	Agent can `curl` the SSE endpoint directly	No HTTP endpoint to bypass

Eval validation data

ch3-phantom-filter (n=3, bare mode + Option E v2):

Run	Score
1	12/12
2	12/12
3	12/12

Companion change: recce#1233 (tool description + selector narrowing in mcp_server.py)

Test plan

Eval validation: bare-mode n=3, ch3-phantom-filter → 12/12 (2 of 3 valid runs perfect)
Parallel stdio MCP: 3 concurrent sessions, zero DuckDB lock failures
--skip-setup + --model sonnet: single run completes correctly
Manual: install plugin locally, verify /recce-review works with stdio transport

🤖 Generated with Claude Code

- Switch .mcp.json from SSE to stdio transport via run-mcp-stdio.sh wrapper (venv auto-detection, no external server to manage) - Remove IMPACT_RULE from SessionStart hook — guidance now embedded in impact_analysis tool description + response (Option E) - Simplify session-start.sh: remove SSE server lifecycle management, replace with MCP_READY readiness check - Add agent constraint: prohibit Python/curl bypass of MCP tools Eval validation (ch3-phantom-filter, n=3): bare+Option E v2 delta: +2.7 (vs clean-profile +3.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Speed up iterative eval by allowing: - --skip-setup: reuse pre-applied patch + dbt state (no re-setup) - --skip-teardown: keep state for subsequent parallel runs - --model: choose model (e.g., sonnet) for faster/cheaper runs Enables: setup once → parallel claude sessions → teardown once. Reduces n=3 batch time from ~30-60min to ~10-15min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

iamcxa · 2026-03-24T09:44:10Z

Eval Results — Full Suite (7 scenarios, Sonnet, bare mode)

Batch ID: 20260324-1656 | Model: Sonnet | Isolation: --bare | Adapter: DuckDB

Results

Scenario	Ch	Baseline	With-Plugin	Delta	Cost
ch1-healthy-audit	1	0/4	2/4	+2	$0.74
ch1-null-amounts	1	9/9	9/9	0	$0.84
ch2-amount-misscale	2	11/12	11/12	0	$0.67
ch2-silent-filter	2	11/12	11/12	0	$0.79
ch3-count-distinct	3	12/12	❌ patch fail	—	$0.41
ch3-join-shift	3	9/12	9/12	0	$0.76
ch3-phantom-filter	3	8/12	8/12	0	$0.86

Total: 13/14 runs successful, $5.07 total cost

Key Findings

Option E tool-embedded guidance works in bare mode — with-plugin runs have access to impact_analysis tool with embedded IMPORTANT call-ordering block + _guidance response metadata.
ch3-count-distinct with-plugin failed — patch apply error (previous baseline's orders_daily_summary.sql teardown left modified state). Not an Option E issue — infrastructure bug in sequential batch runner.
ch1-healthy-audit shows plugin value — baseline 0/4 vs with-plugin 2/4. The plugin's impact_analysis correctly reports no data changes, helping the agent produce a cleaner "no issues" report.
ch3-phantom-filter — baseline 8/12 = with-plugin 8/12 (n=1, high variance scenario). Previous n=3 runs showed +2.7 delta. Single-run comparison is noisy.

Configuration

.mcp.json: stdio transport (via run-mcp-stdio.sh wrapper)
SessionStart hook: IMPACT_RULE removed, replaced by tool-embedded guidance
Agent constraint: Python/curl SSE bypass prohibited
Selector default: state:modified.body+ state:modified.macros+ state:modified.contract+

Companion PR

recce#1233 (merged) — tool description + selector narrowing in mcp_server.py
recce#1241 — node_id_by_name UnboundLocalError fix (from Copilot review)

…a patterns Agent was hallucinating data quality issues (e.g., "310 completed orders with $0") because the prompt asked "any data quality issues?" without defining what constitutes an actual pipeline bug vs inherent data patterns. Changes: - Prompt now explicitly asks for PIPELINE BUGS, not general data patterns - Lists known expected patterns (placed orders, partial payments, etc.) - Scoring keywords expanded: added "corrupted", "data loss", "regression" - Judge criteria updated: evaluate bug vs pattern distinction Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

iamcxa · 2026-03-24T14:17:25Z

Eval Results v3 — Full Suite with Proper Isolation

Batch ID: 20260324-1800 | Model: Sonnet | Isolation: --bare + full DuckDB reset | Adapter: DuckDB

Previous eval (20260324-1656) was invalid — patch state leaked across scenarios. v3 adds git checkout . && dbt run --full-refresh between every run.

Results

Scenario	Ch	Baseline	With-Plugin	Delta
ch1-healthy-audit	1	2/4	2/4	0
ch1-null-amounts	1	9/9	9/9	0
ch2-amount-misscale	2	11/12	12/12	+1
ch2-silent-filter	2	11/12	11/12	0
ch3-count-distinct	3	12/12	12/12	0
ch3-join-shift	3	9/12	9/12	0
ch3-phantom-filter	3	8/12	11/12	+3
Total		62/73	66/73	+4

Total cost: $7.14 | 14 runs | 0 failures

Key Findings

ch3-phantom-filter +3: Plugin's impact_analysis with embedded guidance (Option E) correctly classifies blast radius. Baseline misses downstream models.
ch2-amount-misscale +1: Plugin achieves perfect 12/12 vs baseline 11/12.
ch1-healthy-audit 2/4 (both): Improved from 0/4 after prompt fix (distinguish pipeline bugs from data patterns). Remaining 2 failures likely due to agents still flagging seed data patterns as issues.
ch3-count-distinct: profiles.yml target mismatch during with-plugin setup (dev-local not found), but agent still scored 12/12 by working around it.

Fixes in This Batch

Scenario isolation: git checkout . && dbt run --full-refresh between every run (fixes patch leakage from v1/v2)
ch1-healthy-audit prompt: Explicitly distinguishes pipeline bugs from inherent data patterns
false_positive_keywords: Expanded to include "corrupted", "data loss", "regression"

iamcxa and others added 2 commits March 24, 2026 16:48

iamcxa self-assigned this Mar 24, 2026

iamcxa marked this pull request as ready for review March 24, 2026 08:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(recce): stdio MCP transport + Option E eval improvements#18

feat(recce): stdio MCP transport + Option E eval improvements#18
iamcxa wants to merge 3 commits intomainfrom
feat/plugin-rename-and-recce-dev

iamcxa commented Mar 24, 2026

Uh oh!

iamcxa commented Mar 24, 2026

Uh oh!

iamcxa commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

iamcxa commented Mar 24, 2026

Summary

Why stdio over SSE?

Eval validation data

Test plan

Uh oh!

iamcxa commented Mar 24, 2026

Eval Results — Full Suite (7 scenarios, Sonnet, bare mode)

Results

Key Findings

Configuration

Companion PR

Uh oh!

iamcxa commented Mar 24, 2026

Eval Results v3 — Full Suite with Proper Isolation

Results

Key Findings

Fixes in This Batch

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant