Skip to content

feat(recce): stdio MCP transport + Option E eval improvements#18

Open
iamcxa wants to merge 3 commits intomainfrom
feat/plugin-rename-and-recce-dev
Open

feat(recce): stdio MCP transport + Option E eval improvements#18
iamcxa wants to merge 3 commits intomainfrom
feat/plugin-rename-and-recce-dev

Conversation

@iamcxa
Copy link
Contributor

@iamcxa iamcxa commented Mar 24, 2026

Summary

  • SSE → stdio MCP transport: Plugin .mcp.json now uses stdio instead of SSE. Claude Code spawns the MCP server as a child process — no external server lifecycle to manage, no DuckDB lock conflicts, no port coordination.
  • Option E — tool-embedded guidance: Removed IMPACT_RULE from SessionStart hook. Guidance now lives in the impact_analysis tool description + response metadata (recce PR #1233). Eval-validated: bare-mode plugin delta +2.7 (comparable to hook-based +3.0).
  • Agent MCP bypass prohibition: Reviewer agent now explicitly forbidden from using Python/curl to interact with SSE endpoints directly.
  • Eval acceleration: --skip-setup, --skip-teardown, --model flags for run-case.sh. Enables parallel runs + model selection, reducing batch time from ~30-60min to ~10-15min.

Why stdio over SSE?

SSE (before) stdio (after)
Server lifecycle Plugin hook starts/stops SSE server Claude Code spawns on demand
DuckDB locks SSE holds persistent read connection → blocks dbt run No persistent connection — spawned per session
Port conflicts Hardcoded 8081, multi-project collisions No ports — stdin/stdout
Agent bypass risk Agent can curl the SSE endpoint directly No HTTP endpoint to bypass

Eval validation data

ch3-phantom-filter (n=3, bare mode + Option E v2):

Run Score
1 12/12
2 12/12
3 12/12

Companion change: recce#1233 (tool description + selector narrowing in mcp_server.py)

Test plan

  • Eval validation: bare-mode n=3, ch3-phantom-filter → 12/12 (2 of 3 valid runs perfect)
  • Parallel stdio MCP: 3 concurrent sessions, zero DuckDB lock failures
  • --skip-setup + --model sonnet: single run completes correctly
  • Manual: install plugin locally, verify /recce-review works with stdio transport

🤖 Generated with Claude Code

iamcxa and others added 2 commits March 24, 2026 16:48
- Switch .mcp.json from SSE to stdio transport via run-mcp-stdio.sh
  wrapper (venv auto-detection, no external server to manage)
- Remove IMPACT_RULE from SessionStart hook — guidance now embedded
  in impact_analysis tool description + response (Option E)
- Simplify session-start.sh: remove SSE server lifecycle management,
  replace with MCP_READY readiness check
- Add agent constraint: prohibit Python/curl bypass of MCP tools

Eval validation (ch3-phantom-filter, n=3):
  bare+Option E v2 delta: +2.7 (vs clean-profile +3.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speed up iterative eval by allowing:
- --skip-setup: reuse pre-applied patch + dbt state (no re-setup)
- --skip-teardown: keep state for subsequent parallel runs
- --model: choose model (e.g., sonnet) for faster/cheaper runs

Enables: setup once → parallel claude sessions → teardown once.
Reduces n=3 batch time from ~30-60min to ~10-15min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa iamcxa self-assigned this Mar 24, 2026
@iamcxa iamcxa marked this pull request as ready for review March 24, 2026 08:53
@iamcxa
Copy link
Contributor Author

iamcxa commented Mar 24, 2026

Eval Results — Full Suite (7 scenarios, Sonnet, bare mode)

Batch ID: 20260324-1656 | Model: Sonnet | Isolation: --bare | Adapter: DuckDB

Results

Scenario Ch Baseline With-Plugin Delta Cost
ch1-healthy-audit 1 0/4 2/4 +2 $0.74
ch1-null-amounts 1 9/9 9/9 0 $0.84
ch2-amount-misscale 2 11/12 11/12 0 $0.67
ch2-silent-filter 2 11/12 11/12 0 $0.79
ch3-count-distinct 3 12/12 ❌ patch fail $0.41
ch3-join-shift 3 9/12 9/12 0 $0.76
ch3-phantom-filter 3 8/12 8/12 0 $0.86

Total: 13/14 runs successful, $5.07 total cost

Key Findings

  1. Option E tool-embedded guidance works in bare mode — with-plugin runs have access to impact_analysis tool with embedded IMPORTANT call-ordering block + _guidance response metadata.

  2. ch3-count-distinct with-plugin failed — patch apply error (previous baseline's orders_daily_summary.sql teardown left modified state). Not an Option E issue — infrastructure bug in sequential batch runner.

  3. ch1-healthy-audit shows plugin value — baseline 0/4 vs with-plugin 2/4. The plugin's impact_analysis correctly reports no data changes, helping the agent produce a cleaner "no issues" report.

  4. ch3-phantom-filter — baseline 8/12 = with-plugin 8/12 (n=1, high variance scenario). Previous n=3 runs showed +2.7 delta. Single-run comparison is noisy.

Configuration

  • .mcp.json: stdio transport (via run-mcp-stdio.sh wrapper)
  • SessionStart hook: IMPACT_RULE removed, replaced by tool-embedded guidance
  • Agent constraint: Python/curl SSE bypass prohibited
  • Selector default: state:modified.body+ state:modified.macros+ state:modified.contract+

Companion PR

  • recce#1233 (merged) — tool description + selector narrowing in mcp_server.py
  • recce#1241node_id_by_name UnboundLocalError fix (from Copilot review)

…a patterns

Agent was hallucinating data quality issues (e.g., "310 completed orders
with $0") because the prompt asked "any data quality issues?" without
defining what constitutes an actual pipeline bug vs inherent data patterns.

Changes:
- Prompt now explicitly asks for PIPELINE BUGS, not general data patterns
- Lists known expected patterns (placed orders, partial payments, etc.)
- Scoring keywords expanded: added "corrupted", "data loss", "regression"
- Judge criteria updated: evaluate bug vs pattern distinction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa
Copy link
Contributor Author

iamcxa commented Mar 24, 2026

Eval Results v3 — Full Suite with Proper Isolation

Batch ID: 20260324-1800 | Model: Sonnet | Isolation: --bare + full DuckDB reset | Adapter: DuckDB

Previous eval (20260324-1656) was invalid — patch state leaked across scenarios. v3 adds git checkout . && dbt run --full-refresh between every run.

Results

Scenario Ch Baseline With-Plugin Delta
ch1-healthy-audit 1 2/4 2/4 0
ch1-null-amounts 1 9/9 9/9 0
ch2-amount-misscale 2 11/12 12/12 +1
ch2-silent-filter 2 11/12 11/12 0
ch3-count-distinct 3 12/12 12/12 0
ch3-join-shift 3 9/12 9/12 0
ch3-phantom-filter 3 8/12 11/12 +3
Total 62/73 66/73 +4

Total cost: $7.14 | 14 runs | 0 failures

Key Findings

  1. ch3-phantom-filter +3: Plugin's impact_analysis with embedded guidance (Option E) correctly classifies blast radius. Baseline misses downstream models.

  2. ch2-amount-misscale +1: Plugin achieves perfect 12/12 vs baseline 11/12.

  3. ch1-healthy-audit 2/4 (both): Improved from 0/4 after prompt fix (distinguish pipeline bugs from data patterns). Remaining 2 failures likely due to agents still flagging seed data patterns as issues.

  4. ch3-count-distinct: profiles.yml target mismatch during with-plugin setup (dev-local not found), but agent still scored 12/12 by working around it.

Fixes in This Batch

  • Scenario isolation: git checkout . && dbt run --full-refresh between every run (fixes patch leakage from v1/v2)
  • ch1-healthy-audit prompt: Explicitly distinguishes pipeline bugs from inherent data patterns
  • false_positive_keywords: Expanded to include "corrupted", "data loss", "regression"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant