---
description: CCB analysis skills — compare configs, audit MCP usage, IR quality metrics, cost analysis, and trace evaluation. Use when analyzing benchmark results, comparing configurations, or investigating MCP impact.
globs:
  - scripts/compare_configs.py
  - scripts/mcp_audit.py
  - scripts/ir_analysis.py
  - scripts/cost_report.py
  - scripts/audit_traces.py
---

# Compare Configs

Compare results between agent configurations to find signal about MCP tool impact.

## Steps

### 1. Run the comparison
```bash
cd ~/CodeContextBench && python3 scripts/compare_configs.py --format json
```

### 2. Present results as tables

**Overall pass rates** by config, **divergence analysis** (stable, all-fail, divergent), and a **divergent task detail table**.

Focus on: the biggest winner, tasks where MCP helps, tasks where MCP hurts, and all-fail tasks.

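A minimal sketch of the divergence bucketing, assuming the script's JSON output exposes per-task pass/fail results per config under a `tasks` key (the real field names may differ):

```python
import json, subprocess

# Run the comparison and parse its JSON output (schema assumed for illustration).
out = subprocess.run(
    ["python3", "scripts/compare_configs.py", "--format", "json"],
    capture_output=True, text=True, check=True,
).stdout
tasks = json.loads(out)["tasks"]  # hypothetical top-level key

buckets = {"stable_pass": [], "all_fail": [], "divergent": []}
for task in tasks:
    outcomes = set(task["results"].values())  # config name -> passed bool (assumed)
    if outcomes == {True}:
        buckets["stable_pass"].append(task["task_id"])
    elif outcomes == {False}:
        buckets["all_fail"].append(task["task_id"])
    else:
        buckets["divergent"].append(task["task_id"])

for name, ids in buckets.items():
    print(f"{name}: {len(ids)} tasks")
```
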
### 3. MCP-conditioned analysis (optional)

```bash
python3 scripts/mcp_audit.py --paired-only --json --verbose 2>/dev/null
```

This separates used-MCP from zero-MCP tasks. Present a reward-delta table broken down by intensity bucket.

### Variants
```bash
python3 scripts/compare_configs.py --suite ccb_pytorch --format json
python3 scripts/compare_configs.py --divergent-only --format json
python3 scripts/compare_configs.py --format table
```

---

# MCP Audit

Analyze MCP (Sourcegraph) tool usage across benchmark runs.

## What This Does

`scripts/mcp_audit.py` (the pairing and bucketing logic is sketched below):
1. Collects `task_metrics.json` from paired_rerun batches
2. Pairs baseline vs sourcegraph_full tasks
3. Classifies by MCP usage: zero-MCP vs used-MCP (light/moderate/heavy)
4. Computes reward and time deltas conditioned on actual MCP usage
5. Identifies negative flips

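A minimal sketch of steps 2-4, assuming each collected record carries a task id, config name, reward, and an MCP call count; the field names and bucket thresholds here are illustrative, not the script's actual ones:

```python
from collections import defaultdict

def bucket(mcp_calls: int) -> str:
    # Illustrative thresholds; the script's real cutoffs may differ.
    if mcp_calls == 0:
        return "zero"
    if mcp_calls <= 5:
        return "light"
    if mcp_calls <= 20:
        return "moderate"
    return "heavy"

def pair_and_classify(records):
    """records: iterable of dicts like
    {"task_id": ..., "config": "baseline" | "sourcegraph_full",
     "reward": float, "mcp_calls": int} (assumed shape)."""
    by_task = defaultdict(dict)
    for r in records:
        by_task[r["task_id"]][r["config"]] = r

    deltas = defaultdict(list)
    for task_id, configs in by_task.items():
        if "baseline" not in configs or "sourcegraph_full" not in configs:
            continue  # only paired tasks are comparable
        bl, sf = configs["baseline"], configs["sourcegraph_full"]
        deltas[bucket(sf["mcp_calls"])].append(sf["reward"] - bl["reward"])
    return deltas  # bucket -> list of reward deltas (SF minus BL)
```
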
## Steps

### 1. Run the audit
```bash
cd ~/CodeContextBench && python3 scripts/mcp_audit.py --json --verbose 2>/dev/null
```

### 2. Present key findings

Tables: Overview, per-benchmark MCP adoption, reward deltas (used-MCP only), timing deltas.

### 3. Investigate zero-MCP tasks

Classify each zero-MCP task as one of: trivially local, explicit file list, full local codebase, both configs failed, or agent confusion.

### 4. Check for negative flips

Identify tasks where the baseline passes but sourcegraph_full fails.

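A small sketch of the flip check, reusing the paired `by_task` structure from the sketch above; the boolean `passed` field is an assumed name:

```python
def negative_flips(by_task):
    """by_task: {task_id: {"baseline": {...}, "sourcegraph_full": {...}}},
    where each record has a boolean "passed" field (assumed)."""
    return [
        task_id
        for task_id, cfgs in by_task.items()
        if "baseline" in cfgs and "sourcegraph_full" in cfgs
        and cfgs["baseline"]["passed"] and not cfgs["sourcegraph_full"]["passed"]
    ]
```
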
### 5. MCP tool distribution

Show which tools are most/least used.

### 6. Summary and recommendations

MCP value, MCP risk, optimization opportunities, cost-benefit.

### Variants
```bash
python3 scripts/mcp_audit.py --all-runs --json --verbose
python3 scripts/mcp_audit.py --verbose  # text output
```

### Key Technical Notes
- Transcript-first extraction: tool counts come from `claude-code.txt`, NOT from `trajectory.json`
- Paired reruns: baseline (BL) and sourcegraph_full (SF) run concurrently on the same VM
- MCP tool name variants: tool names may appear with or without the `sg_` prefix; the script handles both (see the sketch below)

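A sketch of transcript-first counting that tolerates both tool-name variants; the exact tool-name strings appearing in `claude-code.txt` are assumed here for illustration:

```python
import re
from collections import Counter
from pathlib import Path

# Matches e.g. "mcp__sourcegraph__sg_keyword_search" or
# "mcp__sourcegraph__keyword_search" (name pattern assumed).
MCP_TOOL_RE = re.compile(r"mcp__sourcegraph__(?:sg_)?(\w+)")

def count_mcp_tools(transcript_path: str) -> Counter:
    text = Path(transcript_path).read_text(errors="ignore")
    # Normalize away the optional sg_ prefix so both variants collapse together.
    return Counter(MCP_TOOL_RE.findall(text))

# Usage: count_mcp_tools("runs/.../claude-code.txt")
```
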
---

# IR Analysis

Measure how well agents find the right files, comparing baseline vs MCP retrieval against ground truth.

## Steps

### 1. Ensure ground truth is built
```bash
cd ~/CodeContextBench && python3 scripts/ir_analysis.py --build-ground-truth
```

### 2. Run the IR analysis
```bash
cd ~/CodeContextBench && python3 scripts/ir_analysis.py --json 2>/dev/null
```

### 3. Present key findings

Per-benchmark IR scores, overall aggregates, statistical tests.

Key metrics: file recall, MRR (mean reciprocal rank), context efficiency, and P@K (precision at K).

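A minimal sketch of the standard definitions of recall, MRR, and P@K over an agent's ranked list of touched files versus the ground-truth file set (context efficiency is script-specific and not reproduced here):

```python
def file_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of ground-truth files the agent actually touched."""
    return len(relevant & set(retrieved)) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file (0 if none found)."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the first k retrieved files that are in the ground truth."""
    top_k = retrieved[:k]
    return sum(1 for p in top_k if p in relevant) / k if k else 0.0
```
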
### Variants
```bash
python3 scripts/ir_analysis.py --per-task --json 2>/dev/null
python3 scripts/ir_analysis.py --suite ccb_swebenchpro 2>/dev/null
```

### Ground Truth Sources

| Benchmark | Strategy | Confidence |
|-----------|----------|:----------:|
| SWE-bench Pro | Patch headers | high |
| PyTorch | Diff headers | high |
| K8s Docs | Directory listing | high |
| Governance/Enterprise | Test script paths | medium |
| Others | Instruction regex | low |

---

# Cost Report

Analyze token usage and estimated cost across benchmark runs.

## Steps
```bash
cd ~/CodeContextBench && python3 scripts/cost_report.py
```

Shows: total cost, tokens, and hours; per-suite and per-config breakdowns; config cost comparison; and the top 10 most expensive tasks.

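A sketch of how a cost estimate can be derived from token counts; the per-million-token prices below are placeholder assumptions, not the script's actual rate table:

```python
# Hypothetical price table (USD per million tokens); replace with real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one task from its token counts."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
    )

# Usage: sum estimate_cost(t["input_tokens"], t["output_tokens"]) over tasks
# (field names assumed) to reproduce the per-suite/per-config rollups.
```
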
### Variants
```bash
python3 scripts/cost_report.py --suite ccb_pytorch
python3 scripts/cost_report.py --config sourcegraph_full
python3 scripts/cost_report.py --format json
```

---

# Evaluate Traces

Comprehensive evaluation of benchmark run traces: data integrity, output quality, efficiency analysis.

## Phases

### Phase 1: Scope Selection
- MANIFEST: `runs/official/MANIFEST.json`
- Audit script: `python3 scripts/audit_traces.py [--json] [--suite X] [--config X]`

### Phase 2: Data Integrity
- MCP adoption validation (transcript-first, check both `sg_` prefix variants)
- Baseline contamination check (zero `mcp__sourcegraph` calls; sketched below)
- Infrastructure failure detection (zero-token, crash, null-token H3 bug)
- Dedup integrity

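A minimal sketch of the baseline contamination check for one task's run directory; the directory layout and transcript filename are assumptions based on the notes above:

```python
from pathlib import Path

def baseline_contaminated(run_dir: str) -> bool:
    """A baseline task should contain zero mcp__sourcegraph calls
    in its claude-code.txt transcript."""
    transcript = Path(run_dir) / "claude-code.txt"  # assumed location
    if not transcript.exists():
        return False  # nothing to check; handled by failure detection instead
    return "mcp__sourcegraph" in transcript.read_text(errors="ignore")

# Usage: flag any baseline task directory where baseline_contaminated(...) is True.
```
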
### Phase 3: Output Quality
- Per-suite reward analysis
- Cross-config comparison (matched tasks)
- Task-level quality patterns (MCP helps/hurts/neutral)

### Phase 4: Efficiency
- Token usage and cost estimates
- Wall clock time deltas
- MCP tool distribution
- Cost-effectiveness ratios

### Phase 5: Synthesis
Write report to `docs/TRACE_AUDIT_<date>.md`.

## Known Patterns
1. Zero-token (int 0) = auth failures
2. Null-token + no trajectory + <=5 lines = crash failures
3. Null-token + valid rewards = H3 token-logging bug (not failures)
4. MCP distraction on TAC
5. Deep Search unused (~1%)
6. SWE-Perf regression under SG_base
7. Subagent MCP calls hidden in trajectory.json (visible only in claude-code.txt)
8. Zero-MCP is ~80% rational
9. Monotonic MCP intensity-reward relationship: light +2.2%, moderate +3.6%, heavy +6.1%
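
Patterns 1-3 can be applied mechanically. A sketch of such a classifier, where the `<= 5 lines` check is assumed to apply to the transcript and the field names are illustrative rather than the actual `task_metrics.json` keys:

```python
def classify_trace(tokens, has_trajectory: bool, transcript_lines: int, reward) -> str:
    """tokens: int, None, or missing; reward: float or None."""
    if tokens == 0:
        return "auth_failure"              # pattern 1: zero-token (int 0)
    if tokens is None and not has_trajectory and transcript_lines <= 5:
        return "crash_failure"             # pattern 2
    if tokens is None and reward is not None:
        return "h3_token_logging_bug"      # pattern 3: not a real failure
    return "ok"
```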