---
description: CCB analysis skills — compare configs, audit MCP usage, IR quality metrics, cost analysis, and trace evaluation. Use when analyzing benchmark results, comparing configurations, or investigating MCP impact.
globs:
  - scripts/compare_configs.py
  - scripts/mcp_audit.py
  - scripts/ir_analysis.py
  - scripts/cost_report.py
  - scripts/audit_traces.py
---

# Compare Configs

Compare results between agent configurations to find signal about MCP tool impact.

## Steps

### 1. Run the comparison
```bash
cd ~/CodeContextBench && python3 scripts/compare_configs.py --format json
```

### 2. Present results as tables

**Overall pass rates** by config, **divergence analysis** (stable, all-fail, divergent), and a **divergent task detail table**.

Focus on: the biggest winner, tasks where MCP helps, tasks where MCP hurts, and all-fail tasks.

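A minimal sketch of the divergence bucketing, assuming the script's JSON output exposes per-task pass/fail results per config under a `tasks` key (the real field names may differ):

```python
import json, subprocess

# Run the comparison and parse its JSON output (schema assumed for illustration).
out = subprocess.run(
    ["python3", "scripts/compare_configs.py", "--format", "json"],
    capture_output=True, text=True, check=True,
).stdout
tasks = json.loads(out)["tasks"]  # hypothetical top-level key

buckets = {"stable_pass": [], "all_fail": [], "divergent": []}
for task in tasks:
    outcomes = set(task["results"].values())  # config name -> passed bool (assumed)
    if outcomes == {True}:
        buckets["stable_pass"].append(task["task_id"])
    elif outcomes == {False}:
        buckets["all_fail"].append(task["task_id"])
    else:
        buckets["divergent"].append(task["task_id"])

for name, ids in buckets.items():
    print(f"{name}: {len(ids)} tasks")
```
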
### 3. MCP-conditioned analysis (optional)

```bash
python3 scripts/mcp_audit.py --paired-only --json --verbose 2>/dev/null
```

This separates used-MCP from zero-MCP tasks. Present a reward-delta table broken down by intensity bucket.

### Variants
```bash
python3 scripts/compare_configs.py --suite ccb_pytorch --format json
python3 scripts/compare_configs.py --divergent-only --format json
python3 scripts/compare_configs.py --format table
```

---

# MCP Audit

Analyze MCP (Sourcegraph) tool usage across benchmark runs.

## What This Does

`scripts/mcp_audit.py` (the pairing and bucketing logic is sketched below):
1. Collects `task_metrics.json` from paired_rerun batches
2. Pairs baseline vs sourcegraph_full tasks
3. Classifies by MCP usage: zero-MCP vs used-MCP (light/moderate/heavy)
4. Computes reward and time deltas conditioned on actual MCP usage
5. Identifies negative flips

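A minimal sketch of steps 2-4, assuming each collected record carries a task id, config name, reward, and an MCP call count; the field names and bucket thresholds here are illustrative, not the script's actual ones:

```python
from collections import defaultdict

def bucket(mcp_calls: int) -> str:
    # Illustrative thresholds; the script's real cutoffs may differ.
    if mcp_calls == 0:
        return "zero"
    if mcp_calls <= 5:
        return "light"
    if mcp_calls <= 20:
        return "moderate"
    return "heavy"

def pair_and_classify(records):
    """records: iterable of dicts like
    {"task_id": ..., "config": "baseline" | "sourcegraph_full",
     "reward": float, "mcp_calls": int} (assumed shape)."""
    by_task = defaultdict(dict)
    for r in records:
        by_task[r["task_id"]][r["config"]] = r

    deltas = defaultdict(list)
    for task_id, configs in by_task.items():
        if "baseline" not in configs or "sourcegraph_full" not in configs:
            continue  # only paired tasks are comparable
        bl, sf = configs["baseline"], configs["sourcegraph_full"]
        deltas[bucket(sf["mcp_calls"])].append(sf["reward"] - bl["reward"])
    return deltas  # bucket -> list of reward deltas (SF minus BL)
```
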
## Steps

### 1. Run the audit
```bash
cd ~/CodeContextBench && python3 scripts/mcp_audit.py --json --verbose 2>/dev/null
```

### 2. Present key findings

Tables: Overview, per-benchmark MCP adoption, reward deltas (used-MCP only), timing deltas.

### 3. Investigate zero-MCP tasks

Classify each zero-MCP task as one of: trivially local, explicit file list, full local codebase, both configs failed, or agent confusion.

### 4. Check for negative flips

Identify tasks where the baseline passes but sourcegraph_full fails.

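A small sketch of the flip check, reusing the paired `by_task` structure from the sketch above; the boolean `passed` field is an assumed name:

```python
def negative_flips(by_task):
    """by_task: {task_id: {"baseline": {...}, "sourcegraph_full": {...}}},
    where each record has a boolean "passed" field (assumed)."""
    return [
        task_id
        for task_id, cfgs in by_task.items()
        if "baseline" in cfgs and "sourcegraph_full" in cfgs
        and cfgs["baseline"]["passed"] and not cfgs["sourcegraph_full"]["passed"]
    ]
```
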
### 5. MCP tool distribution

Show which tools are most/least used.

### 6. Summary and recommendations

MCP value, MCP risk, optimization opportunities, cost-benefit.

### Variants
```bash
python3 scripts/mcp_audit.py --all-runs --json --verbose
python3 scripts/mcp_audit.py --verbose  # text output
```

### Key Technical Notes
- Transcript-first extraction: tool counts come from `claude-code.txt`, NOT from `trajectory.json`
- Paired reruns: baseline (BL) and sourcegraph_full (SF) run concurrently on the same VM
- MCP tool name variants: tool names may appear with or without the `sg_` prefix; the script handles both (see the sketch below)

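A sketch of transcript-first counting that tolerates both tool-name variants; the exact tool-name strings appearing in `claude-code.txt` are assumed here for illustration:

```python
import re
from collections import Counter
from pathlib import Path

# Matches e.g. "mcp__sourcegraph__sg_keyword_search" or
# "mcp__sourcegraph__keyword_search" (name pattern assumed).
MCP_TOOL_RE = re.compile(r"mcp__sourcegraph__(?:sg_)?(\w+)")

def count_mcp_tools(transcript_path: str) -> Counter:
    text = Path(transcript_path).read_text(errors="ignore")
    # Normalize away the optional sg_ prefix so both variants collapse together.
    return Counter(MCP_TOOL_RE.findall(text))

# Usage: count_mcp_tools("runs/.../claude-code.txt")
```
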
---

# IR Analysis

Measure how well agents find the right files, comparing baseline vs MCP retrieval against ground truth.

## Steps

### 1. Ensure ground truth is built
```bash
cd ~/CodeContextBench && python3 scripts/ir_analysis.py --build-ground-truth
```

### 2. Run the IR analysis
```bash
cd ~/CodeContextBench && python3 scripts/ir_analysis.py --json 2>/dev/null
```

### 3. Present key findings

Per-benchmark IR scores, overall aggregates, statistical tests.

Key metrics: file recall, MRR (mean reciprocal rank), context efficiency, and P@K (precision at K).

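A minimal sketch of the standard definitions of recall, MRR, and P@K over an agent's ranked list of touched files versus the ground-truth file set (context efficiency is script-specific and not reproduced here):

```python
def file_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of ground-truth files the agent actually touched."""
    return len(relevant & set(retrieved)) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file (0 if none found)."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the first k retrieved files that are in the ground truth."""
    top_k = retrieved[:k]
    return sum(1 for p in top_k if p in relevant) / k if k else 0.0
```
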
### Variants
```bash
python3 scripts/ir_analysis.py --per-task --json 2>/dev/null
python3 scripts/ir_analysis.py --suite ccb_swebenchpro 2>/dev/null
```

### Ground Truth Sources

| Benchmark | Strategy | Confidence |
|-----------|----------|:----------:|
| SWE-bench Pro | Patch headers | high |
| PyTorch | Diff headers | high |
| K8s Docs | Directory listing | high |
| Governance/Enterprise | Test script paths | medium |
| Others | Instruction regex | low |

---

# Cost Report

Analyze token usage and estimated cost across benchmark runs.

## Steps
```bash
cd ~/CodeContextBench && python3 scripts/cost_report.py
```

Shows: total cost, tokens, and hours; per-suite and per-config breakdowns; config cost comparison; and the top 10 most expensive tasks.

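A sketch of how a cost estimate can be derived from token counts; the per-million-token prices below are placeholder assumptions, not the script's actual rate table:

```python
# Hypothetical price table (USD per million tokens); replace with real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one task from its token counts."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
    )

# Usage: sum estimate_cost(t["input_tokens"], t["output_tokens"]) over tasks
# (field names assumed) to reproduce the per-suite/per-config rollups.
```
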
### Variants
```bash
python3 scripts/cost_report.py --suite ccb_pytorch
python3 scripts/cost_report.py --config sourcegraph_full
python3 scripts/cost_report.py --format json
```

---

# Evaluate Traces

Comprehensive evaluation of benchmark run traces: data integrity, output quality, efficiency analysis.

## Phases

### Phase 1: Scope Selection
- MANIFEST: `runs/official/MANIFEST.json`
- Audit script: `python3 scripts/audit_traces.py [--json] [--suite X] [--config X]`

### Phase 2: Data Integrity
- MCP adoption validation (transcript-first, check both `sg_` prefix variants)
- Baseline contamination check (zero `mcp__sourcegraph` calls; sketched below)
- Infrastructure failure detection (zero-token, crash, null-token H3 bug)
- Dedup integrity

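A minimal sketch of the baseline contamination check for one task's run directory; the directory layout and transcript filename are assumptions based on the notes above:

```python
from pathlib import Path

def baseline_contaminated(run_dir: str) -> bool:
    """A baseline task should contain zero mcp__sourcegraph calls
    in its claude-code.txt transcript."""
    transcript = Path(run_dir) / "claude-code.txt"  # assumed location
    if not transcript.exists():
        return False  # nothing to check; handled by failure detection instead
    return "mcp__sourcegraph" in transcript.read_text(errors="ignore")

# Usage: flag any baseline task directory where baseline_contaminated(...) is True.
```
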
### Phase 3: Output Quality
- Per-suite reward analysis
- Cross-config comparison (matched tasks)
- Task-level quality patterns (MCP helps/hurts/neutral)

### Phase 4: Efficiency
- Token usage and cost estimates
- Wall clock time deltas
- MCP tool distribution
- Cost-effectiveness ratios

### Phase 5: Synthesis
Write report to `docs/TRACE_AUDIT_<date>.md`.

## Known Patterns
1. Zero-token (int 0) = auth failures
2. Null-token + no trajectory + <=5 lines = crash failures
3. Null-token + valid rewards = H3 token-logging bug (not failures)
4. MCP distraction on TAC
5. Deep Search unused (~1%)
6. SWE-Perf regression under SG_base
7. Subagent MCP calls hidden in trajectory.json (visible only in claude-code.txt)
8. Zero-MCP is ~80% rational
9. Monotonic MCP intensity-reward relationship: light +2.2%, moderate +3.6%, heavy +6.1%
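
Patterns 1-3 can be applied mechanically. A sketch of such a classifier, where the `<= 5 lines` check is assumed to apply to the transcript and the field names are illustrative rather than the actual `task_metrics.json` keys:

```python
def classify_trace(tokens, has_trajectory: bool, transcript_lines: int, reward) -> str:
    """tokens: int, None, or missing; reward: float or None."""
    if tokens == 0:
        return "auth_failure"              # pattern 1: zero-token (int 0)
    if tokens is None and not has_trajectory and transcript_lines <= 5:
        return "crash_failure"             # pattern 2
    if tokens is None and reward is not None:
        return "h3_token_logging_bug"      # pattern 3: not a real failure
    return "ok"
```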