
Commit f361d91

sjarmak and claude committed
Promote openlibrary-solr-boolean-fix-001 baseline, fix report metrics

- Promote ccb_fix_haiku_20260227_151833 from staging to official (openlibrary-solr-boolean-fix-001 baseline: reward=0.0, was null/error)
- Regenerate MANIFEST.json (251 valid pairs confirmed)
- Fix abstract SDLC delta: -0.019 → -0.015 (matched bootstrap output)
- Fix task count reference: 250 → 251
- Bootstrap CIs verified exact: overall +0.049 [+0.010, +0.088], SDLC -0.015 [-0.059, +0.029], MCP-unique +0.183 [+0.116, +0.255]
- IR evaluation unaffected (zero retrieval events on both sides)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 60395b3 commit f361d91
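The bootstrap confidence intervals cited in the commit message follow the standard paired design: resample per-task reward deltas (MCP minus baseline) with replacement and take percentiles of the resampled means. A minimal sketch of that computation (a hypothetical helper for illustration, not the repository's actual bootstrap script):

```python
import random

def paired_bootstrap_ci(baseline, mcp, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired reward delta (MCP minus baseline)."""
    deltas = [m - b for b, m in zip(baseline, mcp)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample task deltas with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return sum(deltas) / n, (lo, hi)
```

With the benchmark's 251 paired tasks this yields intervals of the same form as the quoted `+0.049 [+0.010, +0.088]`.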

File tree

24 files changed: +9655 −16 lines


docs/official_results/audits/ccb_fix_haiku_20260227_151833--baseline-local-direct--openlibrary-solr-boolean-fix-001.json

Lines changed: 2341 additions & 0 deletions (large diff not rendered by default)
Lines changed: 11 additions & 0 deletions
```diff
@@ -0,0 +1,11 @@
+# ccb_fix_haiku_20260227_151833
+
+## baseline-local-direct
+
+- Valid tasks: `1`
+- Mean reward: `0.000`
+- Pass rate: `0.000`
+
+| Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
+|---|---|---:|---:|---:|---|
+| [openlibrary-solr-boolean-fix-001](../tasks/ccb_fix_haiku_20260227_151833--baseline-local-direct--openlibrary-solr-boolean-fix-001.html) | `failed` | 0.000 | 0.000 | 51 | traj, tx |
```
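The summary fields in the generated report above (valid tasks, mean reward, pass rate) are simple aggregates over per-task records. A minimal sketch, assuming each record carries the `status` and `reward` fields used in MANIFEST.json:

```python
def summarize(task_records):
    """Aggregate per-task records (with `status` and `reward` fields) into
    the report's summary lines: valid tasks, mean reward, pass rate."""
    valid = [t for t in task_records if t["status"] != "errored"]
    n = len(valid)
    return {
        "valid_tasks": n,
        "mean_reward": sum(t["reward"] for t in valid) / n if n else 0.0,
        "pass_rate": sum(t["status"] == "passed" for t in valid) / n if n else 0.0,
    }
```

For this run's single failed task the aggregates reduce to the values shown: one valid task, mean reward 0.000, pass rate 0.000.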

docs/technical_reports/TECHNICAL_REPORT_V1.md

Lines changed: 13 additions & 8 deletions
```diff
@@ -1,13 +1,18 @@
 # CodeContextBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance
 
-**White Paper Technical Report**
+**Canonical Technical Report (Source of Truth)**
 **Date:** February 27, 2026
 
+> Canonical source policy: This document (`docs/technical_reports/TECHNICAL_REPORT_V1.md`)
+> is the authoritative source for technical report updates. Any white-paper or
+> presentation variants (including `docs/WHITE_PAPER_REPORT_V2.md`) should be
+> treated as derived artifacts synchronized from this report.
+
 ---
 
 ## Abstract
 
-CodeContextBench (CCB) is a benchmark suite of 251 software engineering tasks spanning the full Software Development Lifecycle (SDLC) designed to measure whether external code intelligence tools -- specifically Sourcegraph's Model Context Protocol (MCP) tools -- improve AI coding agent performance. The benchmark evaluates agents under two controlled conditions: a baseline with full local source code and no external tools, and an MCP-augmented configuration where source code is unavailable locally and the agent must use remote code intelligence tools (semantic search, symbol resolution, dependency tracing, etc.) to navigate codebases. Across all 251 paired task evaluations using Claude Haiku 4.5, the overall MCP effect is +0.049 (95% bootstrap CI: [+0.010, +0.088]) — a small but statistically significant positive. The effect is strongly task-dependent: MCP-unique cross-repository discovery tasks show +0.183, while SDLC tasks with full local code show -0.019 (not significant). This report documents the complete design, construction, information retrieval evaluation pipeline, task curation methodology, ground truth and verifier architecture, and findings from the benchmark's execution.
+CodeContextBench (CCB) is a benchmark suite of 251 software engineering tasks spanning the full Software Development Lifecycle (SDLC) designed to measure whether external code intelligence tools -- specifically Sourcegraph's Model Context Protocol (MCP) tools -- improve AI coding agent performance. The benchmark evaluates agents under two controlled conditions: a baseline with full local source code and no external tools, and an MCP-augmented configuration where source code is unavailable locally and the agent must use remote code intelligence tools (semantic search, symbol resolution, dependency tracing, etc.) to navigate codebases. Across all 251 paired task evaluations using Claude Haiku 4.5, the overall MCP effect is +0.049 (95% bootstrap CI: [+0.010, +0.088]) — a small but statistically significant positive. The effect is strongly task-dependent: MCP-unique cross-repository discovery tasks show +0.183, while SDLC tasks with full local code show -0.015 (not significant). This report documents the complete design, construction, information retrieval evaluation pipeline, task curation methodology, ground truth and verifier architecture, and findings from the benchmark's execution.
 
 ---
 
@@ -918,11 +923,11 @@ C tasks have the highest mean reward (0.801), driven by the Linux kernel fault l
 
 ### 11.4 Reward by Difficulty
 
-| Difficulty | n | Baseline Mean | Pass Rate |
-|-----------|---|--------------|-----------|
-| Medium | 26 | 0.592 | 69.2% |
-| Hard | 145 | 0.628 | 86.9% |
-| Expert | 5 | 0.800 | 100.0% |
+| Difficulty | n | Baseline Mean | MCP Mean | Pass Rate |
+|-----------|---|--------------|----------|-----------|
+| Medium | 26 | 0.592 | 0.667 | 69.2% |
+| Hard | 145 | 0.628 | 0.687 | 86.9% |
+| Expert | 5 | 0.800 | 0.800 | 100.0% |
 
 The counterintuitive result that "hard" tasks outperform "medium" tasks reflects that difficulty ratings were assigned based on expected human effort, not agent capability. Difficulty is a task-authoring metadata field (`task.toml` / selection registry `difficulty`) set from the anticipated human effort and coordination complexity of the scenario, rather than calibrated to current model behavior. Expert tasks (all Linux kernel fault localization) score highest because they are well-structured pattern-matching problems that agents handle effectively despite the large codebase scale.
 
@@ -1018,7 +1023,7 @@ Analysis of tool call patterns across 213 MCP task runs:
 | understand | 21 | 25.7 | 8.6 | 0.718 | 6.9 | 0.1 |
 | mcp_unique | 37 | 20.7 | 1.6 | 0.918 | 9.2 | 1.0 |
 
-The **fix** suite has the lowest MCP ratio (0.350) and highest local call count (39.8), reflecting that bug-fixing tasks require extensive local code editing after initial search. **Document** and **mcp_unique** suites have the highest MCP ratios (0.839 and 0.918 respectively), as these tasks are primarily about information retrieval rather than code modification. The near-total absence of Deep Search calls across all suites confirms that agents default to keyword search and rarely invoke the more expensive semantic analysis tools without explicit preamble guidance. Note: MCP tool usage statistics are drawn from the subset of MCP runs with extractable transcripts (n=213) and may not cover all 250 valid paired tasks.
+The **fix** suite has the lowest MCP ratio (0.350) and highest local call count (39.8), reflecting that bug-fixing tasks require extensive local code editing after initial search. **Document** and **mcp_unique** suites have the highest MCP ratios (0.839 and 0.918 respectively), as these tasks are primarily about information retrieval rather than code modification. The near-total absence of Deep Search calls across all suites confirms that agents default to keyword search and rarely invoke the more expensive semantic analysis tools without explicit preamble guidance. Note: MCP tool usage statistics are drawn from the subset of MCP runs with extractable transcripts (n=213) and may not cover all 251 valid paired tasks.
 
 **Reward--MCP correlation:** Spearman rho between MCP ratio and reward is **+0.293** in the analyzed paired slice, indicating a weak positive correlation — higher MCP tool usage is modestly associated with better outcomes, but the relationship is not strong enough to imply causation.
 
```
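The Spearman correlation quoted in the report (rho = +0.293 between MCP ratio and reward) is the Pearson correlation computed on ranks. A minimal self-contained sketch of that statistic, for illustration only (the repository's actual analysis script is not shown):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with tied values assigned their average rank."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        out = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # Extend j across a run of tied values.
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for k in range(i, j + 1):
                out[order[k]] = avg_rank
            i = j + 1
        return out

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Applied to per-task (MCP ratio, reward) pairs, a value near +0.3 is a weak monotone association, consistent with the report's caution against causal reading.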

runs/official/MANIFEST.json

Lines changed: 15 additions & 8 deletions
```diff
@@ -1,6 +1,6 @@
 {
   "description": "Canonical run manifest for CodeContextBench evaluation",
-  "generated": "2026-02-27T13:45:04.194332+00:00",
+  "generated": "2026-02-27T17:19:28.739563+00:00",
   "total_tasks": 765,
   "total_runs": 82,
   "runs": {
@@ -3592,9 +3592,9 @@
       "timestamp": "2026-02-24 20-41-47",
       "task_count": 25,
       "passed": 16,
-      "failed": 8,
-      "errored": 1,
-      "mean_reward": 0.499,
+      "failed": 9,
+      "errored": 0,
+      "mean_reward": 0.479,
       "tasks": {
         "ansible-abc-imports-fix-001": {
           "status": "passed",
@@ -3777,10 +3777,10 @@
           "judge_confidence": null
         },
         "openlibrary-solr-boolean-fix-001": {
-          "status": "errored",
+          "status": "failed",
           "reward": 0.0,
-          "has_trajectory": false,
-          "has_cost": false,
+          "has_trajectory": true,
+          "has_cost": true,
           "judge_score": null,
           "judge_model": null,
           "judge_dimensions": null,
@@ -13843,7 +13843,7 @@
       ]
     },
     "openlibrary-solr-boolean-fix-001": {
-      "n_runs": 2,
+      "n_runs": 3,
      "mean_reward": 0.0,
      "std_reward": 0.0,
      "runs": [
@@ -13860,6 +13860,13 @@
          "status": "failed",
          "is_paired": false,
          "run_dir": "fix_haiku_20260223_171232"
+        },
+        {
+          "started_at": "2026-02-27T15:18:42.341451",
+          "reward": 0.0,
+          "status": "failed",
+          "is_paired": false,
+          "run_dir": "ccb_fix_haiku_20260227_151833"
        }
      ]
    },
```
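The MANIFEST.json counter fix above (failed 8 → 9, errored 1 → 0, mean_reward 0.499 → 0.479) is exactly the drift a recomputation pass catches. A minimal consistency check, assuming only the field names visible in the manifest (`passed`/`failed`/`errored`, `mean_reward`, and per-task `status`/`reward`):

```python
def check_run_aggregates(run):
    """Recompute a run's summary counters from its per-task records and
    return any mismatches as {field: (stored, recomputed)}."""
    tasks = list(run["tasks"].values())
    recomputed = {
        status: sum(t["status"] == status for t in tasks)
        for status in ("passed", "failed", "errored")
    }
    recomputed["mean_reward"] = round(sum(t["reward"] for t in tasks) / len(tasks), 3)
    return {
        field: (run[field], value)
        for field, value in recomputed.items()
        if run[field] != value
    }
```

An empty result means the run's summary is internally consistent with its task records; a regeneration step can overwrite the stored counters with the recomputed ones.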
Lines changed: 66 additions & 0 deletions
```json
{
  "job_name": "2026-02-27__15-18-42",
  "jobs_dir": "runs/staging/ccb_fix_haiku_20260227_151833/baseline-local-direct",
  "n_attempts": 1,
  "timeout_multiplier": 10.0,
  "debug": false,
  "orchestrator": {
    "type": "local",
    "n_concurrent_trials": 1,
    "quiet": false,
    "retry": {
      "max_retries": 0,
      "include_exceptions": null,
      "exclude_exceptions": [
        "VerifierTimeoutError",
        "RewardFileEmptyError",
        "RewardFileNotFoundError",
        "AgentTimeoutError",
        "VerifierOutputParseError"
      ],
      "wait_multiplier": 1.0,
      "min_wait_sec": 1.0,
      "max_wait_sec": 60.0
    },
    "kwargs": {}
  },
  "environment": {
    "type": "docker",
    "import_path": null,
    "force_build": false,
    "delete": true,
    "override_cpus": null,
    "override_memory_mb": null,
    "override_storage_mb": null,
    "override_gpus": null,
    "kwargs": {}
  },
  "verifier": {
    "override_timeout_sec": null,
    "max_timeout_sec": null,
    "disable": false
  },
  "metrics": [],
  "agents": [
    {
      "name": null,
      "import_path": "agents.claude_baseline_agent:BaselineClaudeCodeAgent",
      "model_name": "anthropic/claude-haiku-4-5-20251001",
      "override_timeout_sec": null,
      "override_setup_timeout_sec": null,
      "max_timeout_sec": null,
      "kwargs": {}
    }
  ],
  "datasets": [],
  "tasks": [
    {
      "path": "/home/stephanie_jarmak/CodeContextBench/configs/../benchmarks/ccb_fix/openlibrary-solr-boolean-fix-001",
      "git_url": null,
      "git_commit_id": null,
      "overwrite": false,
      "download_dir": null,
      "source": null
    }
  ]
}
```
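The retry block in this config (`wait_multiplier`, `min_wait_sec`, `max_wait_sec`) suggests exponential backoff between retry attempts, although with `max_retries: 0` it is inert for this run. A sketch under the assumption of tenacity-style doubling clamped to the configured bounds (the harness's actual wait formula is not shown here):

```python
def retry_wait(attempt, wait_multiplier=1.0, min_wait_sec=1.0, max_wait_sec=60.0):
    """Seconds to wait before the given retry attempt (0-based).
    ASSUMPTION: tenacity-style exponential doubling, clamped to
    [min_wait_sec, max_wait_sec]; not taken from the harness source."""
    return max(min_wait_sec, min(max_wait_sec, wait_multiplier * 2 ** attempt))
```

Under these defaults the waits would run 1, 2, 4, 8, ... seconds, capping at 60; the `exclude_exceptions` list above prevents retrying failures (timeouts, missing reward files) that a rerun would not fix.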
