README.md (+27 −22)
@@ -1,8 +1,14 @@
# CodeScaleBench

-Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper *"CodeScaleBench: Evaluating Coding Agents on Real-Scale Software Engineering Tasks Across the Development Lifecycle."*
+Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC.

-This repository contains **benchmark task definitions**, **evaluation configs**, and a **metrics extraction pipeline**. Tasks are executed via the [Harbor](https://github.com/laude-institute/harbor/tree/main) runner with the Claude Code agent harness.
+This repository contains:
+- **Benchmark task definitions** (SDLC and Org suites with task specs, tests, and metadata)
+- **Evaluation and run configs** (paired baseline vs MCP-enabled execution modes)
+- **Metrics extraction and reporting pipelines** for score/cost/retrieval analysis
+- **Run artifacts and agent traces** (in `runs/` and published summaries under `docs/official_results/`)
+
+Tasks are executed via the [Harbor](https://github.com/laude-institute/harbor/tree/main) runner with the Claude Code agent harness.

**Combined canonical benchmark: 370 tasks** (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in `benchmarks/backups/`.
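The Neyman-optimal allocation mentioned above assigns more tasks to suites whose scores vary more, rather than splitting the budget evenly. As a rough illustration of the idea only — the suite names, pool sizes, and standard deviations below are invented, not taken from the benchmark — the standard stratified-sampling formula n_h = n · N_h·σ_h / Σ_j N_j·σ_j can be sketched as:

```python
def neyman_allocation(total_n, strata):
    """Allocate total_n samples across strata proportional to N_h * sigma_h,
    the Neyman-optimal rule for stratified sampling.

    strata maps a name to (pool_size, estimated_score_std_dev).
    """
    weights = {name: size * sd for name, (size, sd) in strata.items()}
    total_w = sum(weights.values())
    # Round to integers; a real pipeline would also repair any rounding drift
    # so the allocations sum exactly to total_n.
    return {name: round(total_n * w / total_w) for name, w in weights.items()}

# Hypothetical suites: (candidate-pool size, estimated score std dev)
strata = {
    "suite_a": (100, 0.30),  # high variance -> gets the most tasks
    "suite_b": (100, 0.10),
    "suite_c": (50, 0.20),
}
alloc = neyman_allocation(150, strata)
print(alloc)  # → {'suite_a': 90, 'suite_b': 30, 'suite_c': 30}
```

High-variance suites receive proportionally more tasks, which is what maximizes statistical power for a fixed total budget.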
@@ -110,16 +115,16 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
## 2-Config Evaluation Matrix

-All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
+All benchmarks are evaluated across two primary configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
+Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.

-At the paper level, the distinction is still:
+At a high level, the distinction is:

-| Paper Config Name | Internal MCP mode | MCP Tools Available |
+| Config Name | Internal MCP mode | MCP Tools Available |
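Analysis scripts that must handle both current and legacy run directory names could normalize them with a small alias map. The directory names below come from the README text, but the function itself is an illustrative sketch, not the repo's actual code:

```python
# Map legacy run-directory names onto the two primary configurations
# (Baseline vs MCP-Full), as described in the README.
CONFIG_ALIASES = {
    "baseline": "Baseline",
    "sourcegraph_full": "MCP-Full",
    "artifact_full": "MCP-Full",
}

def paper_config(run_dir_name: str) -> str:
    """Return the primary config for a run directory name, falling back
    to the name itself when it is not a known legacy alias."""
    return CONFIG_ALIASES.get(run_dir_name, run_dir_name)

print(paper_config("sourcegraph_full"))  # → MCP-Full
print(paper_config("baseline"))          # → Baseline
```

Centralizing the mapping in one place keeps historical outputs comparable with new runs without renaming artifacts on disk.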
docs/CONTROL_PLANE.md (+1 −1)
@@ -103,7 +103,7 @@ Either way, the **control plane** is the spec + manifest; the runner is a consum
## Relation to existing v2 experiment YAMLs

-The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `run-eval run -c experiment.yaml`) that uses Harbor’s **registry** and dataset/task_names. That path is well-suited to benchmarks like swebenchpro that are in the registry.
+The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `scripts/run-eval run -c experiment.yaml`) that uses Harbor’s **registry** and dataset/task_names. That path is well-suited to benchmarks like swebenchpro that are in the registry.

The **control plane layer** described here is complementary:
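For context, an experiment YAML consumed by `scripts/run-eval run -c experiment.yaml` might look roughly like the following. The field names are illustrative assumptions based only on the dataset/task_names terminology above — the actual schema lives in `lib/config` and may differ:

```yaml
# Hypothetical sketch of a v2 experiment config; not the repo's real schema.
dataset: swebenchpro        # a benchmark resolved via Harbor's registry
task_names:                 # subset of tasks to run (omit to run all)
  - task-001
  - task-002
agent: claude-code          # agent harness used for execution
configs:                    # paired execution modes to compare
  - baseline
  - mcp_full
```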