Commit 8875459

docs: remove contributor docs and move run-eval into scripts
1 parent a19f83e commit 8875459

File tree

8 files changed: +31 -524 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ reports/
 eval_reports/
 tmp/
 *.log
+ds
 
 # Large external datasets (reference by symlink or variable)
 ccb_crossrepo/
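The added `ds` entry is a bare name, which gitignore matches as a file or directory at any depth. A rough sketch of that matching using Python's `fnmatch` (an approximation only; real gitignore semantics also cover negation, anchoring, and `**`):

```python
from fnmatch import fnmatch

# Simplified view of the patterns in this hunk; directory entries
# like "tmp/" are modeled here as plain names.
PATTERNS = ["eval_reports", "tmp", "*.log", "ds", "ccb_crossrepo"]

def is_ignored(path: str) -> bool:
    """Approximate gitignore check: match the whole path or any path segment."""
    segments = path.split("/")
    return any(
        fnmatch(path, pat) or any(fnmatch(seg, pat) for seg in segments)
        for pat in PATTERNS
    )
```

Under this approximation, `ds`, `runs/ds`, and `agent.log` are ignored, while a sibling name such as `dsx` is not.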

CODE_OF_CONDUCT.md

Lines changed: 0 additions & 20 deletions
This file was deleted.

CONTRIBUTING.md

Lines changed: 0 additions & 40 deletions
This file was deleted.

README.md

Lines changed: 27 additions & 22 deletions
@@ -1,8 +1,14 @@
 # CodeScaleBench
 
-Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper *"CodeScaleBench: Evaluating Coding Agents on Real-Scale Software Engineering Tasks Across the Development Lifecycle."*
+Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC.
 
-This repository contains **benchmark task definitions**, **evaluation configs**, and a **metrics extraction pipeline**. Tasks are executed via the [Harbor](https://github.com/laude-institute/harbor/tree/main) runner with the Claude Code agent harness.
+This repository contains:
+- **Benchmark task definitions** (SDLC and Org suites with task specs, tests, and metadata)
+- **Evaluation and run configs** (paired baseline vs MCP-enabled execution modes)
+- **Metrics extraction and reporting pipelines** for score/cost/retrieval analysis
+- **Run artifacts and agent traces** (in `runs/` and published summaries under `docs/official_results/`)
+
+Tasks are executed via the [Harbor](https://github.com/laude-institute/harbor/tree/main) runner with the Claude Code agent harness.
 
 ---
 

@@ -12,7 +18,6 @@ This repository contains **benchmark task definitions**, **evaluation configs**,
 
 - Researchers evaluating coding agents on realistic software engineering tasks
 - Practitioners comparing baseline vs MCP-enabled agent configurations
-- Contributors authoring new benchmark tasks or extending evaluation tooling
 
 ### What you can do without Harbor
 

@@ -37,10 +42,11 @@ ls benchmarks
 Running benchmark tasks requires:
 
 - [Harbor](https://github.com/laude-institute/harbor/tree/main) installed and configured
-- **Daytona** account and API key (preferred — no local Docker needed, up to 125 concurrent sandboxes). See `docs/DAYTONA.md`
-- OR Docker (only needed for 21 sweap-images tasks incompatible with Daytona)
-- Valid agent/runtime credentials used by your Harbor setup
-- A Max subscription (for the default harness path documented in this repo)
+
+Our internal default setup often uses:
+- **Daytona** account and API key (preferred in this repo). See `docs/DAYTONA.md`
+- Docker for Daytona-incompatible tasks
+- Agent/runtime credentials as needed by your Harbor harness
 
 Recommended pre-run checks:
 

@@ -60,7 +66,6 @@ bash configs/run_selected_tasks.sh --dry-run
 - `docs/START_HERE_BY_TASK.md` for task-oriented navigation
 - `docs/reference/CONFIGS.md` for the 2-config evaluation matrix
 - `docs/EVALUATION_PIPELINE.md` for scoring and reporting outputs
-- `docs/REPO_HEALTH.md` for the pre-push health gate
 
 ---
 

@@ -87,17 +92,17 @@ Eleven additional suites measure cross-repo discovery, symbol resolution, depend
 
 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
-| `csb_org_onboarding` | E: Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
-| `csb_org_migration` | C: Framework Migration | 26 | API migrations, breaking changes across repos |
-| `csb_org_security` | B: Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
-| `csb_org_crossrepo_tracing` | A: Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
-| `csb_org_domain` | H: Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
-| `csb_org_incident` | D: Incident Debugging | 20 | Error-to-code-path tracing across microservices |
-| `csb_org_compliance` | F: Compliance | 18 | Standards adherence, audit, and provenance workflows |
-| `csb_org_platform` | J: Platform Knowledge | 18 | Service template discovery and tribal knowledge |
-| `csb_org_crossorg` | G: Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
-| `csb_org_org` | I: Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
-| `csb_org_crossrepo` | K: Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
+| `csb_org_onboarding` | Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
+| `csb_org_migration` | Framework Migration | 26 | API migrations, breaking changes across repos |
+| `csb_org_security` | Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
+| `csb_org_crossrepo_tracing` | Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
+| `csb_org_domain` | Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
+| `csb_org_incident` | Incident Debugging | 20 | Error-to-code-path tracing across microservices |
+| `csb_org_compliance` | Compliance | 18 | Standards adherence, audit, and provenance workflows |
+| `csb_org_platform` | Platform Knowledge | 18 | Service template discovery and tribal knowledge |
+| `csb_org_crossorg` | Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
+| `csb_org_org` | Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
+| `csb_org_crossrepo` | Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
 | **Total** | | **220** | |
 
 **Combined canonical benchmark: 370 tasks** (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in `benchmarks/backups/`.

@@ -110,16 +115,16 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 
 ## 2-Config Evaluation Matrix
 
-All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
+All benchmarks are evaluated across two primary configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
 
 - **SDLC suites** (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
 - **Org suites** (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct` (some legacy runs used `baseline-local-artifact` + `mcp-remote-artifact`)
 
 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
 
-At the paper level, the distinction is still:
+At a high level, the distinction is:
 
-| Paper Config Name | Internal MCP mode | MCP Tools Available |
+| Config Name | Internal MCP mode | MCP Tools Available |
 |-------------------|---------------------|---------------------|
 | Baseline | `none` | None (agent uses only built-in tools) |
 | MCP-Full | `sourcegraph_full` / `artifact_full` (task-dependent) | All 13 Sourcegraph MCP tools including `sg_deepsearch`, `sg_deepsearch_read` |
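The legacy-name handling the README mentions can be sketched as a small lookup. This is a hypothetical helper (the repo's actual analysis scripts are not shown in this commit); the config names themselves all come from the README text above:

```python
# Hypothetical mapping of run-directory names, including legacy ones,
# onto the two primary configurations (Baseline vs MCP-Full).
LEGACY_TO_CONFIG = {
    "baseline": "Baseline",
    "baseline-local-direct": "Baseline",
    "baseline-local-artifact": "Baseline",
    "sourcegraph_full": "MCP-Full",
    "artifact_full": "MCP-Full",
    "mcp-remote-direct": "MCP-Full",
    "mcp-remote-artifact": "MCP-Full",
}

def normalize_config(run_dir_name: str) -> str:
    """Map a run directory name (possibly legacy) to its primary config."""
    try:
        return LEGACY_TO_CONFIG[run_dir_name]
    except KeyError:
        raise ValueError(f"unknown run config: {run_dir_name}")
```

A normalization step like this lets historical outputs (`sourcegraph_full`, `artifact_full`) aggregate alongside current `mcp-remote-*` runs.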

SECURITY.md

Lines changed: 0 additions & 21 deletions
This file was deleted.

docs/CONTROL_PLANE.md

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ Either way, the **control plane** is the spec + manifest; the runner is a consum
 
 ## Relation to existing v2 experiment YAMLs
 
-The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `run-eval run -c experiment.yaml`) that uses Harbor’s **registry** and dataset/task_names. That path is well-suited to benchmarks like swebenchpro that are in the registry.
+The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `scripts/run-eval run -c experiment.yaml`) that uses Harbor’s **registry** and dataset/task_names. That path is well-suited to benchmarks like swebenchpro that are in the registry.
 
 The **control plane layer** described here is complementary:
