
Commit c8a106f

Remove pair-count parenthetical from abstract retrieval sentence
1 parent a0e2b44 commit c8a106f

File tree

1 file changed (+1, −1)


docs/technical_reports/TECHNICAL_REPORT.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@

 ## Abstract

-CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates coding agent capabilities with and without context retrieval **Model Context Protocol (MCP)** tools. Results are reported using pair-normalized baseline vs MCP comparisons on matched tasks, with suite-level reward breakdowns across 20 suites. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org. For retrieval quality on the curated analysis set (329 paired tasks), combined metrics improve from baseline to MCP as follows: **Precision@10 0.095 -> 0.313**, **Recall@10 0.120 -> 0.272**, and **F1@10 0.091 -> 0.240**. For efficiency, the canonical haiku paired estimate shows average cost per task drops from **$0.7333** to **$0.5121** (**-30.16%**), with mean wall-clock delta of **-36.22s** and mean agent-execution delta of **-101.06s**. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
+CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates coding agent capabilities with and without context retrieval **Model Context Protocol (MCP)** tools. Results are reported using pair-normalized baseline vs MCP comparisons on matched tasks, with suite-level reward breakdowns across 20 suites. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org. For retrieval quality on the curated analysis set, combined metrics improve from baseline to MCP as follows: **Precision@10 0.095 -> 0.313**, **Recall@10 0.120 -> 0.272**, and **F1@10 0.091 -> 0.240**. For efficiency, the canonical haiku paired estimate shows average cost per task drops from **$0.7333** to **$0.5121** (**-30.16%**), with mean wall-clock delta of **-36.22s** and mean agent-execution delta of **-101.06s**. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.

 ---
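The abstract's "pair-normalized baseline vs MCP comparisons on matched tasks" can be sketched as a mean of per-task reward differences over the intersection of task IDs present in both runs. The commit does not show CSB's actual aggregation code; the function name and dict-based interface below are hypothetical, a minimal sketch of the described computation.

```python
# Hypothetical sketch of the pair-normalized comparison described in the
# abstract: each matched task is run twice (baseline and MCP), and the
# reported delta is the mean of per-task (MCP - baseline) rewards.
def paired_reward_delta(baseline, mcp):
    """baseline, mcp: dicts mapping task id -> reward.

    Only tasks present in both runs (matched pairs) contribute,
    which is what makes the comparison 'paired'.
    """
    matched = baseline.keys() & mcp.keys()
    if not matched:
        raise ValueError("no matched tasks between the two runs")
    return sum(mcp[t] - baseline[t] for t in matched) / len(matched)
```

Pairing on matched tasks means an unmatched task in either run (e.g. a task that only completed under MCP) is excluded rather than biasing the delta.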

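The retrieval-quality numbers in the abstract (Precision@10, Recall@10, F1@10) follow the standard top-k retrieval definitions. CSB's own metric implementation is not part of this diff; the sketch below uses the conventional definitions, with a hypothetical function name, assuming retrieved results are a ranked list and ground truth is a set of relevant items.

```python
# Standard top-k retrieval metrics (assumed definitions, not CSB's code):
# Precision@k = hits / k, Recall@k = hits / |relevant|,
# F1@k = harmonic mean of the two.
def precision_recall_f1_at_k(retrieved, relevant, k=10):
    """retrieved: ranked list of items (e.g. file paths) from retrieval.
    relevant:  set of ground-truth relevant items for the task.
    """
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1
```

Note that Precision@k divides by k even when fewer than k items are returned, which penalizes short result lists; per-task values would then be averaged across the paired analysis set to produce combined figures like those quoted in the abstract.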