
Commit c8a106f

Remove pair-count parenthetical from abstract retrieval sentence
1 parent a0e2b44 commit c8a106f

File tree

1 file changed (+1, −1)


docs/technical_reports/TECHNICAL_REPORT.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@

 ## Abstract

-CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates coding agent capabilities with and without context retrieval **Model Context Protocol (MCP)** tools. Results are reported using pair-normalized baseline vs MCP comparisons on matched tasks, with suite-level reward breakdowns across 20 suites. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org. For retrieval quality on the curated analysis set (329 paired tasks), combined metrics improve from baseline to MCP as follows: **Precision@10 0.095 -> 0.313**, **Recall@10 0.120 -> 0.272**, and **F1@10 0.091 -> 0.240**. For efficiency, the canonical haiku paired estimate shows average cost per task drops from **$0.7333** to **$0.5121** (**-30.16%**), with mean wall-clock delta of **-36.22s** and mean agent-execution delta of **-101.06s**. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
+CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates coding agent capabilities with and without context retrieval **Model Context Protocol (MCP)** tools. Results are reported using pair-normalized baseline vs MCP comparisons on matched tasks, with suite-level reward breakdowns across 20 suites. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org. For retrieval quality on the curated analysis set, combined metrics improve from baseline to MCP as follows: **Precision@10 0.095 -> 0.313**, **Recall@10 0.120 -> 0.272**, and **F1@10 0.091 -> 0.240**. For efficiency, the canonical haiku paired estimate shows average cost per task drops from **$0.7333** to **$0.5121** (**-30.16%**), with mean wall-clock delta of **-36.22s** and mean agent-execution delta of **-101.06s**. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.

 ---
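The abstract's "pair-normalized baseline vs MCP comparisons on matched tasks" can be sketched as a mean of per-task reward differences over the intersection of task IDs present in both runs. The commit does not show CSB's actual aggregation code; the function name and dict-based interface below are hypothetical, a minimal sketch of the described computation.

```python
# Hypothetical sketch of the pair-normalized comparison described in the
# abstract: each matched task is run twice (baseline and MCP), and the
# reported delta is the mean of per-task (MCP - baseline) rewards.
def paired_reward_delta(baseline, mcp):
    """baseline, mcp: dicts mapping task id -> reward.

    Only tasks present in both runs (matched pairs) contribute,
    which is what makes the comparison 'paired'.
    """
    matched = baseline.keys() & mcp.keys()
    if not matched:
        raise ValueError("no matched tasks between the two runs")
    return sum(mcp[t] - baseline[t] for t in matched) / len(matched)
```

Pairing on matched tasks means an unmatched task in either run (e.g. a task that only completed under MCP) is excluded rather than biasing the delta.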

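The retrieval-quality numbers in the abstract (Precision@10, Recall@10, F1@10) follow the standard top-k retrieval definitions. CSB's own metric implementation is not part of this diff; the sketch below uses the conventional definitions, with a hypothetical function name, assuming retrieved results are a ranked list and ground truth is a set of relevant items.

```python
# Standard top-k retrieval metrics (assumed definitions, not CSB's code):
# Precision@k = hits / k, Recall@k = hits / |relevant|,
# F1@k = harmonic mean of the two.
def precision_recall_f1_at_k(retrieved, relevant, k=10):
    """retrieved: ranked list of items (e.g. file paths) from retrieval.
    relevant:  set of ground-truth relevant items for the task.
    """
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1
```

Note that Precision@k divides by k even when fewer than k items are returned, which penalizes short result lists; per-task values would then be averaged across the paired analysis set to produce combined figures like those quoted in the abstract.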