Scoring Semantics

How each benchmark is scored, what the numbers mean, and known limitations.

Scoring Types

Type	Range	Description
binary	0.0 or 1.0	Pass/fail — all tests must pass
checklist	0.0–1.0 continuous	Weighted sum of boolean checks
test-ratio	0.0–1.0 continuous	Fraction of test cases passing
similarity	0.0–1.0 continuous	Semantic or keyword similarity to ground truth
F1-hybrid	0.0–1.0 continuous	Detection F1 blended with fix quality
ordering	0.0–1.0 continuous	Position-exact-match blended with rank correlation
external	0.0–1.0 continuous	External verifier (e.g., TheAgentCompany eval)

Canonical tasks should normalize these families into /logs/verifier/validation_result.json using docs/reference/VALIDATION_RESULT_SCHEMA.md. The reward type determines the meaning of reward and sub_scores, but not the top-level contract.

See docs/reference/CANONICAL_EVALUATION_POLICY.md for the stable policy that ties these families together: deterministic verifier reward is universal, artifact support is hybrid and family-specific, and reporting must keep reward separate from pass semantics.

Per-Verifier Scoring (Active Suites)

Tasks are organized into 8 SDLC-phase suites (csb_sdlc_understand through csb_sdlc_debug) and 10 Org suites (csb_org_*). Within each suite, individual tasks use one of the verifier types below, inherited from their source benchmark. See docs/TASK_CATALOG.md for which verifier each task uses and docs/TASK_SELECTION.md for the SDLC suite structure.

SWE-bench Pro

Type: test-ratio (pytest-based)
Mechanism: Runs project-specific pytest suite; score = fraction of tests passing
Good score: 1.0 (all repo tests pass)
Limitations: Upstream test suites vary in granularity. Some tasks have a single test (effectively binary); others have 50+. The score does not penalize regressions in unrelated tests.
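A minimal sketch of the test-ratio computation, assuming per-test outcomes are already available as booleans. `pass_ratio` is a hypothetical helper for illustration, not the project's verifier, and returning 0.0 for an empty suite is an assumption.

```python
from typing import Sequence


def pass_ratio(outcomes: Sequence[bool]) -> float:
    """Fraction of passing tests; 0.0 when no tests ran (an assumption)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# A task with one upstream test can only score 0.0 or 1.0.
single_test = pass_ratio([True])

# A task with 50 upstream tests earns partial credit.
large_suite = pass_ratio([True] * 45 + [False] * 5)

print(single_test, large_suite)
```

The two calls illustrate the granularity caveat: the same verifier type behaves as pass/fail on a one-test task and as a continuous score on a large suite.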

LargeRepo

Type: checklist (compilation + keyword + test)
Mechanism: Weighted checks: keyword presence (0.3), relevant file changes in ≥2 files (0.2), test additions (0.2), unit tests pass (0.3)
Good score: ≥0.7 (keyword found, tests pass, multi-file changes)
Limitations: Keyword-based scoring can't verify implementation correctness. Compilation check provides strong signal but only covers modified packages.
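The weighted checklist above can be sketched as follows. The weights are the ones listed under Mechanism; the check names and the `checklist_score` helper are hypothetical and chosen only for illustration.

```python
from typing import Mapping

# Weights from the LargeRepo mechanism; the key names are illustrative.
LARGEREPO_WEIGHTS = {
    "keyword_present": 0.3,
    "multi_file_change": 0.2,  # relevant changes in at least 2 files
    "tests_added": 0.2,
    "unit_tests_pass": 0.3,
}


def checklist_score(checks: Mapping[str, bool], weights: Mapping[str, float]) -> float:
    """Sum the weights of every check that passed; missing checks count as failed."""
    return sum(w for name, w in weights.items() if checks.get(name, False))


# Keyword found, multi-file change, tests pass, but no new tests were added.
result = checklist_score(
    {
        "keyword_present": True,
        "multi_file_change": True,
        "tests_added": False,
        "unit_tests_pass": True,
    },
    LARGEREPO_WEIGHTS,
)
print(round(result, 2))
```

Under these weights, a run that misses only the test-addition check still clears the ≥0.7 "good score" bar.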

DocGen

Type: checklist (weighted keyword checks)
Mechanism: Multi-check scoring: file exists (0.1), minimum content (0.1), keyword presence with negation filtering (0.8 across checks)
Good score: ≥0.7 (comprehensive documentation with key concepts)
Limitations: Keyword checks are context-aware (reject negated mentions) but can't verify technical accuracy. Minimum word count prevents empty-file gaming.
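One way negation filtering could work is sketched below. This is an assumption about the general technique, not the DocGen verifier's actual logic: the negation word list, the three-word look-back window, and the `keyword_affirmed` helper are all hypothetical.

```python
import re

# Hypothetical negation cues; the real verifier's list is not documented here.
NEGATIONS = {"not", "no", "never", "without", "doesn't", "don't", "cannot"}


def keyword_affirmed(text: str, keyword: str, window: int = 3) -> bool:
    """True if the keyword appears in some sentence without a nearby preceding negation."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        idx = lowered.find(keyword.lower())
        if idx == -1:
            continue
        # Look only at the few words immediately before the keyword.
        preceding_words = re.findall(r"[a-z']+", lowered[:idx])
        if not any(word in NEGATIONS for word in preceding_words[-window:]):
            return True
    return False


print(keyword_affirmed("The cache uses LRU eviction.", "LRU eviction"))
print(keyword_affirmed("The cache does not use LRU eviction.", "LRU eviction"))
```

A filter of this kind rejects negated mentions but still accepts any affirmative sentence containing the keyword, which is why it cannot vouch for technical accuracy.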

CrossRepo

Type: similarity (patch validation)
Mechanism: score = 0.4 * file_coverage + 0.6 * pattern_score
Good score: ≥0.6 (correct files modified with expected patterns)
Limitations: Pattern matching is regex-based and may miss valid alternative implementations. Fallback diff collection restricted to expected file paths to prevent gaming.
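A minimal sketch of the CrossRepo formula. The 0.4/0.6 weights come from the mechanism above; defining `file_coverage` as the fraction of expected files modified and `pattern_score` as the fraction of expected regexes matched in the diff is an assumption, as is the `crossrepo_score` helper name.

```python
import re
from typing import Iterable, Set


def crossrepo_score(
    expected_files: Set[str],
    modified_files: Set[str],
    expected_patterns: Iterable[str],
    diff_text: str,
) -> float:
    """score = 0.4 * file_coverage + 0.6 * pattern_score (component definitions assumed)."""
    if expected_files:
        file_coverage = len(expected_files & modified_files) / len(expected_files)
    else:
        file_coverage = 0.0

    patterns = list(expected_patterns)
    if patterns:
        matched = sum(1 for p in patterns if re.search(p, diff_text))
        pattern_score = matched / len(patterns)
    else:
        pattern_score = 0.0

    return 0.4 * file_coverage + 0.6 * pattern_score


score = crossrepo_score(
    expected_files={"api/client.go", "server/handler.go"},
    modified_files={"api/client.go"},
    expected_patterns=[r"func\s+NewClient", r"ctx\.WithTimeout"],
    diff_text="+func NewClient(cfg Config) *Client {",
)
print(score)
```

In this example one of two expected files is touched and one of two patterns matches, so both components are 0.5. A valid change that sets the timeout through a different call would not match the second regex, which is the limitation noted above.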

Enterprise

Type: checklist (multi-signal)
Mechanism: Weighted checks for impact analysis accuracy, refactoring completeness, dependency discovery precision
Good score: ≥0.6 (correct dependencies identified, accurate impact assessment)
Limitations: Enterprise tasks have complex multi-dimensional scoring; partially correct analyses receive partial credit.

PyTorch

Type: diff_similarity (diff-based)
Mechanism: Compares agent's code changes against expected ground-truth diff; score = 0.35 * file_recall + 0.45 * line_recall + 0.20 * line_precision. sgt-001 uses a custom verifier (file + pattern checks).
Good score: ≥0.5 (correct files touched with matching changes)
Limitations: Diff similarity rewards matching the reference solution line-by-line; functionally equivalent but differently structured fixes may score lower.
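The diff-similarity formula can be illustrated with a small sketch. The weights are those given above; representing a diff as a set of `(file, changed line)` pairs and comparing lines by exact text are simplifying assumptions, and `diff_similarity` is a hypothetical helper.

```python
from typing import Set, Tuple

ChangedLine = Tuple[str, str]  # (file path, changed line text)


def diff_similarity(expected: Set[ChangedLine], actual: Set[ChangedLine]) -> float:
    """0.35 * file_recall + 0.45 * line_recall + 0.20 * line_precision."""
    expected_files = {path for path, _ in expected}
    actual_files = {path for path, _ in actual}
    file_recall = (
        len(expected_files & actual_files) / len(expected_files) if expected_files else 0.0
    )

    common = expected & actual
    line_recall = len(common) / len(expected) if expected else 0.0
    line_precision = len(common) / len(actual) if actual else 0.0

    return 0.35 * file_recall + 0.45 * line_recall + 0.20 * line_precision


reference = {("ops.py", "x = 1"), ("ops.py", "y = 2")}
# Same file, same behavior, but the second line is written differently.
agent = {("ops.py", "x = 1"), ("ops.py", "y = 1 + 1")}
print(diff_similarity(reference, agent))
```

Here the agent touches the right file and produces an equivalent change, yet line recall and precision are both 0.5, so the score is about 0.675 rather than 1.0.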

NavProve

Type: checklist (navigation accuracy)
Mechanism: Checks whether agent correctly traced provenance, located target behaviors, and provided accurate justification
Good score: ≥0.6 (correct navigation path with supporting evidence)
Limitations: Scoring rewards matching expected navigation targets; alternative valid paths may receive lower scores.

CodeReview

Type: F1-hybrid (detection + fix quality)
Mechanism: score = 0.5 * detection_F1 + 0.5 * fix_score. Detection matches reported defects to expected defects by file path. Fix scoring checks for correct fix patterns with multiple acceptable alternatives.
Good score: ≥0.7 (most defects found and fixed)
Limitations: Fix pattern matching accepts alternatives but may miss novel correct approaches. Detection is file-path-based: reporting the right file but the wrong defect still counts as a match.
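A minimal sketch of the F1-hybrid score, assuming detection is reduced to sets of file paths (one defect per file, which is a simplification) and that `fix_score` has already been computed elsewhere. The helper names are hypothetical.

```python
from typing import Set


def detection_f1(expected_files: Set[str], reported_files: Set[str]) -> float:
    """F1 over file paths: harmonic mean of precision and recall."""
    true_positives = len(expected_files & reported_files)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(reported_files)
    recall = true_positives / len(expected_files)
    return 2 * precision * recall / (precision + recall)


def code_review_score(
    expected_files: Set[str], reported_files: Set[str], fix_score: float
) -> float:
    """score = 0.5 * detection_F1 + 0.5 * fix_score."""
    return 0.5 * detection_f1(expected_files, reported_files) + 0.5 * fix_score


expected = {"lib/a.c", "lib/b.c"}
reported = {"lib/a.c", "lib/c.c"}  # one hit, one false positive, one miss
print(code_review_score(expected, reported, fix_score=0.5))
```

Because matching uses only the path, a report that names `lib/a.c` counts as a hit regardless of which defect in that file it describes.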

DIBench

Type: test-ratio
Mechanism: Runs dependency installation tests; score = fraction passing
Good score: 1.0 (all dependencies installed correctly)
Limitations: Network-dependent tasks may fail due to registry availability.

Governance

Type: checklist (policy compliance)
Mechanism: Checks for correct access control implementation, policy enforcement, audit trail completeness
Good score: ≥0.7 (policies correctly implemented and enforced)
Limitations: Binary checks for policy presence may miss subtle implementation errors.

NLQA

Type: similarity (answer quality)
Mechanism: Compares agent's natural-language answer against reference answer using keyword and semantic matching
Good score: ≥0.6 (accurate answer with relevant code references)
Limitations: NL similarity can't fully verify technical accuracy of complex architectural explanations.

Onboarding

Type: checklist (comprehension checks)
Mechanism: Checks for correct identification of key components, workflows, and dependencies
Good score: ≥0.6 (accurate orientation and component identification)
Limitations: Keyword-based checks may miss valid alternative descriptions.

Security

Type: checklist (security analysis accuracy)
Mechanism: Checks for correct CVE identification, reachability assessment, and mitigation recommendations
Good score: ≥0.7 (correct vulnerabilities identified with accurate reachability analysis)
Limitations: Security analysis scoring is pattern-based; novel but correct security insights may score lower.

TAC (TheAgentCompany)

Type: external (TAC eval.py)
Mechanism: External evaluator from TheAgentCompany; scores task completion
Good score: 1.0
Limitations: External verifier — not modified by this project.

LinuxFLBench

Type: checklist (fault localization accuracy)
Mechanism: Checks if agent identified correct buggy file and functions. Ground truth loaded from tests/ground_truth.json.
Good score: ≥0.7 (correct file and at least one correct function)
Limitations: Only accepts exact file path and function name matches.

Investigation

Type: checklist (investigation thoroughness)
Mechanism: Checks for correct root cause identification, evidence gathering, and impact assessment
Good score: ≥0.6 (correct root cause with supporting evidence)
Limitations: Multi-dimensional scoring may not capture all valid investigation approaches.

SWE-Perf

Type: external (task-specific verifier)
Mechanism: Performance benchmarks with custom verification
Good score: 1.0
Limitations: External verifiers — not modified by this project.

Score Distribution Expectations

Benchmark	Expected Baseline Range	Notes
SWE-bench Pro	0.3–0.5	Hard real-world bugs
LargeRepo	0.2–0.5	Large codebase navigation required
DocGen	0.5–0.8	Documentation generation is tractable
CrossRepo	0.3–0.6	Multi-repo coordination is hard
Enterprise	0.2–0.5	Complex multi-dimensional enterprise tasks
PyTorch	0.05–0.25	Diff similarity; 11 tasks
NavProve	0.3–0.6	Navigation and tracing tasks
CodeReview	0.3–0.6	Finding + fixing defects
DIBench	0.4–0.7	Dependency installation
Governance	0.3–0.6	Policy enforcement tasks
NLQA	0.3–0.6	Natural-language Q&A about code
Onboarding	0.3–0.6	Codebase orientation
Security	0.3–0.6	Security analysis
TAC	0.3–0.6	Tool-augmented, network-dependent
LinuxFLBench	0.2–0.5	Kernel fault localization
Investigation	0.3–0.6	Deep debugging
SWE-Perf	0.3–0.6	Performance optimization

Defect Annotation Format

Code review and security review tasks in csb_sdlc_test/ use tests/expected_defects.json to define ground truth defects. Each defect entry supports the following optional annotation fields for richer analysis. These fields are informational metadata; scoring logic is unchanged.

Optional Fields

Field	Type	Description
`defect_type`	string enum	Classifies the nature of the defect (see enum below)
`line_start`	integer	First line of the defect location in the source file
`line_end`	integer	Last line of the defect location in the source file

`defect_type` Enum Values

Value	Description
`null-deref`	Null/nil pointer dereference
`resource-leak`	Resource (memory, handle, cache) not properly released
`race-condition`	Concurrent access without proper synchronization
`injection`	Input validation bypass allowing injection or unauthorized input
`logic-error`	Inverted condition, off-by-one, wrong operator, or other logic mistake
`buffer-overflow`	Write past allocated buffer bounds
`use-after-free`	Access to memory after deallocation
`other`	Defect that does not fit the above categories

Example

{
  "id": "defect-1",
  "file": "lib/vtls/openssl.c",
  "line_start": 992,
  "line_end": 997,
  "type": "security",
  "severity": "high",
  "defect_type": "buffer-overflow",
  "description": "Buffer bounds check removed from SSL password callback."
}

Ground Truth Registry Integration

When defect_type is present in an expected_defects.json entry, the ground truth extraction in scripts/csb_metrics/ground_truth.py populates TaskGroundTruth.defect_annotations, a list of DefectAnnotation objects carrying defect_id, file, defect_type, line_start, and line_end. Tasks without defect_type fields produce an empty annotations list. The serialized registry (configs/ground_truth_files.json) includes annotations only when non-empty.
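The extraction described above might look roughly like the sketch below. The `DefectAnnotation` field names come from this section; the dataclass layout, the `extract_annotations` helper, and the assumption that expected_defects.json holds a list of entries are illustrative and may differ from scripts/csb_metrics/ground_truth.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DefectAnnotation:
    defect_id: str
    file: str
    defect_type: str
    line_start: Optional[int] = None
    line_end: Optional[int] = None


def extract_annotations(defects: List[dict]) -> List[DefectAnnotation]:
    """Build annotations only for entries that carry a defect_type."""
    return [
        DefectAnnotation(
            defect_id=entry["id"],
            file=entry["file"],
            defect_type=entry["defect_type"],
            line_start=entry.get("line_start"),
            line_end=entry.get("line_end"),
        )
        for entry in defects
        if "defect_type" in entry
    ]


defects = [
    {
        "id": "defect-1",
        "file": "lib/vtls/openssl.c",
        "line_start": 992,
        "line_end": 997,
        "defect_type": "buffer-overflow",
    },
    {"id": "defect-2", "file": "lib/url.c"},  # no defect_type: skipped
]
annotations = extract_annotations(defects)
print(annotations)
```

An input with no annotated entries yields an empty list, matching the behavior described for tasks without defect_type fields.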

CodeScaleBench-Org Suite Scoring (csb_org_* suites)

Org tasks use a unified oracle check library for deterministic scoring, with an optional rubric judge for Deep Search synthesis tasks.

This section is Org-specific. The /workspace/answer.json format below is not the universal canonical benchmark contract; other families may use bridge-mode artifacts or repo-state verification instead.

Oracle Checks (scripts/csb_metrics/oracle_checks.py)

All Org tasks are scored by oracle_checks.py, a stdlib-only Python library invoked by eval.sh. There are 7 check types:

Check Type	Task Spec Field	Primary Score	Description
`file_set_match`	`required_files`	F1 (harmonic mean of recall and precision)	Files found vs oracle file list
`symbol_resolution`	`required_symbols`	Recall	Symbols found vs oracle symbol list
`dependency_chain`	`dependency_chains`	Chain recall	Chain steps found, order verified
`provenance`	`must_cite_paths` / `must_cite_repos`	Provenance score	Agent text cites required paths/repos
`keyword_presence`	`required_keywords`	Keyword recall	Required keywords in agent's answer text
`json_schema_match`	`schema_path`	1.0 if valid, 0.0 if invalid	Answer JSON validates against schema
`test_ratio`	`test_command`	Pass ratio (passed / total)	For Category I: generated code passes tests

Composite score = mean of primary scores across all configured checks.

composite_score = mean([check_1_primary_score, check_2_primary_score, ...])

Exit code: eval.sh exits 0 if composite > 0 (agent found something), 1 if composite == 0 (total failure). Harbor reads the score from /logs/verifier/reward.txt.
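The composite and exit-code rules can be sketched together. This is an illustration under assumptions, not the code in oracle_checks.py or eval.sh: the helper names are hypothetical, and treating "no configured checks" as a score of 0.0 is a guess.

```python
from statistics import mean
from typing import Mapping


def composite_score(primary_scores: Mapping[str, float]) -> float:
    """Unweighted mean of the primary score of every configured check."""
    if not primary_scores:
        return 0.0  # assumption: no configured checks means no credit
    return mean(list(primary_scores.values()))


def exit_code(composite: float) -> int:
    """0 if the agent found anything at all, 1 on total failure."""
    return 0 if composite > 0 else 1


scores = {
    "file_set_match": 0.8,      # F1
    "symbol_resolution": 0.5,   # recall
    "keyword_presence": 1.0,    # keyword recall
}
composite = composite_score(scores)
print(round(composite, 4), exit_code(composite))
```

Note that the exit code only distinguishes zero from non-zero; the graded result is the score itself, which Harbor reads from /logs/verifier/reward.txt.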

No hardcoded thresholds: Raw scores enable post-run calibration. See docs/ORG_CALIBRATION.md for threshold guidance after first runs complete.

Agent Answer Format

Agents write /workspace/answer.json for Org tasks using the native answer-artifact contract:

{
  "files": [{"repo": "org/name", "path": "path/to/file.go"}],
  "symbols": [{"repo": "org/name", "path": "path/to/file.go", "name": "SymbolName"}],
  "chain": [
    {"repo": "org/name", "path": "path/to/file.go", "symbol": "FirstStep"},
    {"repo": "org2/name2", "path": "path/to/file2.go", "symbol": "SecondStep"}
  ],
  "text": "Narrative explanation with citations to repos and file paths..."
}

The text field is required for provenance and keyword_presence checks (the oracle checks match substrings against it). The files, symbols, and chain fields are required for their respective check types.
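A sketch of how a substring-based keyword check could consume the answer file. Reading the `text` field and matching substrings follows the description above; case-insensitive matching, the zero score for an empty keyword list, and the `keyword_recall` helper name are assumptions.

```python
import json
from typing import Sequence


def keyword_recall(answer: dict, required_keywords: Sequence[str]) -> float:
    """Fraction of required keywords that occur as substrings of answer['text']."""
    if not required_keywords:
        return 0.0
    text = answer.get("text", "").lower()  # case-insensitive matching is an assumption
    hits = sum(1 for keyword in required_keywords if keyword.lower() in text)
    return hits / len(required_keywords)


# Inline stand-in for the contents of /workspace/answer.json.
answer = json.loads(
    """
    {
      "files": [{"repo": "org/name", "path": "path/to/file.go"}],
      "text": "The request enters at FirstStep in org/name and is forwarded to SecondStep."
    }
    """
)
print(keyword_recall(answer, ["FirstStep", "SecondStep", "ThirdStep"]))
```

An answer that omits `text` entirely scores 0.0 on this check even if `files` and `symbols` are complete, which is why the field is required for text-based checks.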

Hybrid Scoring (Deep Search tasks)

Three tasks (marked deepsearch_relevant=true) additionally use rubric judge scoring:

CCX-onboard-050-ds, CCX-explore-042-ds, CCX-explore-091-ds

These tasks include tests/criteria.json with AAA (Accurate, Attributed, Actionable) criteria:

[
  {
    "metric": "cross_repo_synthesis",
    "description": "Accurate: correctly traces flow across repos. Attributed: cites specific files/repos. Actionable: identifies specific code entry points.",
    "max_score": 4
  }
]

Hybrid composite (configurable, default 60/40):

hybrid_score = 0.6 × verifier_reward + 0.4 × rubric_score

Run hybrid scoring:

python3 scripts/run_judge.py --hybrid --task CCX-onboard-050-ds

The judge_result.json output includes criteria_scores, rubric_score, and hybrid_composite when criteria.json is present.
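The default 60/40 blend can be sketched as below. The formula and the criteria.json fields (`metric`, `max_score`) come from this section; normalizing the rubric to 0.0–1.0 by dividing awarded points by total `max_score`, the shape of `criteria_scores` as a metric-to-points mapping, and both helper names are assumptions rather than the behavior of scripts/run_judge.py.

```python
from typing import List, Mapping


def rubric_score(criteria: List[dict], criteria_scores: Mapping[str, float]) -> float:
    """Awarded points divided by available points (normalization is assumed)."""
    available = sum(c["max_score"] for c in criteria)
    if available == 0:
        return 0.0
    awarded = sum(criteria_scores.get(c["metric"], 0) for c in criteria)
    return awarded / available


def hybrid_score(verifier_reward: float, rubric: float, verifier_weight: float = 0.6) -> float:
    """hybrid_score = w * verifier_reward + (1 - w) * rubric_score, default w = 0.6."""
    return verifier_weight * verifier_reward + (1 - verifier_weight) * rubric


criteria = [{"metric": "cross_repo_synthesis", "max_score": 4}]
rubric = rubric_score(criteria, {"cross_repo_synthesis": 3})
print(rubric, round(hybrid_score(0.5, rubric), 4))
```

With a verifier reward of 0.5 and 3 of 4 rubric points, the blend comes to roughly 0.6 under these assumptions.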

Retrieval KPIs (scripts/csb_metrics/retrieval.py)

In addition to reward, retrieval metrics are extracted from agent transcripts:

Metric	Description
`oracle_coverage`	Fraction of oracle items found via any tool (local or MCP)
`time_to_first_oracle_hit_ms`	Time from task start to first oracle item found
`unique_repos_touched`	Number of distinct repos accessed during the task
`unique_orgs_touched`	Number of distinct GitHub orgs accessed
`mcp_tool_counts`	Count of each MCP tool used
`local_tool_counts`	Count of each local tool used

Key principle (Q10): oracle_coverage counts items found via any tool — both local grep and MCP. Baseline agents can score non-zero if oracle items happen to be in their local checkout. MCP advantage shows up in coverage of MCP-only repos.
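The coverage principle can be illustrated with a small sketch. Oracle items are modeled here as `"repo:path"` strings and the `oracle_coverage` helper is hypothetical; the actual transcript parsing in retrieval.py is not shown.

```python
from typing import Set


def oracle_coverage(oracle_items: Set[str], found_items: Set[str]) -> float:
    """Fraction of oracle items the agent found, regardless of which tool found them."""
    if not oracle_items:
        return 0.0
    return len(oracle_items & found_items) / len(oracle_items)


oracle = {
    "org/local-repo:pkg/server.go",
    "org/remote-repo:api/client.go",
}

# Baseline agent: only the local checkout is reachable.
found_by_local_tools = {"org/local-repo:pkg/server.go"}
# MCP agent: local tools plus MCP search over a repo that is not checked out.
found_by_mcp_tools = {"org/remote-repo:api/client.go"}

baseline = oracle_coverage(oracle, found_by_local_tools)
with_mcp = oracle_coverage(oracle, found_by_local_tools | found_by_mcp_tools)
print(baseline, with_mcp)
```

The baseline scores 0.5 from the local checkout alone, and the gap to 1.0 is exactly the item that lives in the MCP-only repo.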

Archived Suite Scoring (for reference)

The following suites are archived and not included in official evaluation:

LoCoBench: similarity (weighted multi-signal) — keyword overlap, file references, code blocks, structural coherence
RepoQA: similarity (semantic retrieval) — binary in practice (correct function or not)
DependEval: ordering (position-exact + Kendall tau) — score = 0.6 * position_exact_match + 0.4 * kendall_tau_normalized
K8s Docs: checklist (weighted keyword checks) — superseded by DocGen
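For reference, the DependEval ordering formula can be sketched as follows. The 0.6/0.4 weights come from the entry above; mapping Kendall tau from [-1, 1] to [0, 1] as (tau + 1) / 2 is an assumption about what "normalized" means, and the sketch assumes both sequences contain the same distinct items (at least two).

```python
from itertools import combinations
from typing import Sequence


def ordering_score(expected: Sequence[str], predicted: Sequence[str]) -> float:
    """0.6 * position_exact_match + 0.4 * kendall_tau_normalized."""
    n = len(expected)
    position_exact = sum(e == p for e, p in zip(expected, predicted)) / n

    # Kendall tau: compare every pair's relative order in both sequences.
    rank = {item: index for index, item in enumerate(predicted)}
    concordant = discordant = 0
    for first, second in combinations(expected, 2):  # first precedes second in expected
        if rank[first] < rank[second]:
            concordant += 1
        else:
            discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    tau_normalized = (tau + 1) / 2  # assumed normalization to [0, 1]

    return 0.6 * position_exact + 0.4 * tau_normalized


# Two middle items swapped: 2 of 4 positions exact, 5 of 6 pairs in order.
print(round(ordering_score(["a", "b", "c", "d"], ["a", "c", "b", "d"]), 4))
```

The rank-correlation term gives substantial credit for a nearly correct order even when exact positions are missed.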
