Retrieval Evaluation Specification

Status: v1 — standalone, non-ranking. This framework evaluates retrieval quality and its downstream impact on task outcomes without changing primary CSB scoring or leaderboard semantics.

Purpose

Measure three aspects of agent retrieval behavior:

Retrieval quality — did the agent find the right files/symbols?
Utilization quality — did the agent use retrieved evidence correctly?
Downstream impact — how do retrieval metrics correlate with task outcomes, cost, and time?

Schema Overview

The normalized retrieval event schema (schemas/retrieval_events_schema.json, version 1.0) defines a single JSON document per task-config pair containing:

Section	Purpose
`provenance`	Run/task/config identification
`coverage`	Trace and ground-truth availability flags
`ground_truth`	Expected files, optional symbols and chunks
`events`	Ordered step-level retrieval events
`summary`	Pre-computed aggregate counts (optional)

Field Semantics

Provenance

Uniquely identifies the task execution:

run_id — staging or official run directory name.
batch_timestamp — batch subdirectory within the run.
task_name — canonical task identifier (matches task.toml name).
config_name — full config label (e.g. baseline-local-direct, mcp-remote-direct).
benchmark — suite name (e.g. csb_sdlc_fix, csb_org_crossorg).

Coverage Flags

Every document reports trace availability explicitly so downstream stages can filter or flag results:

has_trajectory — agent/trajectory.json was found and parseable.
has_transcript — agent/claude-code.txt (JSONL) was found and parseable.
has_ground_truth — file-level expected files exist for the task.
has_chunk_ground_truth — line-range annotations exist (e.g. defect locations in code-review tasks).
trace_source — which source produced the events:
- trajectory — events from trajectory.json only.
- transcript — events from claude-code.txt only.
- merged — events from both sources combined (trajectory preferred for tool calls, transcript for timestamps or subagent recovery).
- null — degraded mode (no usable trace).
degraded_reason — human-readable explanation when events are empty or incomplete.

Ground Truth

Ground truth is loaded from the task definition directory using the existing priority chain in csb_metrics/ground_truth.py:

tests/ground_truth.json (high confidence)
tests/expected_defects.json (high confidence)
tests/expected_changes.json (high confidence)
tests/reference_fix.patch / tests/expected.diff (high confidence)
solution/solve.sh gold patch (medium confidence)
instruction.md / tests/test.sh regex extraction (medium/low confidence)

Three levels of ground truth are supported:

File-level relevant files (ground_truth.files) — always populated when ground truth exists. These are the files considered relevant evidence for solving the task (not necessarily the only valid edit targets). Repo-relative paths.
Symbol-level (ground_truth.symbols) — optional. Function/class names within ground-truth files, loaded from task_spec.json oracle items.
Expected edit files (ground_truth.expected_edit_files) — optional, conservative edit-target file set inferred only from high-confidence sources such as expected_changes.json and patch-based references (reference_fix.patch, expected.diff, expected.patch, gold patch in solve.sh). This field is absent when edit-target semantics cannot be inferred reliably.
Chunk-level (ground_truth.chunks) — optional. Line ranges within files, loaded from expected_defects.json annotations or similar.

When coverage.has_ground_truth is false, ground_truth.files is an empty array and all IR metrics are marked as non-computable.

Retrieval Events

Each event represents one retrieval-related tool call by the agent:

step_index — zero-based position in the trace. Preserves execution order.
tool_name — raw name from the trace (e.g. Read, mcp__sourcegraph__sg_keyword_search).
tool_category — normalized category for cross-config comparison:

Category	Local tools	MCP tools
`file_read`	Read	read_file
`file_search`	Glob, Grep	list_files
`symbol_navigation`	—	find_references, go_to_definition
`code_search`	Grep (pattern)	keyword_search, nls_search
`commit_search`	—	commit_search, diff_search, compare_revisions
`deep_search`	—	deepsearch, deepsearch_read
`file_write`	Write, Edit	—
`other`	Bash, Task	get_contributor_repos, list_repos

is_mcp — true for any mcp__sourcegraph__* tool call.
target_files — normalized file paths accessed or returned. Normalization strips /workspace/, /repo_full/, /testbed/, and diff a//b/ prefixes; paths are lowercased for matching.
hits_ground_truth — true if any target_file matches a ground-truth file.
cumulative_tokens — running token total up to this step (when available).
elapsed_seconds — wall-clock time from agent execution start.

Event Summary

Optional pre-computed counts to avoid re-scanning the events array:

total_events, mcp_events, local_events
unique_files_accessed, ground_truth_files_hit
first_ground_truth_hit_step
events_by_category (keyed by tool_category)

Degraded Mode Behavior

The pipeline handles incomplete data gracefully:

Condition	Behavior
No trajectory AND no transcript	`events` is empty, `coverage.trace_source` is null, `coverage.degraded_reason` explains
Trajectory only (no transcript)	Events extracted from trajectory; timestamps may be absent for some steps
Transcript only (no trajectory)	Events extracted from transcript; subagent tool calls may be missed
No ground truth	`ground_truth.files` is empty; `hits_ground_truth` is false for all events; IR metrics non-computable
No chunk ground truth	`ground_truth.chunks` absent; chunk-level metrics emit `resolution: "file_level_only"` flag

Downstream metric stages MUST check coverage flags before computing metrics and propagate appropriate non_computable markers rather than emitting misleading zeroes.

Schema Versioning

The schema_version field is a semver-style string (currently "1.0").
Minor bumps (1.1, 1.2, ...) add optional fields. Consumers of 1.0 data continue to work unchanged.
Major bumps (2.0) change required fields or remove/rename existing ones. Consumers must update.
The normalization CLI embeds the schema version it was built against. Metric stages validate schema_version on load and reject unknown major versions.

Output Paths

Normalized retrieval event files are written to a parallel directory structure that does not overwrite existing run artifacts:

runs/{staging|official}/{run_id}/retrieval_events/
  {config_name}/
    {task_name}.retrieval_events.json

Run-level aggregates are written alongside:

runs/{staging|official}/{run_id}/retrieval_events/
  run_retrieval_summary.json

Pipeline Stages

The full evaluation pipeline (scripts/retrieval_eval_pipeline.py) runs five stages on each normalized event document:

Stage 1: File-Level IR Metrics

Standard information retrieval metrics computed from the ordered list of retrieved files against ground-truth files:

Precision@K, Recall@K, F1@K (K = 1, 3, 5, 10)
MRR (Mean Reciprocal Rank)
nDCG@K (normalized Discounted Cumulative Gain)
MAP (Mean Average Precision)
File-level recall (fraction of GT files found anywhere in retrieved list)
Context efficiency (fraction of retrieved files that are relevant)
TTFR (time-to-first-relevant file, in seconds and tokens)

Tasks without ground truth are marked computable: false.

Stage 2: Chunk-Level Relevance Metrics

When chunk-level ground truth (line-range annotations) is available:

Chunk recall = fraction of GT chunks whose file was accessed by the agent.
Resolution field: "chunk_level" or "file_level_only".
Validity field: "file_match_only" (v1 granularity) or "unsupported".

Chunking assumption: In v1, a retrieval event "covers" a ground-truth chunk if any target_file matches the chunk's file path. Sub-line matching (e.g. exact line range overlap) requires structured diff data and is deferred to future schema versions.

Stage 3: Utilization Probe Metrics

Measures whether retrieved evidence was actually used by the agent:

Primary cross-task probe: util_read_overlap_with_relevant_files = |files_read ∩ relevant_files| / |relevant_files|. Measures whether the agent actually read relevant files (including MCP read_file calls, which are normalized to file_read).
Task-dependent write proxy: util_write_overlap_with_relevant_files_proxy = |files_written ∩ relevant_files| / |relevant_files|. Useful for some fix-style tasks, but should not be treated as a universal utilization metric.
Stronger optional write probe: util_write_overlap_with_expected_edit_files = |files_written ∩ expected_edit_files| / |expected_edit_files| when ground_truth.expected_edit_files is available.
util_read_before_write_ratio = fraction of written files that were read by the agent before being written to. High values indicate deliberate evidence consumption.

Coverage:

probe_available: false only when file-level ground truth is missing.
Read-overlap probes are still computable for read-only tasks.
Write-overlap probes may be null for tasks without file writes.
expected_edit_probe_available: false when expected_edit_files is not inferable from high-confidence task metadata.

Limitations: These probes measure file-level utilization only. They do not validate whether the content written was semantically correct (that is the verifier's job). Future probes may add symbol-level or API-level checks.

Stage 4: Error Taxonomy and Calibration Slices

Five taxonomy labels classify retrieval error modes per-task:

Label	Definition
`irrelevant_retrieval`	Files retrieved that are not in ground truth
`missed_key_evidence`	Ground truth files never retrieved
`wrong_evidence_used`	Non-GT files the agent wrote to
`unused_correct_retrieval`	GT files retrieved but never written to
`ambiguity_near_miss`	Retrieved files in the same directory as a GT file

Two calibration slice dimensions:

Candidate set size: small (≤5 files), medium (6–20), large (>20)
Evidence type: local (no MCP tools used) or mcp (at least one MCP call)

Stage 5: Artifact Emission

Per-task artifacts ({task_name}.retrieval_metrics.json) contain all four metric stages plus provenance and coverage metadata. Run-level summaries (run_retrieval_summary.json) contain aggregated statistics across all computable tasks.

Relationship to Existing Pipeline

This evaluation is standalone and non-ranking in v1:

Does not modify result.json, task_metrics.json, or MANIFEST.json.
Does not affect verifier rewards or leaderboard scoring.
Consumes the same run artifacts as ir_analysis.py and mcp_audit.py.
Future versions may feed retrieval metrics into generate_eval_report.py as an optional supplementary section.

v1 Rollout Boundaries

What v1 Does

Normalizes agent traces into step-level retrieval events.
Computes file-level IR metrics, chunk-level metrics (with fallback), utilization probes, and error taxonomy.
Correlates retrieval metrics with task outcomes (association only).
Generates matched task comparisons between baseline and MCP configs.
Produces standalone human-readable reports.

What v1 Does NOT Do

Does not change verifier rewards, leaderboard scoring, or MANIFEST.json.
Does not block or gate benchmark runs on retrieval quality.
Does not modify existing evaluation pipeline outputs.
Does not claim causal relationships between retrieval and outcomes.

Comparability Requirements

Matched task comparisons require:

Same task executed in both baseline and MCP configs.
Same model and harness version across paired configs.
Result.json present with valid reward for both configs.
At least 3 matched tasks for aggregate statistics.

Unmatched tasks (present in one config but not the other) are excluded from matched comparisons but included in per-config aggregates.

Coverage Caveats

Tasks without file-level ground truth (Org discovery tasks, write-only tasks) are excluded from IR metrics.
Tasks in degraded mode (no trajectory or transcript) emit empty events and are flagged in coverage metadata.
Chunk-level metrics operate at file-match granularity in v1.

Future Integration Points

The following touchpoints exist for optional future integration. None of these should be implemented without explicit policy discussion.

`docs/EVALUATION_PIPELINE.md`

Optional Layer 5: Add retrieval evaluation as an optional post-run analysis layer alongside the existing 4-layer pipeline.
Retrieval metrics could appear as supplementary columns in the eval report tables without affecting the primary scoring dimensions.

`docs/SCORING_SEMANTICS.md`

Retrieval-aware composite scores: A future version could define a weighted composite that includes retrieval quality alongside verifier reward. This would require consensus on weight calibration and must not change existing per-task reward semantics.
Confidence gating: Tasks with low retrieval coverage could receive confidence flags that downstream consumers use for filtering but not score modification.

`docs/ORG_TASKS.md` / `docs/ORG_CALIBRATION.md`

Oracle coverage integration: Org task oracle items could be mapped to retrieval events for oracle-aware retrieval scoring.
Deep Search effectiveness: The deep_search tool category enables future analysis of Deep Search ROI versus keyword/NLS search.

`docs/LEADERBOARD.md`

Retrieval-conditioned rankings: Future leaderboard views could show rankings conditioned on retrieval quality tiers (e.g. "among tasks where the agent retrieved ≥50% of ground truth files"). This would be supplementary, not replacing the primary ranking.

`scripts/generate_eval_report.py`

Supplementary tables: A future version of the report generator could optionally include retrieval quality tables and correlation summaries from the retrieval pipeline output.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retrieval Evaluation Specification

Purpose

Schema Overview

Field Semantics

Provenance

Coverage Flags

Ground Truth

Retrieval Events

Event Summary

Degraded Mode Behavior

Schema Versioning

Output Paths

Pipeline Stages

Stage 1: File-Level IR Metrics

Stage 2: Chunk-Level Relevance Metrics

Stage 3: Utilization Probe Metrics

Stage 4: Error Taxonomy and Calibration Slices

Stage 5: Artifact Emission

Relationship to Existing Pipeline

v1 Rollout Boundaries

What v1 Does

What v1 Does NOT Do

Comparability Requirements

Coverage Caveats

Future Integration Points

`docs/EVALUATION_PIPELINE.md`

`docs/SCORING_SEMANTICS.md`

`docs/ORG_TASKS.md` / `docs/ORG_CALIBRATION.md`

`docs/LEADERBOARD.md`

`scripts/generate_eval_report.py`

See Also

FilesExpand file tree

RETRIEVAL_EVAL_SPEC.md

Latest commit

History

RETRIEVAL_EVAL_SPEC.md

File metadata and controls

Retrieval Evaluation Specification

Purpose

Schema Overview

Field Semantics

Provenance

Coverage Flags

Ground Truth

Retrieval Events

Event Summary

Degraded Mode Behavior

Schema Versioning

Output Paths

Pipeline Stages

Stage 1: File-Level IR Metrics

Stage 2: Chunk-Level Relevance Metrics

Stage 3: Utilization Probe Metrics

Stage 4: Error Taxonomy and Calibration Slices

Stage 5: Artifact Emission

Relationship to Existing Pipeline

v1 Rollout Boundaries

What v1 Does

What v1 Does NOT Do

Comparability Requirements

Coverage Caveats

Future Integration Points

docs/EVALUATION_PIPELINE.md

docs/SCORING_SEMANTICS.md

docs/ORG_TASKS.md / docs/ORG_CALIBRATION.md

docs/LEADERBOARD.md

scripts/generate_eval_report.py

See Also

`docs/EVALUATION_PIPELINE.md`

`docs/SCORING_SEMANTICS.md`

`docs/ORG_TASKS.md` / `docs/ORG_CALIBRATION.md`

`docs/LEADERBOARD.md`

`scripts/generate_eval_report.py`