Status: v1 — standalone, non-ranking. This framework evaluates retrieval quality and its downstream impact on task outcomes without changing primary CSB scoring or leaderboard semantics.
The framework measures three aspects of agent retrieval behavior:
- Retrieval quality — did the agent find the right files/symbols?
- Utilization quality — did the agent use retrieved evidence correctly?
- Downstream impact — how do retrieval metrics correlate with task outcomes, cost, and time?
The normalized retrieval event schema (`schemas/retrieval_events_schema.json`, version 1.0) defines a single JSON document per task-config pair containing:
| Section | Purpose |
|---|---|
| `provenance` | Run/task/config identification |
| `coverage` | Trace and ground-truth availability flags |
| `ground_truth` | Expected files, optional symbols and chunks |
| `events` | Ordered step-level retrieval events |
| `summary` | Pre-computed aggregate counts (optional) |
The `provenance` section uniquely identifies the task execution:
- `run_id` — staging or official run directory name.
- `batch_timestamp` — batch subdirectory within the run.
- `task_name` — canonical task identifier (matches `task.toml` name).
- `config_name` — full config label (e.g. `baseline-local-direct`, `mcp-remote-direct`).
- `benchmark` — suite name (e.g. `csb_sdlc_fix`, `csb_org_crossorg`).
Every document reports trace availability explicitly so downstream stages can filter or flag results:
- `has_trajectory` — `agent/trajectory.json` was found and parseable.
- `has_transcript` — `agent/claude-code.txt` (JSONL) was found and parseable.
- `has_ground_truth` — file-level expected files exist for the task.
- `has_chunk_ground_truth` — line-range annotations exist (e.g. defect locations in code-review tasks).
- `trace_source` — which source produced the events:
  - `trajectory` — events from `trajectory.json` only.
  - `transcript` — events from `claude-code.txt` only.
  - `merged` — events from both sources combined (trajectory preferred for tool calls, transcript for timestamps or subagent recovery).
  - `null` — degraded mode (no usable trace).
- `degraded_reason` — human-readable explanation when events are empty or incomplete.
Ground truth is loaded from the task definition directory using the existing
priority chain in `csb_metrics/ground_truth.py`:
1. `tests/ground_truth.json` (high confidence)
2. `tests/expected_defects.json` (high confidence)
3. `tests/expected_changes.json` (high confidence)
4. `tests/reference_fix.patch` / `tests/expected.diff` (high confidence)
5. `solution/solve.sh` gold patch (medium confidence)
6. `instruction.md` / `tests/test.sh` regex extraction (medium/low confidence)
Four levels of ground truth are supported:
- File-level relevant files (`ground_truth.files`) — always populated when ground truth exists. These are the files considered relevant evidence for solving the task (not necessarily the only valid edit targets). Repo-relative paths.
- Symbol-level (`ground_truth.symbols`) — optional. Function/class names within ground-truth files, loaded from `task_spec.json` oracle items.
- Expected edit files (`ground_truth.expected_edit_files`) — optional, conservative edit-target file set inferred only from high-confidence sources such as `expected_changes.json` and patch-based references (`reference_fix.patch`, `expected.diff`, `expected.patch`, gold patch in `solve.sh`). This field is absent when edit-target semantics cannot be inferred reliably.
- Chunk-level (`ground_truth.chunks`) — optional. Line ranges within files, loaded from `expected_defects.json` annotations or similar.
When `coverage.has_ground_truth` is false, `ground_truth.files` is an empty
array and all IR metrics are marked as non-computable.
Each event represents one retrieval-related tool call by the agent:
- `step_index` — zero-based position in the trace. Preserves execution order.
- `tool_name` — raw name from the trace (e.g. `Read`, `mcp__sourcegraph__sg_keyword_search`).
- `tool_category` — normalized category for cross-config comparison:
| Category | Local tools | MCP tools |
|---|---|---|
| `file_read` | `Read` | `read_file` |
| `file_search` | `Glob`, `Grep` | `list_files` |
| `symbol_navigation` | — | `find_references`, `go_to_definition` |
| `code_search` | `Grep` (pattern) | `keyword_search`, `nls_search` |
| `commit_search` | — | `commit_search`, `diff_search`, `compare_revisions` |
| `deep_search` | — | `deepsearch`, `deepsearch_read` |
| `file_write` | `Write`, `Edit` | — |
| `other` | `Bash`, `Task` | `get_contributor_repos`, `list_repos` |
- `is_mcp` — true for any `mcp__sourcegraph__*` tool call.
- `target_files` — normalized file paths accessed or returned. Normalization strips `/workspace/`, `/repo_full/`, `/testbed/`, and diff `a/`/`b/` prefixes; paths are lowercased for matching.
- `hits_ground_truth` — true if any `target_file` matches a ground-truth file.
- `cumulative_tokens` — running token total up to this step (when available).
- `elapsed_seconds` — wall-clock time from agent execution start.
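The path normalization described for `target_files` can be sketched as follows (prefix list taken from the rules above; the function name is illustrative, not the pipeline's actual code):

```python
import re

# Sandbox prefixes stripped during normalization (from the rules above).
_STRIP_PREFIXES = ("/workspace/", "/repo_full/", "/testbed/")

def normalize_target_file(path: str) -> str:
    """Strip sandbox and diff prefixes, then lowercase for matching."""
    for prefix in _STRIP_PREFIXES:
        if path.startswith(prefix):
            path = path[len(prefix):]
            break
    path = re.sub(r"^[ab]/", "", path)  # diff-style a/ and b/ prefixes
    return path.lower()
```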
Optional pre-computed counts to avoid re-scanning the events array:
- `total_events`, `mcp_events`, `local_events`
- `unique_files_accessed`, `ground_truth_files_hit`
- `first_ground_truth_hit_step`
- `events_by_category` (keyed by `tool_category`)
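Assuming event dicts shaped as described above, the optional summary could be derived like this (a sketch, not the pipeline's actual code):

```python
from collections import Counter

def build_summary(events: list[dict], ground_truth_files: set[str]) -> dict:
    """Derive the optional aggregate counts from an ordered event list."""
    accessed = {f for e in events for f in e["target_files"]}
    first_hit = next(
        (e["step_index"] for e in events if e["hits_ground_truth"]), None)
    return {
        "total_events": len(events),
        "mcp_events": sum(1 for e in events if e["is_mcp"]),
        "local_events": sum(1 for e in events if not e["is_mcp"]),
        "unique_files_accessed": len(accessed),
        "ground_truth_files_hit": len(accessed & ground_truth_files),
        "first_ground_truth_hit_step": first_hit,
        "events_by_category": dict(Counter(e["tool_category"] for e in events)),
    }
```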
The pipeline handles incomplete data gracefully:
| Condition | Behavior |
|---|---|
| No trajectory AND no transcript | `events` is empty, `coverage.trace_source` is `null`, `coverage.degraded_reason` explains why |
| Trajectory only (no transcript) | Events extracted from trajectory; timestamps may be absent for some steps |
| Transcript only (no trajectory) | Events extracted from transcript; subagent tool calls may be missed |
| No ground truth | `ground_truth.files` is empty; `hits_ground_truth` is false for all events; IR metrics non-computable |
| No chunk ground truth | `ground_truth.chunks` absent; chunk-level metrics emit a `resolution: "file_level_only"` flag |
Downstream metric stages MUST check coverage flags before computing metrics
and propagate appropriate non_computable markers rather than emitting
misleading zeroes.
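The guard pattern above might look like this in a metric stage (`compute_ir_metrics` is a placeholder for the real computation, and the function names are illustrative):

```python
def compute_ir_metrics(doc: dict) -> dict:
    return {}  # placeholder for the actual IR-metric computation

def ir_stage(doc: dict) -> dict:
    """Check coverage flags first and emit an explicit non_computable
    marker instead of misleading zeroes."""
    cov = doc["coverage"]
    if cov["trace_source"] is None or not cov["has_ground_truth"]:
        return {"computable": False,
                "reason": cov.get("degraded_reason") or "no ground truth"}
    return {"computable": True, **compute_ir_metrics(doc)}
```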
- The `schema_version` field is a semver-style string (currently `"1.0"`).
- Minor bumps (1.1, 1.2, ...) add optional fields. Consumers of 1.0 data continue to work unchanged.
- Major bumps (2.0) change required fields or remove/rename existing ones. Consumers must update.
- The normalization CLI embeds the schema version it was built against. Metric stages validate `schema_version` on load and reject unknown major versions.
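A load-time check consistent with these rules (the function name is illustrative):

```python
SUPPORTED_MAJOR = 1  # schema major version this consumer was built against

def check_schema_version(schema_version: str) -> None:
    """Accept any minor bump within the supported major; reject the rest."""
    major = int(schema_version.split(".", 1)[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported schema major version: {schema_version}")
```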
Normalized retrieval event files are written to a parallel directory structure that does not overwrite existing run artifacts:
```
runs/{staging|official}/{run_id}/retrieval_events/
  {config_name}/
    {task_name}.retrieval_events.json
```
Run-level aggregates are written alongside:
```
runs/{staging|official}/{run_id}/retrieval_events/
  run_retrieval_summary.json
```
The full evaluation pipeline (`scripts/retrieval_eval_pipeline.py`) runs five
stages on each normalized event document:
Standard information retrieval metrics computed from the ordered list of retrieved files against ground-truth files:
- Precision@K, Recall@K, F1@K (K = 1, 3, 5, 10)
- MRR (Mean Reciprocal Rank)
- nDCG@K (normalized Discounted Cumulative Gain)
- MAP (Mean Average Precision)
- File-level recall (fraction of GT files found anywhere in retrieved list)
- Context efficiency (fraction of retrieved files that are relevant)
- TTFR (time-to-first-relevant file, in seconds and tokens)
Tasks without ground truth are marked `computable: false`.
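For illustration, precision/recall@K and the per-task reciprocal rank (MRR is its mean across tasks) could be computed over the ordered retrieved-file list as follows (a sketch; the pipeline's exact deduplication and tie-breaking rules may differ):

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    """retrieved: ordered, deduplicated file list; relevant: ground-truth
    file set. Precision divides by k by convention."""
    hits = sum(1 for f in retrieved[:k] if f in relevant)
    return hits / k, hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file; 0.0 if none found."""
    for rank, f in enumerate(retrieved, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0
```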
When chunk-level ground truth (line-range annotations) is available:
- Chunk recall = fraction of GT chunks whose file was accessed by the agent.
- Resolution field: `"chunk_level"` or `"file_level_only"`.
- Validity field: `"file_match_only"` (v1 granularity) or `"unsupported"`.
Chunking assumption: in v1, a retrieval event "covers" a ground-truth
chunk if any `target_file` matches the chunk's file path. Sub-file matching
(e.g. exact line-range overlap) requires structured diff data and is deferred
to future schema versions.
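Under the v1 file-match assumption, chunk recall reduces to a file-membership check (the chunk dict shape with a `file` key is a hypothetical example):

```python
def chunk_recall_file_level(chunks: list[dict], accessed_files: set[str]):
    """v1 semantics: a chunk counts as covered when its file was accessed
    at all; line-range overlap is deferred to future schema versions."""
    if not chunks:
        return None  # no chunk ground truth; metric not computable
    covered = sum(1 for c in chunks if c["file"] in accessed_files)
    return covered / len(chunks)
```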
Measures whether retrieved evidence was actually used by the agent:
- Primary cross-task probe: `util_read_overlap_with_relevant_files` = |files_read ∩ relevant_files| / |relevant_files|. Measures whether the agent actually read relevant files (including MCP `read_file` calls, which are normalized to `file_read`).
- Task-dependent write proxy: `util_write_overlap_with_relevant_files_proxy` = |files_written ∩ relevant_files| / |relevant_files|. Useful for some fix-style tasks, but should not be treated as a universal utilization metric.
- Stronger optional write probe: `util_write_overlap_with_expected_edit_files` = |files_written ∩ expected_edit_files| / |expected_edit_files| when `ground_truth.expected_edit_files` is available.
- `util_read_before_write_ratio` = fraction of written files that were read by the agent before being written to. High values indicate deliberate evidence consumption.
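The overlap probes share one formula; a compact sketch (function and argument names are illustrative):

```python
def overlap(used: set[str], reference: set[str]):
    """|used & reference| / |reference|; None when reference is empty."""
    return len(used & reference) / len(reference) if reference else None

def utilization_probes(files_read: set[str], files_written: set[str],
                       relevant: set[str], expected_edit_files=None):
    """Compute the three overlap probes; missing probes come out as None."""
    return {
        "util_read_overlap_with_relevant_files":
            overlap(files_read, relevant),
        "util_write_overlap_with_relevant_files_proxy":
            overlap(files_written, relevant),
        "util_write_overlap_with_expected_edit_files":
            overlap(files_written, expected_edit_files or set()),
    }
```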
Coverage:
- `probe_available: false` only when file-level ground truth is missing.
- Read-overlap probes are still computable for read-only tasks.
- Write-overlap probes may be null for tasks without file writes.
- `expected_edit_probe_available: false` when `expected_edit_files` is not inferable from high-confidence task metadata.
Limitations: These probes measure file-level utilization only. They do not validate whether the content written was semantically correct (that is the verifier's job). Future probes may add symbol-level or API-level checks.
Five taxonomy labels classify retrieval error modes per-task:
| Label | Definition |
|---|---|
| `irrelevant_retrieval` | Files retrieved that are not in ground truth |
| `missed_key_evidence` | Ground-truth files never retrieved |
| `wrong_evidence_used` | Non-GT files the agent wrote to |
| `unused_correct_retrieval` | GT files retrieved but never written to |
| `ambiguity_near_miss` | Retrieved files in the same directory as a GT file |
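A set-based sketch of the five labels (boolean per-task flags here; the real pipeline may record offending file lists instead, and the near-miss check against the GT file's parent directory is an interpretation of the definition above):

```python
import posixpath

def taxonomy_flags(retrieved: set[str], written: set[str], gt: set[str]):
    """Classify retrieval error modes from retrieved/written/GT file sets."""
    gt_dirs = {posixpath.dirname(f) for f in gt}
    return {
        "irrelevant_retrieval": bool(retrieved - gt),
        "missed_key_evidence": bool(gt - retrieved),
        "wrong_evidence_used": bool(written - gt),
        "unused_correct_retrieval": bool((retrieved & gt) - written),
        "ambiguity_near_miss": any(
            posixpath.dirname(f) in gt_dirs for f in retrieved - gt),
    }
```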
Two calibration slice dimensions:
- Candidate set size: `small` (≤5 files), `medium` (6–20), `large` (>20)
- Evidence type: `local` (no MCP tools used) or `mcp` (at least one MCP call)
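Slice assignment is then a small pure function (a sketch with illustrative names):

```python
def calibration_slice(candidate_set_size: int, mcp_events: int) -> dict:
    """Assign the two calibration dimensions described above."""
    if candidate_set_size <= 5:
        size = "small"
    elif candidate_set_size <= 20:
        size = "medium"
    else:
        size = "large"
    return {"candidate_set_size": size,
            "evidence_type": "mcp" if mcp_events > 0 else "local"}
```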
Per-task artifacts (`{task_name}.retrieval_metrics.json`) contain all four
metric stages plus provenance and coverage metadata. Run-level summaries
(`run_retrieval_summary.json`) contain aggregated statistics across all
computable tasks.
This evaluation is standalone and non-ranking in v1:
- Does not modify `result.json`, `task_metrics.json`, or `MANIFEST.json`.
- Does not affect verifier rewards or leaderboard scoring.
- Consumes the same run artifacts as `ir_analysis.py` and `mcp_audit.py`.
- Future versions may feed retrieval metrics into `generate_eval_report.py` as an optional supplementary section.
- Normalizes agent traces into step-level retrieval events.
- Computes file-level IR metrics, chunk-level metrics (with fallback), utilization probes, and error taxonomy.
- Correlates retrieval metrics with task outcomes (association only).
- Generates matched task comparisons between baseline and MCP configs.
- Produces standalone human-readable reports.
- Does not change verifier rewards, leaderboard scoring, or MANIFEST.json.
- Does not block or gate benchmark runs on retrieval quality.
- Does not modify existing evaluation pipeline outputs.
- Does not claim causal relationships between retrieval and outcomes.
Matched task comparisons require:
- Same task executed in both baseline and MCP configs.
- Same model and harness version across paired configs.
- `result.json` present with a valid reward for both configs.
- At least 3 matched tasks for aggregate statistics.
Unmatched tasks (present in one config but not the other) are excluded from matched comparisons but included in per-config aggregates.
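The pairing rules can be sketched as follows (the input shape is an assumption: per-config dicts mapping task name to reward, with None for missing or invalid rewards):

```python
def match_tasks(baseline: dict, mcp: dict):
    """Pair tasks present in both configs with valid rewards; also return
    whether the >= 3 matched-task threshold for aggregates is met.
    Unmatched tasks are simply left out of the pairing."""
    matched = {
        task: (baseline[task], mcp[task])
        for task in baseline.keys() & mcp.keys()
        if baseline[task] is not None and mcp[task] is not None
    }
    return matched, len(matched) >= 3
```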
- Tasks without file-level ground truth (Org discovery tasks, write-only tasks) are excluded from IR metrics.
- Tasks in degraded mode (no trajectory or transcript) emit empty events and are flagged in coverage metadata.
- Chunk-level metrics operate at file-match granularity in v1.
The following touchpoints exist for optional future integration. None of these should be implemented without explicit policy discussion.
- Optional Layer 5: Add retrieval evaluation as an optional post-run analysis layer alongside the existing 4-layer pipeline.
- Retrieval metrics could appear as supplementary columns in the eval report tables without affecting the primary scoring dimensions.
- Retrieval-aware composite scores: A future version could define a weighted composite that includes retrieval quality alongside verifier reward. This would require consensus on weight calibration and must not change existing per-task reward semantics.
- Confidence gating: Tasks with low retrieval coverage could receive confidence flags that downstream consumers use for filtering but not score modification.
- Oracle coverage integration: Org task oracle items could be mapped to retrieval events for oracle-aware retrieval scoring.
- Deep Search effectiveness: The `deep_search` tool category enables future analysis of Deep Search ROI versus keyword/NLS search.
- Retrieval-conditioned rankings: Future leaderboard views could show rankings conditioned on retrieval quality tiers (e.g. "among tasks where the agent retrieved ≥50% of ground truth files"). This would be supplementary, not replacing the primary ranking.
- Supplementary tables: A future version of the report generator could optionally include retrieval quality tables and correlation summaries from the retrieval pipeline output.
- `schemas/retrieval_events_schema.json` — JSON Schema definition
- `docs/EVALUATION_PIPELINE.md` — primary evaluation pipeline
- `docs/SCORING_SEMANTICS.md` — reward interpretation
- `docs/ORG_TASKS.md` — Org task system