CodeScaleBench uses a multi-layer evaluation pipeline: deterministic verifiers run first (every task), then an optional LLM judge adds qualitative scoring, and statistical modules provide confidence intervals and correlation analysis.
This document covers the pipeline architecture. For per-benchmark scoring details, see SCORING_SEMANTICS.md. For Org oracle checks, see MCP_UNIQUE_TASKS.md. For the full retrieval/IR evaluation pipeline (normalized retrieval events, file/chunk IR metrics, utilization probes, taxonomy, and emitted artifacts), see RETRIEVAL_EVAL_SPEC.md.
For canonical-task policy, read docs/reference/CANONICAL_EVALUATION_POLICY.md alongside this pipeline document.
Harbor run output (result.json, transcript)
│
▼
┌──────────────────────────────┐
│ Layer 1: Deterministic │ Always runs. Exit-code + reward.txt.
│ Verifier (test.sh / eval.sh)│ Per-task, in-container.
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Layer 2: LLM Judge │ Optional. Multi-round voting, 5 dimensions.
│ (scripts/run_judge.py) │ Post-run, host-side. Needs API key.
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Layer 3: Statistical │ Bootstrap CIs, paired delta tests,
│ Analysis │ Spearman correlation, retrieval metrics.
│ (scripts/csb_metrics/) │ Post-run, deterministic.
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Layer 4: Report Generator │ Aggregates all layers into REPORT.md
│ (generate_eval_report.py) │ and eval_report.json.
└──────────────────────────────┘
Every task ships a tests/test.sh (SDLC tasks) or tests/eval.sh (Org
tasks) that runs inside the Docker container after the agent finishes. The
verifier writes a reward (0.0–1.0) to /logs/verifier/reward.txt. Canonical
tasks should also emit /logs/verifier/validation_result.json using the schema
in docs/reference/VALIDATION_RESULT_SCHEMA.md
so downstream reporting can preserve scorer family, pass semantics, and failure
context.
This is the core hybrid-policy rule: deterministic verifier reward is
universal, but the agent-facing output contract is family-specific. Some tasks
score repo state directly, some natively score answer.json, and some use
artifact-oriented bridge variants that still feed the same verifier semantics.
Verifier types are documented in SCORING_SEMANTICS.md.
Set DEBUG_MODE=true to capture full diagnostics before verification runs:
DEBUG_MODE=true ./configs/fix_2config.sh --task my-task-001Debug output goes to /logs/verifier/debug/:
environment.txt— filtered env (secrets redacted)workspace_git_status.txt/workspace_git_diff.txtworkspace_file_tree.txt
Tasks can include tests/fixtures/ directories with known-score inputs to
validate that the verifier itself is correct:
tests/fixtures/
├── metadata.json # Expected score ranges + verifier type
├── perfect_input/ # Golden input → expected score ≥ 0.9
└── empty_input/ # Empty input → expected score ≤ 0.05
Run fixture self-tests:
python3 scripts/validate_tasks_preflight.py --fixture-tests
python3 scripts/validate_tasks_preflight.py --smoke-runtime # includes fixtures + idempotencyIdempotency checks run the verifier twice on the same input and assert scores match within epsilon (0.001). Non-idempotent verifiers are logged as warnings.
The judge system provides qualitative scoring across five dimensions, using multi-round voting for reliability. It runs post-hoc on completed runs — it does not affect the deterministic verifier score.
| Module | Path | Purpose |
|---|---|---|
| Data models | scripts/csb_metrics/judge/models.py |
JudgeInput, JudgeResult, OracleBundle dataclasses |
| Prompt templates | scripts/csb_metrics/judge/prompts.py |
Reference correctness, completeness, and direct review prompts |
| API backend | scripts/csb_metrics/judge/backends.py |
Anthropic API wrapper with exponential backoff (3 retries) |
| Engine | scripts/csb_metrics/judge/engine.py |
Core LLMJudge class: prompt selection, dimension scoring, multi-round voting |
| Agreement metrics | scripts/csb_metrics/judge/agreement.py |
Cohen's kappa, Fleiss' kappa, Landis-Koch interpretation |
| Oracle discovery | scripts/csb_metrics/judge/oracle.py |
Auto-loads ground truth from task data (6-source priority chain) |
| Dimension | Weight | Description |
|---|---|---|
| Correctness | 0.30 | Does the output match ground truth? |
| Completeness | 0.25 | Are all required elements present? |
| Code quality | 0.20 | Is the code well-structured and idiomatic? |
| Retrieval quality | 0.15 | Did the agent find the right context? |
| Efficiency | 0.10 | Was the approach reasonably efficient? |
The judge score is a weighted average of dimension scores. Each dimension is scored on a 3-point scale: 1.0 (pass), 0.5 (partial), 0.0 (fail).
The judge auto-loads ground truth via a 6-source priority chain:
tests/ground_truth.json— structured ground truth (high confidence)tests/expected_defects.json— code review defect lists (high confidence)tests/expected_changes.json— expected file changes (high confidence)solution/solve.sh— patch extraction (medium confidence)instruction.md— keyword extraction (medium confidence)configs/ground_truth_files.json— fallback registry (low confidence)
Confidence level flows into the judge result for downstream filtering.
# Score all tasks in a run
python3 scripts/run_judge.py --run runs/official/my_run/
# Score a specific suite with 3-round voting
python3 scripts/run_judge.py --run runs/official/my_run/ --suite csb_sdlc_fix --ensemble
# Dry run — show tasks and oracle confidence without calling API
python3 scripts/run_judge.py --run runs/official/my_run/ --dry-run
# Force re-score tasks that already have judge results
python3 scripts/run_judge.py --run runs/official/my_run/ --forceOutput: judge_result.json written alongside each task's result.json.
Org tasks with tests/criteria.json support hybrid evaluation:
composite = 0.6 * verifier_reward + 0.4 * rubric_score. Enable with
--hybrid flag on run_judge.py.
from csb_metrics.statistics import bootstrap_ci, paired_bootstrap_delta
# Single-sample CI
mean, ci_lower, ci_upper = bootstrap_ci(scores, n_bootstrap=1000, ci=0.95)
# Paired config comparison (baseline vs MCP-Full)
delta, ci_lower, ci_upper, p_value = paired_bootstrap_delta(
baseline_scores, mcp_scores, n_bootstrap=1000, ci=0.95
)All functions are stdlib-only (uses random.choices) with random.seed(42) for
reproducibility.
Spearman rank correlation between IR metrics and task rewards, stratified by suite:
python3 scripts/ir_analysis.py --correlate --min-confidence mediumCode review tasks in csb_sdlc_test support structured defect annotations in
expected_defects.json:
{
"defects": [{
"description": "Null pointer dereference on empty input",
"defect_type": "null-deref",
"line_start": 42,
"line_end": 45
}]
}Supported types: null-deref, resource-leak, race-condition, injection,
logic-error, buffer-overflow, use-after-free, other.
When judge results are available, the evaluation report and MANIFEST include both scores:
Each task entry gains optional fields:
judge_score(float) — weighted judge scorejudge_model(string) — model used for judgingjudge_dimensions(dict) — per-dimension scoresjudge_confidence(float) — multi-round voting confidence
Tasks without judge data have these fields set to null.
The eval report conditionally adds judge columns:
task_id | verifier_reward | judge_score | delta | oracle_confidence
-----------------+-----------------+-------------+--------+------------------
ccx-dep-trace-001| 0.85 | 0.90 | +0.05 | high
my-fix-task-002 | 1.00 | 0.75 | -0.25 | medium [DIVERGENT]
Tasks where abs(verifier_reward - judge_score) > 0.3 are flagged [DIVERGENT]
for manual review.
For canonical deterministic reporting, treat continuous reward and pass/fail as
distinct dimensions. Report generators should use verifier passed /
pass_threshold metadata when available and surface scorer_family plus
output_contract so mixed-family reward aggregates are explicitly caveated.
# Full evaluation report (includes judge data if available)
python3 scripts/generate_eval_report.py \
--runs-dir runs/official/ \
--output-dir ./eval_reports/
# Generate LLM judge context files for manual review
python3 -m scripts.csb_metrics.judge_context \
--runs-dir runs/official/ \
--benchmarks-dir ./benchmarks/ \
--output-dir ./judge_contexts/Output:
eval_report.json— full structured reportREPORT.md— markdown tables (performance, efficiency, tool utilization, judge correlation)harness_configs.json— exact harness configuration per run- CSV files per table for downstream analysis