This document defines the stable evaluation policy for the canonical CodeScaleBench task set.
Use this document when you need to answer four questions precisely:
- what every canonical task must do
- what is allowed to vary by verifier family
- how artifact-oriented task variants relate to deterministic verification
- how reporting should interpret reward versus pass/fail
These rules apply to every canonical task, regardless of suite or verifier family:
- Every task has a deterministic verifier.
- Every deterministic verifier writes `/logs/verifier/reward.txt`.
- Canonical verifiers should also write `/logs/verifier/validation_result.json`.
  `validation_result.json` is the semantic verifier contract; `reward.txt` is
  the scalar compatibility artifact.
- Reporting must preserve continuous `reward` separately from pass semantics.
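Under these invariants, a verifier's final step can be sketched as below. The directory argument stands in for `/logs/verifier`, and the sidecar fields are the ones this document names; the `status` vocabulary shown is an assumption, and the authoritative schema is `docs/reference/VALIDATION_RESULT_SCHEMA.md`.

```python
import json
from pathlib import Path


def write_verifier_artifacts(out_dir, *, reward, passed, scorer_family,
                             output_contract, pass_threshold):
    """Write both canonical verifier artifacts into out_dir.

    out_dir stands in for /logs/verifier in a real run. Field names follow
    the sidecar contract described in this document; consult
    VALIDATION_RESULT_SCHEMA.md for the authoritative schema.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # reward.txt: the scalar compatibility artifact -- continuous reward only.
    (out / "reward.txt").write_text(f"{reward}\n")
    # validation_result.json: the semantic contract -- pass/fail lives here.
    sidecar = {
        "status": "completed",  # status vocabulary is an assumption here
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": pass_threshold,
        "passed": passed,
        "output_contract": output_contract,
    }
    (out / "validation_result.json").write_text(json.dumps(sidecar, indent=2))
    return sidecar
```

Note that `reward.txt` carries only the scalar; everything that gives the scalar meaning lives in the sidecar.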
The deterministic verifier is the authoritative benchmark outcome producer. Artifact-oriented flows do not replace it; they give the verifier a structured or family-specific input surface.
Canonical tasks intentionally use a hybrid output model. The benchmark does not require one universal agent artifact format.
Supported output-contract patterns include:
- `answer_json_native`: the verifier directly scores a structured
  `/workspace/answer.json` contract
- `answer_json_bridge`: an artifact-oriented image or wrapper maps structured
  agent output into an existing deterministic verifier flow
- `repo_state`: the verifier scores repository state and tests, with no
  required structured artifact
- other family-specific contracts such as `solution_json` or `report_markdown`
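One way to make the hybrid model concrete is a small resolver that maps a contract name to the artifact the verifier expects, if any. The mapping below restates the patterns above as a sketch; the paths for `solution_json` and `report_markdown` are illustrative assumptions, since real contract-to-path mappings live with each verifier family.

```python
def required_artifact(output_contract):
    """Return the agent-facing artifact path a contract implies, or None.

    Sketch only: the answer_json_* contracts read /workspace/answer.json
    (natively or via a bridge), repo_state needs no structured artifact,
    and other families declare their own paths (illustrative here).
    """
    contract_paths = {
        "answer_json_native": "/workspace/answer.json",
        "answer_json_bridge": "/workspace/answer.json",  # bridged into the existing verifier flow
        "repo_state": None,  # repository state and tests are scored directly
        "solution_json": "/workspace/solution.json",     # illustrative path
        "report_markdown": "/workspace/report.md",       # illustrative path
    }
    if output_contract not in contract_paths:
        raise ValueError(f"unknown output contract: {output_contract}")
    return contract_paths[output_contract]
```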
Implications:
- Deterministic verification is universal.
- Artifact support is family-specific. `answer.json` is common, but it is not
  universal benchmark policy.
- Presence of `Dockerfile.artifact_only` does not imply the same verifier
  family or the same artifact semantics across tasks.
The maintained snapshot of current canonical coverage lives in
`configs/canonical_evaluation_audit.json`. Use that audit to answer
family-level questions such as which suites are `answer_json_native`,
`answer_json_bridge`, or still migrating to `validation_result.json`.
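A family-level query against that audit might look like the sketch below. The audit's exact schema is an assumption here (a mapping of suite name to per-suite info with an `output_contract` field), so adjust the accessors to the real file.

```python
import json
from pathlib import Path


def load_audit(path="configs/canonical_evaluation_audit.json"):
    """Load the canonical coverage audit from the repo."""
    return json.loads(Path(path).read_text())


def suites_with_contract(audit, contract):
    """List suites in the audit whose output contract matches.

    Assumes audit maps suite name -> {"output_contract": ...}; the real
    schema of configs/canonical_evaluation_audit.json may differ.
    """
    return sorted(
        suite for suite, info in audit.items()
        if info.get("output_contract") == contract
    )
```

For example, `suites_with_contract(load_audit(), "answer_json_bridge")` would answer "which suites are still bridged" under the assumed schema.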
Canonical verifiers should publish semantics through
`/logs/verifier/validation_result.json` using
`docs/reference/VALIDATION_RESULT_SCHEMA.md`.
That sidecar is where verifiers declare:
- `status` and `scorable`
- `scorer_family`
- `reward`
- `pass_threshold`
- `passed`
- `output_contract`
- `sub_scores`
- structured failure context
Downstream consumers should treat `passed` as the authoritative solved/pass
flag. They should not recompute solved status from `reward > 0`.
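In consumer code, that rule reduces to reading `passed` (and `scorable`) from the sidecar and never deriving solved status from the reward:

```python
def is_solved(validation_result):
    """Authoritative solved flag from a validation_result.json payload.

    Reads `passed` directly; never recomputes it from reward > 0, since
    e.g. a checklist-family task can earn partial reward without passing.
    """
    if not validation_result.get("scorable", False):
        return False
    return bool(validation_result.get("passed", False))
```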
Reporting and export code must keep these concepts separate:
- `reward`: continuous scalar produced by the deterministic verifier
- `passed`: authoritative pass/fail flag from verifier semantics
- `pass_threshold`: task or family policy threshold
- `scorer_family`: family that gives meaning to the reward
- `output_contract`: verifier-facing output mode
Mean reward is still useful, but mixed-family aggregates require caveats. A
reward of 0.7 from `test_ratio`, `oracle_checks`, and `checklist` should not
be treated as silently calibrated equivalents.
Operationally:
- use `passed`/`status` for pass-rate tables when available
- use `reward` for continuous-score summaries
- surface `scorer_family` and `output_contract` in reports and exports
- caveat or partition mixed-family reward aggregates
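The operational rules above can be sketched as a small summary routine that partitions reward by `scorer_family` and computes pass rate from `passed` alone. Field names mirror the sidecar contract; the flat record shape is an assumption for this sketch.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Pass rate from `passed`; reward means partitioned by scorer_family.

    Each record is assumed to carry reward, passed, and scorer_family,
    mirroring the validation_result.json sidecar.
    """
    rewards_by_family = defaultdict(list)
    passed_flags = []
    for r in results:
        rewards_by_family[r["scorer_family"]].append(r["reward"])
        passed_flags.append(bool(r["passed"]))
    return {
        "pass_rate": sum(passed_flags) / len(passed_flags),
        # Partitioned, not pooled: a 0.7 means different things per family.
        "mean_reward_by_family": {
            fam: mean(vals) for fam, vals in rewards_by_family.items()
        },
    }
```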
Preflight checks, smoke runs, and launch docs should assume:
- the deterministic verifier always exists
- required artifacts come from the task's published output contract
- missing required artifacts are invalid-output conditions, not ordinary benchmark misses
- artifact-oriented image variants must preserve the same verifier semantics, even when the agent-facing output path differs by family
The benchmark should therefore validate artifact expectations from task
metadata and the verifier contract, not from a blanket assumption that every
task must produce `/workspace/answer.json`.
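A preflight check consistent with this policy looks up required artifacts from task metadata rather than hardcoding `answer.json`, and classifies a missing required artifact as an invalid-output condition. The `required_artifacts` metadata key is a hypothetical name for this sketch; the real source of truth is the task's published output contract.

```python
def preflight_artifact_check(task_meta, present_paths):
    """Classify missing required artifacts as invalid output, not a miss.

    `required_artifacts` is a hypothetical metadata key for this sketch;
    a repo_state task would simply publish an empty list here.
    """
    required = task_meta.get("required_artifacts", [])
    missing = [p for p in required if p not in present_paths]
    if missing:
        # Invalid-output condition: the contract was not satisfied at all,
        # which is distinct from an ordinary low-reward benchmark miss.
        return {"status": "invalid_output", "missing": missing}
    return {"status": "ok", "missing": []}
```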