Commit a227d2e

Define canonical validation_result contract
1 parent 1c19f51 commit a227d2e

11 files changed (+2499, -29 lines)

configs/canonical_evaluation_audit.json

Lines changed: 1963 additions & 1 deletion
Large diffs are not rendered by default.

docs/AGENT_INTERFACE.md

Lines changed: 4 additions & 1 deletion
```diff
@@ -46,7 +46,10 @@ The test script (`tests/test.sh`) is uploaded by Harbor to `/tests/` in the container
 
 1. Runs inside the container after the agent completes
 2. Writes a reward to `/logs/verifier/reward.txt` as a plain decimal float (0.0-1.0)
-3. Always exits 0 (Harbor reads the reward value, not the exit code)
+3. For canonical tasks, should also write `/logs/verifier/validation_result.json`
+   following `docs/reference/VALIDATION_RESULT_SCHEMA.md`
+4. May use non-zero exit codes to distinguish scored failure from verifier/runtime failure;
+   Harbor still reads the scalar reward artifact
 
 ### Result Format
```

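The verifier contract in this hunk (scalar `reward.txt` plus a `validation_result.json` sidecar) can be sketched in Python. This is a minimal sketch, not the repository's actual verifier code: the output directory is parameterized so it runs outside a task container (real verifiers write to `/logs/verifier`), and the payload values are illustrative.

```python
import json
import os
import tempfile

def write_verifier_artifacts(out_dir, reward, payload):
    """Write the scalar compatibility artifact plus the structured sidecar."""
    os.makedirs(out_dir, exist_ok=True)
    # reward.txt: plain decimal float in [0.0, 1.0] (step 2 above).
    with open(os.path.join(out_dir, "reward.txt"), "w") as f:
        f.write(f"{reward:.4f}")
    # validation_result.json: canonical sidecar (step 3 above).
    with open(os.path.join(out_dir, "validation_result.json"), "w") as f:
        json.dump(payload, f, indent=2)

# Illustrative values; a real verifier targets /logs/verifier instead.
out_dir = tempfile.mkdtemp()
write_verifier_artifacts(out_dir, 0.75, {
    "schema_version": "validation_result.v1alpha1",
    "status": "scored",
    "scorable": True,
    "scorer_family": "test_ratio",
    "reward": 0.75,
    "pass_threshold": 0.0,
    "passed": True,
    "output_contract": {"mode": "repo_state", "primary_path": None,
                        "required_artifact": False},
    "sub_scores": {},
    "failure": None,
})
```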
docs/EVALUATION_PIPELINE.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -50,8 +50,11 @@ Harbor run output (result.json, transcript)
 
 Every task ships a `tests/test.sh` (SDLC tasks) or `tests/eval.sh` (Org
 tasks) that runs inside the Docker container after the agent finishes. The
-verifier writes a reward (0.0–1.0) to `/logs/verifier/reward.txt` and exits 0
-on success, non-zero on failure.
+verifier writes a reward (0.0–1.0) to `/logs/verifier/reward.txt`. Canonical
+tasks should also emit `/logs/verifier/validation_result.json` using the schema
+in [docs/reference/VALIDATION_RESULT_SCHEMA.md](reference/VALIDATION_RESULT_SCHEMA.md)
+so downstream reporting can preserve scorer family, pass semantics, and failure
+context.
 
 Verifier types are documented in [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md).
```

docs/REPORT_CONTEXT.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -135,6 +135,9 @@ The evaluation uses a multi-layer pipeline:
 1. **Deterministic verifier** (every task): Task-specific `test.sh` or
    `eval.sh` runs inside the Docker container after the agent finishes.
    Produces a reward score (0.0--1.0) written to `/logs/verifier/reward.txt`.
+   Canonical tasks are converging on a paired
+   `/logs/verifier/validation_result.json` sidecar so scorer family,
+   pass/fail semantics, sub-scores, and invalid-output context are preserved.
 
 2. **Optional LLM judge**: Post-hoc qualitative scoring across five
    dimensions (correctness 0.30, completeness 0.25, code quality 0.20,
```

docs/SCORING_SEMANTICS.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -14,6 +14,11 @@ How each benchmark is scored, what the numbers mean, and known limitations.
 | **ordering** | 0.0–1.0 continuous | Position-exact-match blended with rank correlation |
 | **external** | 0.0–1.0 continuous | External verifier (e.g., TheAgentCompany eval) |
 
+Canonical tasks should normalize these families into
+`/logs/verifier/validation_result.json` using
+`docs/reference/VALIDATION_RESULT_SCHEMA.md`. The reward type determines the
+meaning of `reward` and `sub_scores`, but not the top-level contract.
+
 ## Per-Verifier Scoring (Active Suites)
 
 Tasks are organized into 8 SDLC-phase suites (`csb_sdlc_understand` through `csb_sdlc_debug`)
```

docs/reference/README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@ Stable specifications and policy/reference documents.
 ## Evaluation / Scoring
 - `docs/SCORING_SEMANTICS.md`
 - `docs/EVALUATION_PIPELINE.md`
+- `docs/reference/VALIDATION_RESULT_SCHEMA.md`
 - `docs/RETRIEVAL_EVAL_SPEC.md`
 - `docs/ORG_CALIBRATION.md`
```

docs/reference/TASK_CONTRACT.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -70,10 +70,17 @@ Missing required output is an invalid run when the verifier cannot establish
 that the agent actually produced a scorable answer. It should not silently look
 like a clean `reward=0.0` benchmark miss.
 
+Canonical tasks should treat `reward.txt` as the scalar compatibility artifact
+and `validation_result.json` as the semantic verifier contract. The JSON
+sidecar is where verifiers should record scorer family, pass semantics,
+sub-scores, and failure context.
+
 At minimum, verifiers should:
 
 - emit a clear error for missing required output
 - write reward artifacts only after classifying whether the run was scorable
+- for canonical tasks, also write `/logs/verifier/validation_result.json`
+  using `docs/reference/VALIDATION_RESULT_SCHEMA.md`
 - avoid hardcoded assumptions like `cd /app` unless that is the published task contract
 
 ## Image Variant Parity
```
docs/reference/VALIDATION_RESULT_SCHEMA.md

Lines changed: 151 additions & 0 deletions (new file)

# Validation Result Schema

Canonical tasks should converge on a structured verifier sidecar at
`/logs/verifier/validation_result.json` in addition to the required
`/logs/verifier/reward.txt`.

This schema standardizes verifier semantics across scalar-only shell verifiers,
answer.json artifact verifiers, repo-state verifiers, and oracle-based promoted
tasks. It is intentionally simple enough to emit from shell or Python.

## Required Top-Level Fields

Every canonical `validation_result.json` should emit these keys, even when the
verifier fails or the run is invalid:

| Field | Type | Meaning |
|-------|------|---------|
| `schema_version` | string | Schema identifier. Current proposal: `validation_result.v1alpha1` |
| `status` | string | One of `scored`, `invalid_output`, `verifier_error` |
| `scorable` | boolean | `true` when the verifier had enough task output to score the run |
| `scorer_family` | string | Normalized verifier family (`oracle_checks`, `test_ratio`, `f1_hybrid`, etc.) |
| `reward` | number | Canonical scalar reward in `[0.0, 1.0]` |
| `pass_threshold` | number | Family/task policy threshold associated with `passed` |
| `passed` | boolean | Authoritative pass/fail flag for the task outcome |
| `output_contract` | object | Published verifier-facing output mode and primary artifact path |
| `sub_scores` | object | Per-check or per-component scores. Use `{}` when the family is scalar-only |
| `failure` | object or `null` | Structured failure/error context. `null` for scored runs |

Downstream consumers should treat `passed` as authoritative. `pass_threshold`
is included so reporting can preserve task policy, but parsers should not
recompute `passed` from `reward` alone.
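A downstream parser can enforce the required-keys table with a small check. This is a sketch under stated assumptions: `check_top_level` is a hypothetical helper name, and the key list simply mirrors the table above.

```python
# Required top-level keys and their expected Python types, per the table
# above. `failure` is handled separately because it may be null.
REQUIRED_KEYS = {
    "schema_version": str,
    "status": str,
    "scorable": bool,
    "scorer_family": str,
    "reward": (int, float),
    "pass_threshold": (int, float),
    "passed": bool,
    "output_contract": dict,
    "sub_scores": dict,
}

def check_top_level(payload):
    """Return a list of contract violations (empty when compliant)."""
    problems = []
    for key, typ in REQUIRED_KEYS.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], typ):
            problems.append(f"wrong type for {key}")
    # `failure` is required but may be null (object or null).
    if "failure" not in payload:
        problems.append("missing key: failure")
    elif payload["failure"] is not None and not isinstance(payload["failure"], dict):
        problems.append("wrong type for failure")
    if payload.get("status") not in ("scored", "invalid_output", "verifier_error"):
        problems.append("unknown status")
    return problems
```

Note the check deliberately does not recompute `passed` from `reward` and `pass_threshold`; it only validates shape, leaving `passed` authoritative.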
## Required `output_contract` Fields

`output_contract` should always contain:

| Field | Type | Meaning |
|-------|------|---------|
| `mode` | string | One of `answer_json_native`, `answer_json_bridge`, `repo_state`, `solution_json`, `report_markdown`, `unspecified` |
| `primary_path` | string or `null` | Primary artifact path the verifier expected, if any |
| `required_artifact` | boolean | Whether a missing primary artifact makes the run unscorable |

## Failure Object

When `status != "scored"`, `failure` should be populated with:

| Field | Type | Meaning |
|-------|------|---------|
| `code` | string | Stable machine-readable error code (`missing_required_output`, `invalid_json`, `verifier_exception`, etc.) |
| `message` | string | Human-readable summary |
| `stage` | string | Usually `output_validation`, `scoring`, or `verifier_runtime` |

`reward` should still be written as `0.0` so existing Harbor/reporting flows do
not break while migration is in progress.
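One way to guarantee a populated `failure` object when the verifier itself crashes is to wrap the family-specific scoring step, as in this sketch. The `score_run` hook and the fallback values (`scorer_family`, the `unspecified` contract) are placeholders, not part of the schema.

```python
def run_verifier(score_run):
    """Call a family-specific scoring hook; on any exception, degrade to a
    canonical verifier_error payload with reward 0.0 so legacy readers of
    the scalar reward keep working."""
    try:
        return score_run()
    except Exception as exc:  # verifier bug or runtime failure
        return {
            "schema_version": "validation_result.v1alpha1",
            "status": "verifier_error",
            "scorable": False,
            "scorer_family": "unspecified",  # placeholder fallback
            "reward": 0.0,
            "pass_threshold": 0.0,
            "passed": False,
            "output_contract": {"mode": "unspecified", "primary_path": None,
                                "required_artifact": False},
            "sub_scores": {},
            "failure": {"code": "verifier_exception", "message": str(exc),
                        "stage": "verifier_runtime"},
        }
```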
## Optional Fields

These fields are recommended when available:

- `details`: family-specific raw verifier data or diagnostics
- `artifacts`: paths to helper files emitted by the verifier
- `timing`: verifier-local timing if a family captures it
- `legacy`: compatibility payloads preserved for existing readers

## Family Mapping

Current canonical verifier families should map into the schema as follows:

| Family | Canonical `reward` | Recommended `sub_scores` |
|--------|--------------------|--------------------------|
| `oracle_checks` | suite-weighted composite | one entry per oracle check (`file_set_match`, `symbol_resolution`, `dependency_chain`, `keyword_presence`, `provenance`, `json_schema_match`, `test_ratio`) |
| `semantic_retrieval_qa` | primary QA correctness score | `correct_function`, `correct_path`, `justification_score` |
| `f1_hybrid` | blended detection/fix score | `detection_f1`, `fix_score` |
| `f1` | F1 score | `precision`, `recall`, `f1` |
| `test_ratio` | pass ratio | `tests_passed_ratio` plus counts in `details` |
| `diff_similarity` | diff similarity composite | `file_recall`, `line_recall`, `line_precision` |
| `semantic_similarity` | similarity score | `similarity` |
| `checklist` | weighted checklist score | stable check ids under `sub_scores.checks.*` |
| `ir_checklist` | blended retrieval/checklist score | retrieval-oriented check ids under `sub_scores.checks.*` |
| `find_and_prove` | composite regression-proof score | stable assertion ids under `sub_scores.checks.*` |
| `repo_state_heuristic` | repo-state score | stable assertion ids under `sub_scores.checks.*` |
| `continuous` | family-defined continuous score | `continuous_score` or family-specific metric key |
| `binary` | `0.0` or `1.0` | `binary_pass` |

The family determines the expected `sub_scores` shape, but not the top-level
contract. The top-level keys stay fixed across all families.
contract. The top-level keys stay fixed across all families.

## Migration Rules

Normalize current canonical tasks using these rules:

1. Existing `validation_result.json` payloads:
   Preserve legacy fields if needed, but add the canonical required keys.
2. Existing `reward.json` payloads:
   Wrap the scalar score into the canonical schema and keep the old file only as
   a temporary compatibility artifact.
3. `reward.txt`-only tasks:
   Keep `reward.txt`, add `validation_result.json`, and emit `sub_scores={}` if
   the family has no natural breakdown today.
4. Missing required task output:
   Emit `status="invalid_output"`, `scorable=false`, `reward=0.0`,
   `passed=false`, and a populated `failure` object.
5. Verifier exceptions:
   Emit `status="verifier_error"`, `scorable=false`, `reward=0.0`,
`passed=false`, and a populated `failure` object.
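Migration rule 3 can be sketched as a small wrapper that reads the legacy scalar and emits the canonical sidecar. Paths are parameterized for illustration (real tasks use `/logs/verifier/...`), and the pass policy shown is a placeholder rather than any task's published threshold.

```python
import json
import os
import tempfile

def wrap_legacy_reward(reward_txt_path, sidecar_path, scorer_family="continuous"):
    """Migration rule 3: keep reward.txt, add validation_result.json with
    sub_scores={} when the family has no natural breakdown."""
    reward = float(open(reward_txt_path).read().strip())
    payload = {
        "schema_version": "validation_result.v1alpha1",
        "status": "scored",
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": 0.0,
        "passed": reward > 0.0,  # placeholder policy, not task policy
        "output_contract": {"mode": "unspecified", "primary_path": None,
                            "required_artifact": False},
        "sub_scores": {},
        "failure": None,
    }
    with open(sidecar_path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload

# Demo against a throwaway reward.txt.
work = tempfile.mkdtemp()
reward_txt = os.path.join(work, "reward.txt")
with open(reward_txt, "w") as f:
    f.write("0.5\n")
payload = wrap_legacy_reward(reward_txt, os.path.join(work, "validation_result.json"))
```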
## Minimal Scored Example

```json
{
  "schema_version": "validation_result.v1alpha1",
  "status": "scored",
  "scorable": true,
  "scorer_family": "test_ratio",
  "reward": 0.75,
  "pass_threshold": 0.0,
  "passed": true,
  "output_contract": {
    "mode": "repo_state",
    "primary_path": null,
    "required_artifact": false
  },
  "sub_scores": {},
  "failure": null
}
```

## Minimal Invalid-Output Example

```json
{
  "schema_version": "validation_result.v1alpha1",
  "status": "invalid_output",
  "scorable": false,
  "scorer_family": "oracle_checks",
  "reward": 0.0,
  "pass_threshold": 0.0,
  "passed": false,
  "output_contract": {
    "mode": "answer_json_native",
    "primary_path": "/workspace/answer.json",
    "required_artifact": true
  },
  "sub_scores": {},
  "failure": {
    "code": "missing_required_output",
    "message": "answer.json not found at /workspace/answer.json",
    "stage": "output_validation"
  }
}
```
