| 1 | +# Validation Result Schema |
| 2 | + |
| 3 | +Canonical tasks should converge on a structured verifier sidecar at |
| 4 | +`/logs/verifier/validation_result.json` in addition to the required |
| 5 | +`/logs/verifier/reward.txt`. |
| 6 | + |
| 7 | +This schema standardizes verifier semantics across scalar-only shell verifiers, |
| 8 | +answer.json artifact verifiers, repo-state verifiers, and oracle-based promoted |
| 9 | +tasks. It is intentionally simple enough to emit from shell or Python. |
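As a rough illustration of how little is needed to emit both files from Python, the sketch below writes the sidecar next to the scalar reward file. The helper name `write_verifier_outputs` is hypothetical, and the single-line plain-text format used for `reward.txt` is an assumption, not something this document specifies.

```python
import json
from pathlib import Path


def write_verifier_outputs(payload: dict, verifier_dir: str = "/logs/verifier") -> None:
    """Write validation_result.json and reward.txt into the verifier directory."""
    out_dir = Path(verifier_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Structured sidecar described by this schema.
    sidecar = out_dir / "validation_result.json"
    sidecar.write_text(json.dumps(payload, indent=2) + "\n")

    # Assumption: reward.txt carries the same scalar as plain text on one line.
    reward_file = out_dir / "reward.txt"
    reward_file.write_text(f"{float(payload['reward'])}\n")
```

Writing both files from the same payload keeps the scalar in `reward.txt` consistent with the `reward` key in the sidecar.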
| 10 | + |
| 11 | +## Required Top-Level Fields |
| 12 | + |
| 13 | +Every canonical `validation_result.json` should contain these keys, even when
| 14 | +the verifier fails or the run is invalid:
| 15 | + |
| 16 | +| Field | Type | Meaning | |
| 17 | +|-------|------|---------| |
| 18 | +| `schema_version` | string | Schema identifier. Current proposal: `validation_result.v1alpha1` | |
| 19 | +| `status` | string | One of `scored`, `invalid_output`, `verifier_error` | |
| 20 | +| `scorable` | boolean | `true` when the verifier had enough task output to score the run | |
| 21 | +| `scorer_family` | string | Normalized verifier family (`oracle_checks`, `test_ratio`, `f1_hybrid`, etc.) | |
| 22 | +| `reward` | number | Canonical scalar reward in `[0.0, 1.0]` | |
| 23 | +| `pass_threshold` | number | Family/task policy threshold associated with `passed` | |
| 24 | +| `passed` | boolean | Authoritative pass/fail flag for the task outcome | |
| 25 | +| `output_contract` | object | Published verifier-facing output mode and primary artifact path | |
| 26 | +| `sub_scores` | object | Per-check or per-component scores. Use `{}` when the family is scalar-only | |
| 27 | +| `failure` | object or `null` | Structured failure/error context. `null` for scored runs | |
| 28 | + |
| 29 | +Downstream consumers should treat `passed` as authoritative. `pass_threshold` |
| 30 | +is included so reporting can preserve task policy, but parsers should not |
| 31 | +recompute `passed` from `reward` alone. |
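To make the point about `passed` concrete, here is a minimal consumer sketch. Both function names are hypothetical. The sample payload reuses values from the invalid-output shape in this document, where `reward` and `pass_threshold` are both `0.0` yet `passed` is `false`.

```python
def run_passed(payload: dict) -> bool:
    """Preferred: trust the verifier's own verdict."""
    return bool(payload["passed"])


def naive_passed(payload: dict) -> bool:
    """Anti-pattern, shown only for contrast: re-derive the verdict from reward."""
    return payload["reward"] >= payload["pass_threshold"]


# Trimmed payload for an unscorable run (only the keys used here).
invalid_run = {
    "status": "invalid_output",
    "reward": 0.0,
    "pass_threshold": 0.0,
    "passed": False,
}
```

With this payload, `run_passed` reports a failure while `naive_passed` would report a pass, which is why the threshold is best treated as reporting metadata.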
| 32 | + |
| 33 | +## Required `output_contract` Fields |
| 34 | + |
| 35 | +`output_contract` should always contain: |
| 36 | + |
| 37 | +| Field | Type | Meaning | |
| 38 | +|-------|------|---------| |
| 39 | +| `mode` | string | One of `answer_json_native`, `answer_json_bridge`, `repo_state`, `solution_json`, `report_markdown`, `unspecified` | |
| 40 | +| `primary_path` | string or `null` | Primary artifact path the verifier expected, if any | |
| 41 | +| `required_artifact` | boolean | Whether a missing primary artifact makes the run unscorable | |
| 42 | + |
| 43 | +## Failure Object |
| 44 | + |
| 45 | +When `status != "scored"`, `failure` should be populated with: |
| 46 | + |
| 47 | +| Field | Type | Meaning | |
| 48 | +|-------|------|---------| |
| 49 | +| `code` | string | Stable machine-readable error code (`missing_required_output`, `invalid_json`, `verifier_exception`, etc.) | |
| 50 | +| `message` | string | Human-readable summary | |
| 51 | +| `stage` | string | Usually `output_validation`, `scoring`, or `verifier_runtime` | |
| 52 | + |
| 53 | +In these failure cases, `reward` should still be written as `0.0` so that
| 54 | +existing Harbor/reporting flows do not break while migration is in progress.
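A verifier could centralize its failure path in one helper so that every non-scored run carries the same shape. The sketch below is one possible way to do that; the function name `build_failure_result` and its parameter list are illustrative assumptions, while the keys and fixed values follow the tables in this document.

```python
def build_failure_result(
    status: str,
    scorer_family: str,
    pass_threshold: float,
    output_contract: dict,
    code: str,
    message: str,
    stage: str,
) -> dict:
    """Build a full payload for a run that could not be scored."""
    if status not in ("invalid_output", "verifier_error"):
        raise ValueError(f"not a failure status: {status!r}")
    return {
        "schema_version": "validation_result.v1alpha1",
        "status": status,
        "scorable": False,
        "scorer_family": scorer_family,
        "reward": 0.0,  # still written so scalar-only readers keep working
        "pass_threshold": pass_threshold,
        "passed": False,
        "output_contract": output_contract,
        "sub_scores": {},
        "failure": {"code": code, "message": message, "stage": stage},
    }
```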
| 55 | + |
| 56 | +## Optional Fields |
| 57 | + |
| 58 | +These fields are recommended when available: |
| 59 | + |
| 60 | +- `details`: family-specific raw verifier data or diagnostics |
| 61 | +- `artifacts`: paths to helper files emitted by the verifier |
| 62 | +- `timing`: verifier-local timing if a family captures it |
| 63 | +- `legacy`: compatibility payloads preserved for existing readers |
| 64 | + |
| 65 | +## Family Mapping |
| 66 | + |
| 67 | +Current canonical verifier families should map into the schema as follows: |
| 68 | + |
| 69 | +| Family | Canonical `reward` | Recommended `sub_scores` | |
| 70 | +|-------|---------------------|--------------------------| |
| 71 | +| `oracle_checks` | suite-weighted composite | one entry per oracle check (`file_set_match`, `symbol_resolution`, `dependency_chain`, `keyword_presence`, `provenance`, `json_schema_match`, `test_ratio`) | |
| 72 | +| `semantic_retrieval_qa` | primary QA correctness score | `correct_function`, `correct_path`, `justification_score` | |
| 73 | +| `f1_hybrid` | blended detection/fix score | `detection_f1`, `fix_score` | |
| 74 | +| `f1` | F1 score | `precision`, `recall`, `f1` | |
| 75 | +| `test_ratio` | pass ratio | `tests_passed_ratio` plus counts in `details` | |
| 76 | +| `diff_similarity` | diff similarity composite | `file_recall`, `line_recall`, `line_precision` | |
| 77 | +| `semantic_similarity` | similarity score | `similarity` | |
| 78 | +| `checklist` | weighted checklist score | stable check ids under `sub_scores.checks.*` | |
| 79 | +| `ir_checklist` | blended retrieval/checklist score | retrieval-oriented check ids under `sub_scores.checks.*` | |
| 80 | +| `find_and_prove` | composite regression-proof score | stable assertion ids under `sub_scores.checks.*` | |
| 81 | +| `repo_state_heuristic` | repo-state score | stable assertion ids under `sub_scores.checks.*` | |
| 82 | +| `continuous` | family-defined continuous score | `continuous_score` or family-specific metric key | |
| 83 | +| `binary` | `0.0` or `1.0` | `binary_pass` | |
| 84 | + |
| 85 | +The family determines the expected `sub_scores` shape, but not the top-level |
| 86 | +contract. The top-level keys stay fixed across all families. |
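Because the top-level keys are fixed, a reader can check them without knowing the family. The following is a minimal sketch of such a check; the constant and function names are hypothetical, and it only tests key presence, not value types.

```python
REQUIRED_TOP_LEVEL = (
    "schema_version", "status", "scorable", "scorer_family", "reward",
    "pass_threshold", "passed", "output_contract", "sub_scores", "failure",
)
REQUIRED_OUTPUT_CONTRACT = ("mode", "primary_path", "required_artifact")


def missing_required_keys(payload: dict) -> list:
    """Return the required keys absent from a payload, in schema order."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in payload]
    contract = payload.get("output_contract")
    if isinstance(contract, dict):
        # Nested keys are reported with a dotted prefix.
        missing += [
            f"output_contract.{key}"
            for key in REQUIRED_OUTPUT_CONTRACT
            if key not in contract
        ]
    return missing
```

An empty list means the payload carries every required key; anything else names what a verifier still needs to emit.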
| 87 | + |
| 88 | +## Migration Rules |
| 89 | + |
| 90 | +Normalize current canonical tasks using these rules: |
| 91 | + |
| 92 | +1. Existing `validation_result.json` payloads: |
| 93 | + Preserve legacy fields if needed, but add the canonical required keys. |
| 94 | +2. Existing `reward.json` payloads: |
| 95 | + Wrap the scalar score into the canonical schema and keep the old file only as |
| 96 | + a temporary compatibility artifact. |
| 97 | +3. `reward.txt`-only tasks: |
| 98 | + Keep `reward.txt`, add `validation_result.json`, and emit `sub_scores={}` if |
| 99 | + the family has no natural breakdown today. |
| 100 | +4. Missing required task output: |
| 101 | + Emit `status="invalid_output"`, `scorable=false`, `reward=0.0`, |
| 102 | + `passed=false`, and a populated `failure` object. |
| 103 | +5. Verifier exceptions: |
| 104 | + Emit `status="verifier_error"`, `scorable=false`, `reward=0.0`, |
| 105 | + `passed=false`, and a populated `failure` object. |
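Rule 3 can be sketched as a small wrapper that turns the text of a scalar-only reward file into a canonical payload. The name `wrap_scalar_reward` is hypothetical, and the sketch assumes the reward text parses as a single float. Since this document does not define how `passed` is derived, the caller supplies it along with the threshold.

```python
def wrap_scalar_reward(
    reward_text: str,
    scorer_family: str,
    pass_threshold: float,
    passed: bool,
    output_contract: dict,
) -> dict:
    """Wrap a scalar reward (e.g. reward.txt contents) in the canonical schema."""
    reward = float(reward_text.strip())
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward outside [0.0, 1.0]: {reward}")
    return {
        "schema_version": "validation_result.v1alpha1",
        "status": "scored",
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": pass_threshold,
        "passed": passed,  # supplied by task policy, not recomputed here
        "output_contract": output_contract,
        "sub_scores": {},  # no natural breakdown for a scalar-only family
        "failure": None,
    }
```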
| 106 | + |
| 107 | +## Minimal Scored Example |
| 108 | + |
| 109 | +```json |
| 110 | +{ |
| 111 | + "schema_version": "validation_result.v1alpha1", |
| 112 | + "status": "scored", |
| 113 | + "scorable": true, |
| 114 | + "scorer_family": "test_ratio", |
| 115 | + "reward": 0.75, |
| 116 | + "pass_threshold": 0.0, |
| 117 | + "passed": true, |
| 118 | + "output_contract": { |
| 119 | + "mode": "repo_state", |
| 120 | + "primary_path": null, |
| 121 | + "required_artifact": false |
| 122 | + }, |
| 123 | + "sub_scores": {}, |
| 124 | + "failure": null |
| 125 | +} |
| 126 | +``` |
| 127 | + |
| 128 | +## Minimal Invalid-Output Example |
| 129 | + |
| 130 | +```json |
| 131 | +{ |
| 132 | + "schema_version": "validation_result.v1alpha1", |
| 133 | + "status": "invalid_output", |
| 134 | + "scorable": false, |
| 135 | + "scorer_family": "oracle_checks", |
| 136 | + "reward": 0.0, |
| 137 | + "pass_threshold": 0.0, |
| 138 | + "passed": false, |
| 139 | + "output_contract": { |
| 140 | + "mode": "answer_json_native", |
| 141 | + "primary_path": "/workspace/answer.json", |
| 142 | + "required_artifact": true |
| 143 | + }, |
| 144 | + "sub_scores": {}, |
| 145 | + "failure": { |
| 146 | + "code": "missing_required_output", |
| 147 | + "message": "answer.json not found at /workspace/answer.json", |
| 148 | + "stage": "output_validation" |
| 149 | + } |
| 150 | +} |
| 151 | +``` |