Commit a227d2e

Define canonical validation_result contract
1 parent 1c19f51 commit a227d2e

11 files changed (+2499, -29 lines)

configs/canonical_evaluation_audit.json

Lines changed: 1963 additions & 1 deletion
Large diffs are not rendered by default.

docs/AGENT_INTERFACE.md

Lines changed: 4 additions & 1 deletion
```diff
@@ -46,7 +46,10 @@ The test script (`tests/test.sh`) is uploaded by Harbor to `/tests/` in the container
 
 1. Runs inside the container after the agent completes
 2. Writes a reward to `/logs/verifier/reward.txt` as a plain decimal float (0.0-1.0)
-3. Always exits 0 (Harbor reads the reward value, not the exit code)
+3. For canonical tasks, should also write `/logs/verifier/validation_result.json`
+   following `docs/reference/VALIDATION_RESULT_SCHEMA.md`
+4. May use non-zero exit codes to distinguish scored failure from verifier/runtime failure;
+   Harbor still reads the scalar reward artifact
 
 ### Result Format
```

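The verifier contract in this hunk (scalar `reward.txt` plus a `validation_result.json` sidecar) can be sketched in Python. This is a minimal sketch, not the repository's actual verifier code: the output directory is parameterized so it runs outside a task container (real verifiers write to `/logs/verifier`), and the payload values are illustrative.

```python
import json
import os
import tempfile

def write_verifier_artifacts(out_dir, reward, payload):
    """Write the scalar compatibility artifact plus the structured sidecar."""
    os.makedirs(out_dir, exist_ok=True)
    # reward.txt: plain decimal float in [0.0, 1.0] (step 2 above).
    with open(os.path.join(out_dir, "reward.txt"), "w") as f:
        f.write(f"{reward:.4f}")
    # validation_result.json: canonical sidecar (step 3 above).
    with open(os.path.join(out_dir, "validation_result.json"), "w") as f:
        json.dump(payload, f, indent=2)

# Illustrative values; a real verifier targets /logs/verifier instead.
out_dir = tempfile.mkdtemp()
write_verifier_artifacts(out_dir, 0.75, {
    "schema_version": "validation_result.v1alpha1",
    "status": "scored",
    "scorable": True,
    "scorer_family": "test_ratio",
    "reward": 0.75,
    "pass_threshold": 0.0,
    "passed": True,
    "output_contract": {"mode": "repo_state", "primary_path": None,
                        "required_artifact": False},
    "sub_scores": {},
    "failure": None,
})
```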
docs/EVALUATION_PIPELINE.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -50,8 +50,11 @@ Harbor run output (result.json, transcript)
 
 Every task ships a `tests/test.sh` (SDLC tasks) or `tests/eval.sh` (Org
 tasks) that runs inside the Docker container after the agent finishes. The
-verifier writes a reward (0.0–1.0) to `/logs/verifier/reward.txt` and exits 0
-on success, non-zero on failure.
+verifier writes a reward (0.0–1.0) to `/logs/verifier/reward.txt`. Canonical
+tasks should also emit `/logs/verifier/validation_result.json` using the schema
+in [docs/reference/VALIDATION_RESULT_SCHEMA.md](reference/VALIDATION_RESULT_SCHEMA.md)
+so downstream reporting can preserve scorer family, pass semantics, and failure
+context.
 
 Verifier types are documented in [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md).
```

docs/REPORT_CONTEXT.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -135,6 +135,9 @@ The evaluation uses a multi-layer pipeline:
 1. **Deterministic verifier** (every task): Task-specific `test.sh` or
    `eval.sh` runs inside the Docker container after the agent finishes.
    Produces a reward score (0.0--1.0) written to `/logs/verifier/reward.txt`.
+   Canonical tasks are converging on a paired
+   `/logs/verifier/validation_result.json` sidecar so scorer family,
+   pass/fail semantics, sub-scores, and invalid-output context are preserved.
 
 2. **Optional LLM judge**: Post-hoc qualitative scoring across five
    dimensions (correctness 0.30, completeness 0.25, code quality 0.20,
```

docs/SCORING_SEMANTICS.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -14,6 +14,11 @@ How each benchmark is scored, what the numbers mean, and known limitations.
 | **ordering** | 0.0–1.0 continuous | Position-exact-match blended with rank correlation |
 | **external** | 0.0–1.0 continuous | External verifier (e.g., TheAgentCompany eval) |
 
+Canonical tasks should normalize these families into
+`/logs/verifier/validation_result.json` using
+`docs/reference/VALIDATION_RESULT_SCHEMA.md`. The reward type determines the
+meaning of `reward` and `sub_scores`, but not the top-level contract.
+
 ## Per-Verifier Scoring (Active Suites)
 
 Tasks are organized into 8 SDLC-phase suites (`csb_sdlc_understand` through `csb_sdlc_debug`)
```

docs/reference/README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@ Stable specifications and policy/reference documents.
 ## Evaluation / Scoring
 - `docs/SCORING_SEMANTICS.md`
 - `docs/EVALUATION_PIPELINE.md`
+- `docs/reference/VALIDATION_RESULT_SCHEMA.md`
 - `docs/RETRIEVAL_EVAL_SPEC.md`
 - `docs/ORG_CALIBRATION.md`
```

docs/reference/TASK_CONTRACT.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -70,10 +70,17 @@ Missing required output is an invalid run when the verifier cannot establish
 that the agent actually produced a scorable answer. It should not silently look
 like a clean `reward=0.0` benchmark miss.
 
+Canonical tasks should treat `reward.txt` as the scalar compatibility artifact
+and `validation_result.json` as the semantic verifier contract. The JSON
+sidecar is where verifiers should record scorer family, pass semantics,
+sub-scores, and failure context.
+
 At minimum, verifiers should:
 
 - emit a clear error for missing required output
 - write reward artifacts only after classifying whether the run was scorable
+- for canonical tasks, also write `/logs/verifier/validation_result.json`
+  using `docs/reference/VALIDATION_RESULT_SCHEMA.md`
 - avoid hardcoded assumptions like `cd /app` unless that is the published task contract
 
 ## Image Variant Parity
```
docs/reference/VALIDATION_RESULT_SCHEMA.md

Lines changed: 151 additions & 0 deletions (new file)

# Validation Result Schema

Canonical tasks should converge on a structured verifier sidecar at
`/logs/verifier/validation_result.json` in addition to the required
`/logs/verifier/reward.txt`.

This schema standardizes verifier semantics across scalar-only shell verifiers,
answer.json artifact verifiers, repo-state verifiers, and oracle-based promoted
tasks. It is intentionally simple enough to emit from shell or Python.

## Required Top-Level Fields

Every canonical `validation_result.json` should emit these keys, even when the
verifier fails or the run is invalid:

| Field | Type | Meaning |
|-------|------|---------|
| `schema_version` | string | Schema identifier. Current proposal: `validation_result.v1alpha1` |
| `status` | string | One of `scored`, `invalid_output`, `verifier_error` |
| `scorable` | boolean | `true` when the verifier had enough task output to score the run |
| `scorer_family` | string | Normalized verifier family (`oracle_checks`, `test_ratio`, `f1_hybrid`, etc.) |
| `reward` | number | Canonical scalar reward in `[0.0, 1.0]` |
| `pass_threshold` | number | Family/task policy threshold associated with `passed` |
| `passed` | boolean | Authoritative pass/fail flag for the task outcome |
| `output_contract` | object | Published verifier-facing output mode and primary artifact path |
| `sub_scores` | object | Per-check or per-component scores. Use `{}` when the family is scalar-only |
| `failure` | object or `null` | Structured failure/error context. `null` for scored runs |

Downstream consumers should treat `passed` as authoritative. `pass_threshold`
is included so reporting can preserve task policy, but parsers should not
recompute `passed` from `reward` alone.
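A downstream parser can enforce the required-keys table with a small check. This is a sketch under stated assumptions: `check_top_level` is a hypothetical helper name, and the key list simply mirrors the table above.

```python
# Required top-level keys and their expected Python types, per the table
# above. `failure` is handled separately because it may be null.
REQUIRED_KEYS = {
    "schema_version": str,
    "status": str,
    "scorable": bool,
    "scorer_family": str,
    "reward": (int, float),
    "pass_threshold": (int, float),
    "passed": bool,
    "output_contract": dict,
    "sub_scores": dict,
}

def check_top_level(payload):
    """Return a list of contract violations (empty when compliant)."""
    problems = []
    for key, typ in REQUIRED_KEYS.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], typ):
            problems.append(f"wrong type for {key}")
    # `failure` is required but may be null (object or null).
    if "failure" not in payload:
        problems.append("missing key: failure")
    elif payload["failure"] is not None and not isinstance(payload["failure"], dict):
        problems.append("wrong type for failure")
    if payload.get("status") not in ("scored", "invalid_output", "verifier_error"):
        problems.append("unknown status")
    return problems
```

Note the check deliberately does not recompute `passed` from `reward` and `pass_threshold`; it only validates shape, leaving `passed` authoritative.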
## Required `output_contract` Fields

`output_contract` should always contain:

| Field | Type | Meaning |
|-------|------|---------|
| `mode` | string | One of `answer_json_native`, `answer_json_bridge`, `repo_state`, `solution_json`, `report_markdown`, `unspecified` |
| `primary_path` | string or `null` | Primary artifact path the verifier expected, if any |
| `required_artifact` | boolean | Whether a missing primary artifact makes the run unscorable |

## Failure Object

When `status != "scored"`, `failure` should be populated with:

| Field | Type | Meaning |
|-------|------|---------|
| `code` | string | Stable machine-readable error code (`missing_required_output`, `invalid_json`, `verifier_exception`, etc.) |
| `message` | string | Human-readable summary |
| `stage` | string | Usually `output_validation`, `scoring`, or `verifier_runtime` |

`reward` should still be written as `0.0` so existing Harbor/reporting flows do
not break while migration is in progress.
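One way to guarantee a populated `failure` object when the verifier itself crashes is to wrap the family-specific scoring step, as in this sketch. The `score_run` hook and the fallback values (`scorer_family`, the `unspecified` contract) are placeholders, not part of the schema.

```python
def run_verifier(score_run):
    """Call a family-specific scoring hook; on any exception, degrade to a
    canonical verifier_error payload with reward 0.0 so legacy readers of
    the scalar reward keep working."""
    try:
        return score_run()
    except Exception as exc:  # verifier bug or runtime failure
        return {
            "schema_version": "validation_result.v1alpha1",
            "status": "verifier_error",
            "scorable": False,
            "scorer_family": "unspecified",  # placeholder fallback
            "reward": 0.0,
            "pass_threshold": 0.0,
            "passed": False,
            "output_contract": {"mode": "unspecified", "primary_path": None,
                                "required_artifact": False},
            "sub_scores": {},
            "failure": {"code": "verifier_exception", "message": str(exc),
                        "stage": "verifier_runtime"},
        }
```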
## Optional Fields

These fields are recommended when available:

- `details`: family-specific raw verifier data or diagnostics
- `artifacts`: paths to helper files emitted by the verifier
- `timing`: verifier-local timing if a family captures it
- `legacy`: compatibility payloads preserved for existing readers

## Family Mapping

Current canonical verifier families should map into the schema as follows:

| Family | Canonical `reward` | Recommended `sub_scores` |
|--------|--------------------|--------------------------|
| `oracle_checks` | suite-weighted composite | one entry per oracle check (`file_set_match`, `symbol_resolution`, `dependency_chain`, `keyword_presence`, `provenance`, `json_schema_match`, `test_ratio`) |
| `semantic_retrieval_qa` | primary QA correctness score | `correct_function`, `correct_path`, `justification_score` |
| `f1_hybrid` | blended detection/fix score | `detection_f1`, `fix_score` |
| `f1` | F1 score | `precision`, `recall`, `f1` |
| `test_ratio` | pass ratio | `tests_passed_ratio` plus counts in `details` |
| `diff_similarity` | diff similarity composite | `file_recall`, `line_recall`, `line_precision` |
| `semantic_similarity` | similarity score | `similarity` |
| `checklist` | weighted checklist score | stable check ids under `sub_scores.checks.*` |
| `ir_checklist` | blended retrieval/checklist score | retrieval-oriented check ids under `sub_scores.checks.*` |
| `find_and_prove` | composite regression-proof score | stable assertion ids under `sub_scores.checks.*` |
| `repo_state_heuristic` | repo-state score | stable assertion ids under `sub_scores.checks.*` |
| `continuous` | family-defined continuous score | `continuous_score` or family-specific metric key |
| `binary` | `0.0` or `1.0` | `binary_pass` |

The family determines the expected `sub_scores` shape, but not the top-level
contract. The top-level keys stay fixed across all families.
contract. The top-level keys stay fixed across all families.

## Migration Rules

Normalize current canonical tasks using these rules:

1. Existing `validation_result.json` payloads:
   Preserve legacy fields if needed, but add the canonical required keys.
2. Existing `reward.json` payloads:
   Wrap the scalar score into the canonical schema and keep the old file only as
   a temporary compatibility artifact.
3. `reward.txt`-only tasks:
   Keep `reward.txt`, add `validation_result.json`, and emit `sub_scores={}` if
   the family has no natural breakdown today.
4. Missing required task output:
   Emit `status="invalid_output"`, `scorable=false`, `reward=0.0`,
   `passed=false`, and a populated `failure` object.
5. Verifier exceptions:
   Emit `status="verifier_error"`, `scorable=false`, `reward=0.0`,
`passed=false`, and a populated `failure` object.
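Migration rule 3 can be sketched as a small wrapper that reads the legacy scalar and emits the canonical sidecar. Paths are parameterized for illustration (real tasks use `/logs/verifier/...`), and the pass policy shown is a placeholder rather than any task's published threshold.

```python
import json
import os
import tempfile

def wrap_legacy_reward(reward_txt_path, sidecar_path, scorer_family="continuous"):
    """Migration rule 3: keep reward.txt, add validation_result.json with
    sub_scores={} when the family has no natural breakdown."""
    reward = float(open(reward_txt_path).read().strip())
    payload = {
        "schema_version": "validation_result.v1alpha1",
        "status": "scored",
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": 0.0,
        "passed": reward > 0.0,  # placeholder policy, not task policy
        "output_contract": {"mode": "unspecified", "primary_path": None,
                            "required_artifact": False},
        "sub_scores": {},
        "failure": None,
    }
    with open(sidecar_path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload

# Demo against a throwaway reward.txt.
work = tempfile.mkdtemp()
reward_txt = os.path.join(work, "reward.txt")
with open(reward_txt, "w") as f:
    f.write("0.5\n")
payload = wrap_legacy_reward(reward_txt, os.path.join(work, "validation_result.json"))
```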
## Minimal Scored Example

```json
{
  "schema_version": "validation_result.v1alpha1",
  "status": "scored",
  "scorable": true,
  "scorer_family": "test_ratio",
  "reward": 0.75,
  "pass_threshold": 0.0,
  "passed": true,
  "output_contract": {
    "mode": "repo_state",
    "primary_path": null,
    "required_artifact": false
  },
  "sub_scores": {},
  "failure": null
}
```

## Minimal Invalid-Output Example

```json
{
  "schema_version": "validation_result.v1alpha1",
  "status": "invalid_output",
  "scorable": false,
  "scorer_family": "oracle_checks",
  "reward": 0.0,
  "pass_threshold": 0.0,
  "passed": false,
  "output_contract": {
    "mode": "answer_json_native",
    "primary_path": "/workspace/answer.json",
    "required_artifact": true
  },
  "sub_scores": {},
  "failure": {
    "code": "missing_required_output",
    "message": "answer.json not found at /workspace/answer.json",
    "stage": "output_validation"
  }
}
```
