fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874
Open
noahgift wants to merge 2 commits into
Open
fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874noahgift wants to merge 2 commits into
noahgift wants to merge 2 commits into
Conversation
…@1=1.0 on broken models (PMAT-702)
Defect 3 from the PMAT-701 5-whys cascade. The legacy `apr eval --task humaneval`
path silently "fell back to structural validation" when inference failed for
all samples — marking every problem with a non-empty canonical_solution as
pass=1. On a completely broken model (e.g. the 10K Stage D gibberish
checkpoint that hid the no-KD training run), this produced:
pass_at_k[0].rate = 1.0 (164/164 problems "passed")
mode = "structural"
exit_code = 0
…with only the `mode` field signaling anything was wrong. JSON-parsing CI
gates and eval dashboards that key off `pass@1` saw a fully-passing model
when nothing had actually run.
## Fix
`crates/apr-cli/src/commands/eval/inference.rs`
* `run_humaneval` (the bug site, ~line 134): when inference fails for all
samples, emit a structured `mode: "inference_failed"` JSON with
`inference_error` populated and pass counters all zero, then return
`CliError::InferenceFailed` (exit code 8). No more dataset-side
pass-marking.
* `run_mbpp` (~line 1513): MBPP was already returning Err on failure but
did not emit JSON. Aligned to the same shape for parity per the
contract's FT-EVAL-FAILURE-003 falsifier — `mode: "inference_failed"`,
`inference_error` populated, exit code 8.
`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`
* Removed misleading "falling back to structural validation" log line.
The new caller path is the source of truth; the loop just aborts.
## Contract
`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml`
* 3 equations: `pass_at_k_definition`, `inference_failure_signal`,
`per_problem_pass_counter_invariant`.
* 4 falsifiers (FT-EVAL-FAILURE-001..004) — broken model → pass@1=0;
healthy model → real pass@k; humaneval/mbpp parity; no dataset-side
pass-marking.
* 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001.
* Validates clean: `pv validate` reports 0 errors, 0 warnings.
## Verification on gx10
`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:
"passed": 0
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, ...]
"mode": "inference_failed"
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
…and exit code 8 (CliError::InferenceFailed). Pre-fix the same command
emitted pass@1=1.0 with exit code 0.
## Why this matters
This was the third defect in the original PMAT-701 finding. Without it,
the Phase 4 Stage D no-KD training run would have continued to pass
HumanEval gates with a 1.0 false positive even after PMAT-701 Bug A+B
landed. Combined with PR #1863 (allocator) + PR #1869 (Q4K teacher) +
PR #1871 (spec amendment + dispatch default), the full distill→eval
pipeline is now honest end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Defect 3 from the PMAT-701 5-whys cascade. `apr eval --task humaneval` was silently reporting `pass@1 = 1.0` (164/164 problems "passed") whenever inference failed — a false positive that masked the Phase 4 Stage D no-KD training run from operators for 2 days.
The bug
When inference failed for ALL samples, the legacy code "fell back to structural validation" and marked every problem with a non-empty `canonical_solution` as `pass=1`. The `mode: "structural"` field was the only signal, easy to miss in automated JSON parsing.
Verified on gx10 against the known-gibberish 10K Stage D checkpoint (PMAT-701 origin):
```json
{
"passed": 164,
"pass_at_k": [{ "k": 1, "rate": 1.0 }, ...],
"mode": "structural"
}
```
exit code: `0`
Fix
`crates/apr-cli/src/commands/eval/inference.rs`:
`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`:
Contract
`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` (validates clean via `pv validate`):
Verification on gx10
`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:
```json
{
"passed": 0,
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, { "k": 100, "rate": 0.0 }],
"mode": "inference_failed",
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
}
```
exit code: `8` (`CliError::InferenceFailed`)
Cascade context
This closes the third defect from the original PMAT-701 finding. Combined with #1863 (allocator), #1869 (Q4K teacher), and #1871 (spec + dispatch default), the full distill → eval pipeline is now honest end-to-end on Grace Blackwell.
Test plan
🤖 Generated with Claude Code