fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702) by noahgift · Pull Request #1874 · paiml/aprender

noahgift · 2026-05-22T07:47:41Z

Summary

Defect 3 from the PMAT-701 5-whys cascade. `apr eval --task humaneval` was silently reporting `pass@1 = 1.0` (164/164 problems "passed") whenever inference failed — a false positive that masked the Phase 4 Stage D no-KD training run from operators for 2 days.

The bug

When inference failed for ALL samples, the legacy code "fell back to structural validation" and marked every problem with a non-empty `canonical_solution` as `pass=1`. The `mode: "structural"` field was the only signal, easy to miss in automated JSON parsing.

Verified on gx10 against the known-gibberish 10K Stage D checkpoint (PMAT-701 origin):

```json
{
"passed": 164,
"pass_at_k": [{ "k": 1, "rate": 1.0 }, ...],
"mode": "structural"
}
```
exit code: `0`

Fix

`crates/apr-cli/src/commands/eval/inference.rs`:

`run_humaneval` (the bug site, ~line 134): emit structured `mode: "inference_failed"` JSON with `inference_error` populated and pass counters all zero, then return `CliError::InferenceFailed` (exit code 8).
`run_mbpp` (~line 1513): MBPP already returned Err on failure but didn't emit JSON. Aligned to the same shape per FT-EVAL-FAILURE-003 parity falsifier.

`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`:

Removed misleading "falling back to structural validation" log line.

Contract

`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` (validates clean via `pv validate`):

3 equations: pass@k definition, failure-signal coupling, per-problem counter invariant
4 falsifiers: FT-EVAL-FAILURE-001 (broken model→0), 002 (healthy model→real), 003 (humaneval/mbpp parity), 004 (no dataset-side marking)
2 Kani harnesses; qa_gate F-EVAL-FAILURE-001

Verification on gx10

`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:

```json
{
"passed": 0,
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, { "k": 100, "rate": 0.0 }],
"mode": "inference_failed",
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
}
```
exit code: `8` (`CliError::InferenceFailed`)

Cascade context

This closes the third defect from the original PMAT-701 finding. Combined with #1863 (allocator), #1869 (Q4K teacher), and #1871 (spec + dispatch default), the full distill → eval pipeline is now honest end-to-end on Grace Blackwell.

Test plan

`cargo build --release --features cuda -p apr-cli` — clean
`cargo fmt -p apr-cli --check` — clean
`pv validate contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` — clean
FT-EVAL-FAILURE-001 verified on gx10 (gibberish checkpoint → `pass@1=0.0`, exit 8)
CI: `ci / gate` + `workspace-test` green
Follow-up: run on a known-healthy 7B Q4K teacher to verify FT-EVAL-FAILURE-002 (real `pass@k` still works)

🤖 Generated with Claude Code

…@1=1.0 on broken models (PMAT-702) Defect 3 from the PMAT-701 5-whys cascade. The legacy `apr eval --task humaneval` path silently "fell back to structural validation" when inference failed for all samples — marking every problem with a non-empty canonical_solution as pass=1. On a completely broken model (e.g. the 10K Stage D gibberish checkpoint that hid the no-KD training run), this produced: pass_at_k[0].rate = 1.0 (164/164 problems "passed") mode = "structural" exit_code = 0 …with only the `mode` field signaling anything was wrong. JSON-parsing CI gates and eval dashboards that key off `pass@1` saw a fully-passing model when nothing had actually run. ## Fix `crates/apr-cli/src/commands/eval/inference.rs` * `run_humaneval` (the bug site, ~line 134): when inference fails for all samples, emit a structured `mode: "inference_failed"` JSON with `inference_error` populated and pass counters all zero, then return `CliError::InferenceFailed` (exit code 8). No more dataset-side pass-marking. * `run_mbpp` (~line 1513): MBPP was already returning Err on failure but did not emit JSON. Aligned to the same shape for parity per the contract's FT-EVAL-FAILURE-003 falsifier — `mode: "inference_failed"`, `inference_error` populated, exit code 8. `crates/apr-cli/src/commands/eval/code_eval.rs:425-428` * Removed misleading "falling back to structural validation" log line. The new caller path is the source of truth; the loop just aborts. ## Contract `contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` * 3 equations: `pass_at_k_definition`, `inference_failure_signal`, `per_problem_pass_counter_invariant`. * 4 falsifiers (FT-EVAL-FAILURE-001..004) — broken model → pass@1=0; healthy model → real pass@k; humaneval/mbpp parity; no dataset-side pass-marking. * 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001. * Validates clean: `pv validate` reports 0 errors, 0 warnings. ## Verification on gx10 `evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`: "passed": 0 "pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, ...] "mode": "inference_failed" "inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)" …and exit code 8 (CliError::InferenceFailed). Pre-fix the same command emitted pass@1=1.0 with exit code 0. ## Why this matters This was the third defect in the original PMAT-701 finding. Without it, the Phase 4 Stage D no-KD training run would have continued to pass HumanEval gates with a 1.0 false positive even after PMAT-701 Bug A+B landed. Combined with PR #1863 (allocator) + PR #1869 (Q4K teacher) + PR #1871 (spec amendment + dispatch default), the full distill→eval pipeline is now honest end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 22, 2026 07:47

Merge branch 'main' into fix/apr-eval-inference-failure-pmat-702

cf0b4ea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874

fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874
noahgift wants to merge 2 commits into
mainfrom
fix/apr-eval-inference-failure-pmat-702

noahgift commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 22, 2026

Summary

The bug

Fix

Contract

Verification on gx10

Cascade context

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant