fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702) #1874

Open
noahgift wants to merge 2 commits into main from fix/apr-eval-inference-failure-pmat-702

@noahgift
Contributor

Summary

Defect 3 from the PMAT-701 5-whys cascade. `apr eval --task humaneval` was silently reporting `pass@1 = 1.0` (164/164 problems "passed") whenever inference failed — a false positive that masked the Phase 4 Stage D no-KD training run from operators for 2 days.

The bug

When inference failed for ALL samples, the legacy code "fell back to structural validation" and marked every problem with a non-empty `canonical_solution` as `pass=1`. The `mode: "structural"` field was the only signal that nothing had actually run, and automated JSON parsing that keys off `pass@1` easily misses it.

Verified on gx10 against the known-gibberish 10K Stage D checkpoint (PMAT-701 origin):

```json
{
  "passed": 164,
  "pass_at_k": [{ "k": 1, "rate": 1.0 }, ...],
  "mode": "structural"
}
```
exit code: `0`
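
The difference between the legacy fallback and the corrected behavior can be sketched in a few lines. This is an illustration only: `Problem`, `legacy_structural_passes`, and `passes_after_total_inference_failure` are hypothetical names, not the actual code in `inference.rs`.

```rust
/// Hypothetical stand-in for one HumanEval problem record.
struct Problem {
    canonical_solution: String,
}

/// Legacy behavior as described above: with no model output at all, a problem
/// still counted as passed whenever the dataset shipped a non-empty reference
/// solution. The model's quality never entered the result.
fn legacy_structural_passes(problems: &[Problem]) -> usize {
    problems
        .iter()
        .filter(|p| !p.canonical_solution.is_empty())
        .count()
}

/// Behavior after the fix: a total inference failure contributes zero passes,
/// regardless of what the dataset contains.
fn passes_after_total_inference_failure(_problems: &[Problem]) -> usize {
    0
}
```

Since every HumanEval problem ships a reference solution, the legacy count equals the dataset size, which is where the 164/164 figure comes from.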

Fix

`crates/apr-cli/src/commands/eval/inference.rs`:

  • `run_humaneval` (the bug site, ~line 134): emit structured `mode: "inference_failed"` JSON with `inference_error` populated and pass counters all zero, then return `CliError::InferenceFailed` (exit code 8).
  • `run_mbpp` (~line 1513): MBPP already returned `Err` on failure but didn't emit JSON. It now emits the same JSON shape, per the FT-EVAL-FAILURE-003 parity falsifier.
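
The failure path described in the two bullets above can be sketched roughly as follows. This is a minimal, std-only illustration under stated assumptions: the shape of `CliError` (a single variant carrying a message), the helpers `json_escape`, `failure_report`, and `fail_eval`, and the hand-built JSON are all hypothetical. The real code may well use a serializer and a richer error type.

```rust
/// Hypothetical error type; only the variant name and exit code 8 come from the PR.
#[derive(Debug, PartialEq)]
enum CliError {
    InferenceFailed(String),
}

impl CliError {
    /// Maps the error to the process exit code reported in the PR.
    fn exit_code(&self) -> i32 {
        match self {
            CliError::InferenceFailed(_) => 8,
        }
    }
}

/// Escapes a string for embedding in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the failure report: zero passes, a zero rate for every k,
/// `mode: "inference_failed"`, and the underlying error message.
fn failure_report(ks: &[u32], error: &str) -> String {
    let rates: Vec<String> = ks
        .iter()
        .map(|k| format!("{{ \"k\": {}, \"rate\": 0.0 }}", k))
        .collect();
    format!(
        "{{ \"passed\": 0, \"pass_at_k\": [{}], \"mode\": \"inference_failed\", \"inference_error\": \"{}\" }}",
        rates.join(", "),
        json_escape(error)
    )
}

/// Shared by both tasks in this sketch: produce the report to emit,
/// plus the error the caller should return.
fn fail_eval(ks: &[u32], error: &str) -> (String, CliError) {
    (
        failure_report(ks, error),
        CliError::InferenceFailed(error.to_string()),
    )
}
```

Routing both `run_humaneval` and `run_mbpp` through one helper like this is one way to keep the two outputs in the same shape; whether the PR does so is not stated.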

`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`:

  • Removed the misleading "falling back to structural validation" log line.

Contract

`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` (validates clean via `pv validate`):

  • 3 equations: pass@k definition, failure-signal coupling, per-problem counter invariant
  • 4 falsifiers: FT-EVAL-FAILURE-001 (broken model→0), 002 (healthy model→real), 003 (humaneval/mbpp parity), 004 (no dataset-side marking)
  • 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001
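
The contract's equations are not reproduced in this PR. As a point of reference, the sketch below assumes `pass_at_k_definition` is the standard unbiased estimator from the HumanEval benchmark, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass. The function name and signature are hypothetical.

```rust
/// Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k),
/// evaluated as a running product to avoid large binomial coefficients.
///
/// n: samples generated, c: samples that passed, k: the k in pass@k.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    assert!(c <= n, "passing samples cannot exceed total samples");
    if c == 0 {
        // No passing sample: the rate is exactly zero. This is the case
        // a total inference failure must land in.
        return 0.0;
    }
    if n - c < k {
        // Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0;
    }
    let all_fail: f64 = ((n - c + 1)..=n)
        .map(|i| 1.0 - k as f64 / i as f64)
        .product();
    1.0 - all_fail
}
```

Under this definition a nonzero rate requires at least one sample that actually passed, which is the coupling between the failure signal and the pass counters that the falsifiers check.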

Verification on gx10

`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:

```json
{
  "passed": 0,
  "pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, { "k": 100, "rate": 0.0 }],
  "mode": "inference_failed",
  "inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
}
```
exit code: `8` (`CliError::InferenceFailed`)
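
A downstream gate can defend itself by checking the exit code and `mode` alongside the rate. The sketch below is an assumption about how such a consumer might look, not part of this PR: `EvalSummary`, `gate_accepts`, the threshold, and the healthy-run mode string `"inference"` used in the usage note are all hypothetical.

```rust
/// Hypothetical subset of the eval JSON that a gate cares about.
struct EvalSummary {
    mode: String,
    pass_at_1: f64,
}

/// Accepts a run only if the process exited cleanly, inference actually ran,
/// and pass@1 meets the threshold.
fn gate_accepts(summary: &EvalSummary, exit_code: i32, threshold: f64) -> bool {
    // Both modes below mean no model output was scored.
    let ran_inference = !matches!(
        summary.mode.as_str(),
        "inference_failed" | "structural"
    );
    exit_code == 0 && ran_inference && summary.pass_at_1 >= threshold
}
```

With this check, the pre-fix output (`mode: "structural"`, rate 1.0, exit 0) and the post-fix output (`mode: "inference_failed"`, rate 0.0, exit 8) are both rejected, while a run with a hypothetical `mode: "inference"` and a rate above the threshold is accepted.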

Cascade context

This closes the third defect from the original PMAT-701 finding. Combined with #1863 (allocator), #1869 (Q4K teacher), and #1871 (spec + dispatch default), the full distill → eval pipeline is now honest end-to-end on Grace Blackwell.

Test plan

  • `cargo build --release --features cuda -p apr-cli` — clean
  • `cargo fmt -p apr-cli --check` — clean
  • `pv validate contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` — clean
  • FT-EVAL-FAILURE-001 verified on gx10 (gibberish checkpoint → `pass@1=0.0`, exit 8)
  • CI: `ci / gate` + `workspace-test` green
  • Follow-up: run on a known-healthy 7B Q4K teacher to verify FT-EVAL-FAILURE-002 (real `pass@k` still works)

🤖 Generated with Claude Code

…@1=1.0 on broken models (PMAT-702)

Defect 3 from the PMAT-701 5-whys cascade. The legacy `apr eval --task humaneval`
path silently "fell back to structural validation" when inference failed for
all samples — marking every problem with a non-empty canonical_solution as
pass=1. On a completely broken model (e.g. the 10K Stage D gibberish
checkpoint that hid the no-KD training run), this produced:

  pass_at_k[0].rate = 1.0  (164/164 problems "passed")
  mode = "structural"
  exit_code = 0

…with only the `mode` field signaling anything was wrong. JSON-parsing CI
gates and eval dashboards that key off `pass@1` saw a fully-passing model
when nothing had actually run.

## Fix

`crates/apr-cli/src/commands/eval/inference.rs`

* `run_humaneval` (the bug site, ~line 134): when inference fails for all
  samples, emit a structured `mode: "inference_failed"` JSON with
  `inference_error` populated and pass counters all zero, then return
  `CliError::InferenceFailed` (exit code 8). No more dataset-side
  pass-marking.
* `run_mbpp` (~line 1513): MBPP was already returning Err on failure but
  did not emit JSON. Aligned to the same shape for parity per the
  contract's FT-EVAL-FAILURE-003 falsifier — `mode: "inference_failed"`,
  `inference_error` populated, exit code 8.

`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`

* Removed misleading "falling back to structural validation" log line.
  The new caller path is the source of truth; the loop just aborts.

## Contract

`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml`

* 3 equations: `pass_at_k_definition`, `inference_failure_signal`,
  `per_problem_pass_counter_invariant`.
* 4 falsifiers (FT-EVAL-FAILURE-001..004) — broken model → pass@1=0;
  healthy model → real pass@k; humaneval/mbpp parity; no dataset-side
  pass-marking.
* 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001.
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:

  "passed": 0
  "pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, ...]
  "mode": "inference_failed"
  "inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"

…and exit code 8 (CliError::InferenceFailed). Before the fix, the same
command emitted pass@1=1.0 with exit code 0.

## Why this matters

This was the third defect in the original PMAT-701 finding. Without this
fix, the Phase 4 Stage D no-KD training run would have continued to pass
HumanEval gates with a 1.0 false positive even after PMAT-701 Bug A+B
landed. Combined with PR #1863 (allocator) + PR #1869 (Q4K teacher) +
PR #1871 (spec amendment + dispatch default), the full distill→eval
pipeline is now honest end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@noahgift enabled auto-merge (squash) May 22, 2026 07:47