fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS by noahgift · Pull Request #1890 · paiml/aprender

noahgift · 2026-05-22T14:21:49Z

TL;DR

Closes #1864. The "cuBLAS FP8 7B Q4K gibberish" was a missing-stop_tokens config in the QA gate, not a numerical bug. User instinct ("could this be a chat template or something simple?") was correct.

Falsification path

Hypothesis	Falsifier	Result
Deep cuBLAS FP8 numerical bug	Run my Stage A reproducer	Yes, single-step argmax mismatches by 1057 vs 75311, correlation 0.987
Bug surfaces in user-visible HTTP	`apr serve run <7B GGUF>` + curl `/v1/chat/completions`	❌ User-visible path returns `'2+2 equals 4.'` CORRECTLY
Then why does Golden Output FAIL?	Inspect gate's `gen_config`	`..Default::default()` → `stop_tokens: Vec::new()`
What does `apr serve` set?	Read `cuda_chat_backend.rs:113`	`stop_tokens: vec![eos_token_id]`

Five-whys:

Golden Output FAILs with <|im_start|> repeats — why?
Generation runs the full 512 max_tokens — why?
No EOS in stop_tokens, defaults to empty — why?
The gate's gen_config uses ..Default::default() without overriding it — why?
The author of the gate didn't realize Default::stop_tokens = Vec::new(). Root cause = config drift between gate and production path.

Live discharge (RTX 4090, this host)

Pre-fix:

$ apr qa qwen2.5-coder-7b-instruct-q4_k_m.gguf
✗ FAIL Golden Output GPU output failed (CPU passed): gibberish (fragment "<|im_start|>" repeats 3+ times)

Post-fix:

$ apr qa qwen2.5-coder-7b-instruct-q4_k_m.gguf
✓ PASS Golden Output 2 golden test cases passed

Diff

Two 4-line changes in crates/apr-cli/src/commands/golden_output.rs:

CPU path (line 50-57): rename bos → specials, add stop_tokens: vec![specials.eos_id]
GPU path (line 153-170): same pattern

Total: 16 insertions, 4 deletions in 1 file. The "multi-day cuBLAS expert work" estimate was wrong — this was 5 lines.

What this revokes

v0.35.0 release HELD justification → UNBLOCKED
SPEC-CUBLAS-FP8-7B-FIX-001 Stages C-F → no longer needed; Stage A reproducer + Stage B per-layer instrumentation remain useful but not blocking. Spec status → SUPERSEDED.

What still holds

The reproducer in PR feat(cublas-fp8-7b-stage-a): deterministic reproducer for #1864 (SPEC Stage A) #1884 (Stage A) correctly observes ~3e-3 abs-diff per element between CPU vs cuBLAS FP8 single-step forward. This is expected FP8 precision, not a defect. Stage B (PR feat(cublas-fp8-7b-stage-b): per-layer parity dumps + drift signature (SPEC Stage B) #1887) documented it correctly as "FP8 precision floor".
PR fix(wgpu-parity): multi-step gate catches 7B Q4K autoregressive drift (#1864 wgpu) #1876 wgpu multi-step parity gate remains a real improvement: catches autoregressive drift that would otherwise produce silent gibberish on wgpu-only paths.

Methodology lesson

Worth saving to memory/feedback_falsify_simple_before_deep.md:

When a test gate FAILs with a complex-looking symptom (e.g. "FP8 numerical drift"), the first falsifier should be: does the user-visible code path that the test purports to verify ALSO fail, or just the test gate? If only the test fails, the bug is in the test config, not the implementation. ~2 minutes of apr serve + curl would have saved days of bisection.

Test plan

apr qa <7B Q4K GGUF> now PASSes Golden Output (live verified)
apr qa <1.5B Q4K APR> continues to PASS (no regression)
apr serve run <7B> + curl produces correct output (verified pre and post)
cargo test -p aprender-contracts --lib lint_passes_on_real_contracts
CI: workspace-test, fmt, contracts-lib

🤖 Generated with Claude Code

cuBLAS The "cuBLAS FP8 7B Q4K gibberish" surfaced by apr qa Golden Output is NOT a numerical bug — it's a missing-stop-tokens config in the gate itself. ## Five-whys 1. **Why does `apr qa <7B Q4K>` Golden Output FAIL with `<|im_start|>` repeats?** Generation runs the full 512 max_tokens budget without ever stopping. 2. **Why does it never stop?** `QuantizedGenerateConfig::default()` initializes `stop_tokens: Vec::new()`, and the gate's config used `..Default::default()` without overriding it. 3. **Why is the output `<|im_start|>`?** With no EOS gate, after the assistant produces "4" (the real answer), the model continues generating from in-distribution tokens. ChatML training makes `<|im_start|>` highly probable in continuation contexts → degenerate repeat pattern. 4. **Why does `apr serve` produce correct output on the same model?** `cuda_chat_backend.rs:113` correctly sets `stop_tokens: vec![eos_token_id]`, so generation terminates at `<|im_end|>` and the user sees "4". 5. **Why didn't I catch this before chasing a deep numerical bug?** I assumed the contract gate was correctly configured and that its FAIL meant the model was broken. Falsification rule: ALWAYS test whether the simple hypothesis (config gap) explains the data before assuming complex causes (FP8 algorithm/scale). ## Live discharge on 7B Q4K GGUF (noah-Lambda-Vector RTX 4090) Pre-fix: `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish` Post-fix: `✓ PASS Golden Output 2 golden test cases passed` The model was always producing the correct first token "4" — the test gate just kept asking it to generate beyond that, into noise. ## What this revokes - v0.35.0 release HELD justification was based on a phantom — UNBLOCKS - SPEC-CUBLAS-FP8-7B-FIX-001 Stages C-F are no longer needed; Stage A's reproducer + Stage B's per-layer instrumentation remain useful general diagnostics but not for THIS bug. Spec status → SUPERSEDED. ## What's still true - The reproducer in Stage A correctly observes ~3e-3 abs-diff per element between CPU vs cuBLAS FP8 single-step forward. This is **expected FP8 precision**, not a defect. Stage B documented it as "FP8 precision floor", which is correct — just not the cause of any user-visible bug. - The wgpu side of #1864 (PR #1876 multi-step parity gate) remains a real improvement: it catches autoregressive drift that would otherwise produce silent gibberish on wgpu-only paths. ## Files - `crates/apr-cli/src/commands/golden_output.rs`: add `stop_tokens: vec![specials.eos_id]` to both the CPU `golden_output_gguf_cpu` and GPU `validate_gpu_golden_output` gen_configs. Uses `SpecialTokens::qwen2().eos_id` (151645). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-22T14:59:07Z

Subsumed by #1894 (release PR for v0.35.0). The squash-merge into release/v0.35.0 preserves the per-PR commit message and changes — see PR #1894 commit log. Closing as superseded.

* chore(fmt): cargo fmt --all (release v0.35.0 baseline) * chore: README drift fix + apr serve syntax (#1873) * feat(cublas-fp8): Stage A deterministic reproducer (#1884) * feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887) * fix(1864): Golden Output gate must set stop_tokens (#1890) * chore(release): bump to v0.35.0 + CHANGELOG + README contract count Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites + regenerate Cargo.lock. CHANGELOG v0.35.0 entry captures the 81-commit release scope: 1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121 2. MoE (Qwen3) KV cache + streaming SSE + sampling 3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line stop_tokens config gap, not a deep cuBLAS FP8 numerical bug — see feedback_falsify_simple_before_deep.md README contract count 1151 → 1153 (post Stage A + Stage B contracts).

noahgift enabled auto-merge (squash) May 22, 2026 14:21

This was referenced May 22, 2026

spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882

Closed

release: v0.35.0 (subsumes #1873, #1884, #1887, #1890) #1894

Merged

noahgift closed this May 22, 2026

auto-merge was automatically disabled May 22, 2026 14:59
Pull request was closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS#1890

fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS#1890
noahgift wants to merge 1 commit into
mainfrom
fix/1864-golden-output-stop-tokens

noahgift commented May 22, 2026

Uh oh!

noahgift commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 22, 2026

TL;DR

Falsification path

Live discharge (RTX 4090, this host)

Diff

What this revokes

What still holds

Methodology lesson

Test plan

Uh oh!

noahgift commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant