fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS#1890
Closed
noahgift wants to merge 1 commit into
The "cuBLAS FP8 7B Q4K gibberish" surfaced by `apr qa` Golden Output is NOT a numerical bug; it's a missing-stop-tokens config in the gate itself.

## Five-whys

1. **Why does `apr qa <7B Q4K>` Golden Output FAIL with `<|im_start|>` repeats?** Generation runs the full 512 max_tokens budget without ever stopping.
2. **Why does it never stop?** `QuantizedGenerateConfig::default()` initializes `stop_tokens: Vec::new()`, and the gate's config used `..Default::default()` without overriding it.
3. **Why is the output `<|im_start|>`?** With no EOS gate, after the assistant produces "4" (the real answer), the model continues generating from in-distribution tokens. ChatML training makes `<|im_start|>` highly probable in continuation contexts → degenerate repeat pattern.
4. **Why does `apr serve` produce correct output on the same model?** `cuda_chat_backend.rs:113` correctly sets `stop_tokens: vec![eos_token_id]`, so generation terminates at `<|im_end|>` and the user sees "4".
5. **Why didn't I catch this before chasing a deep numerical bug?** I assumed the contract gate was correctly configured and that its FAIL meant the model was broken. Falsification rule: ALWAYS test whether the simple hypothesis (config gap) explains the data before assuming complex causes (FP8 algorithm/scale).

## Live discharge on 7B Q4K GGUF (noah-Lambda-Vector RTX 4090)

Pre-fix: `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish`

Post-fix: `✓ PASS Golden Output 2 golden test cases passed`

The model was always producing the correct first token "4". The test gate just kept asking it to generate beyond that, into noise.

## What this revokes

- v0.35.0 release HELD justification was based on a phantom: UNBLOCKS.
- SPEC-CUBLAS-FP8-7B-FIX-001 Stages C-F are no longer needed; Stage A's reproducer + Stage B's per-layer instrumentation remain useful general diagnostics but not for THIS bug. Spec status → SUPERSEDED.
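The failure mode described in the five-whys can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the project's code: `GenConfig`, `toy_next_token`, and `generate` are hypothetical stand-ins, the `ANSWER` token ID is a placeholder, and the `<|im_start|>` ID is assumed. Only the `<|im_end|>` ID (151645) comes from the text above. The toy "model" emits the answer, then `<|im_end|>`, then `<|im_start|>` forever, which is roughly the behaviour the gate observed.

```rust
// Hypothetical stand-ins for illustration; not the real aprender types.
const ANSWER: u32 = 19; // placeholder ID for the token "4"
const IM_START: u32 = 151_644; // assumed ID for <|im_start|>
const IM_END: u32 = 151_645; // EOS ID quoted in the PR

#[derive(Debug, Clone)]
struct GenConfig {
    max_tokens: usize,
    stop_tokens: Vec<u32>,
}

impl Default for GenConfig {
    fn default() -> Self {
        // Mirrors the described default: no stop tokens at all.
        Self { max_tokens: 32, stop_tokens: Vec::new() }
    }
}

// Toy model: answer first, then <|im_end|>, then <|im_start|> repeatedly.
fn toy_next_token(history: &[u32]) -> u32 {
    match history.last() {
        None => ANSWER,
        Some(&ANSWER) => IM_END,
        Some(_) => IM_START,
    }
}

// Greedy loop that stops on a stop token or when the budget is exhausted.
fn generate(cfg: &GenConfig) -> Vec<u32> {
    let mut out = Vec::new();
    while out.len() < cfg.max_tokens {
        let tok = toy_next_token(&out);
        if cfg.stop_tokens.contains(&tok) {
            break;
        }
        out.push(tok);
    }
    out
}

fn main() {
    // Struct-update syntax silently inherits the empty stop_tokens.
    let before = GenConfig { max_tokens: 512, ..Default::default() };
    let after = GenConfig { stop_tokens: vec![IM_END], ..before.clone() };

    println!("without stop tokens: {} tokens", generate(&before).len());
    println!("with stop tokens:    {:?}", generate(&after));
}
```

With the empty default, the loop only ends when the budget runs out; with the EOS ID set, it ends right after the answer token.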
## What's still true

- The reproducer in Stage A correctly observes ~3e-3 abs-diff per element between CPU vs cuBLAS FP8 single-step forward. This is **expected FP8 precision**, not a defect. Stage B documented it as "FP8 precision floor", which is correct, just not the cause of any user-visible bug.
- The wgpu side of #1864 (PR #1876 multi-step parity gate) remains a real improvement: it catches autoregressive drift that would otherwise produce silent gibberish on wgpu-only paths.

## Files

- `crates/apr-cli/src/commands/golden_output.rs`: add `stop_tokens: vec![specials.eos_id]` to both the CPU `golden_output_gguf_cpu` and GPU `validate_gpu_golden_output` gen_configs. Uses `SpecialTokens::qwen2().eos_id` (151645).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
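For the ~3e-3 per-element difference noted under "What's still true", a parity check usually reduces to a max-abs-diff comparison against a tolerance. The sketch below is a generic illustration, not the Stage A reproducer: `max_abs_diff`, `within_tolerance`, the sample values, and the `5e-3` tolerance are all assumptions chosen for the example.

```rust
// Hypothetical helper: largest element-wise absolute difference.
// Returns None when the slices are empty or differ in length.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0_f32, f32::max),
    )
}

// True when the two outputs agree to within `tol` on every element.
fn within_tolerance(a: &[f32], b: &[f32], tol: f32) -> bool {
    matches!(max_abs_diff(a, b), Some(d) if d <= tol)
}

fn main() {
    // Made-up values standing in for CPU and FP8 GPU outputs.
    let cpu = [0.125_f32, -0.5, 1.0];
    let gpu = [0.127_f32, -0.498, 1.003];

    // Assumed tolerance, picked slightly above the ~3e-3 figure.
    let tol = 5e-3;
    println!("max abs diff: {:?}", max_abs_diff(&cpu, &gpu));
    println!("within {tol}: {}", within_tolerance(&cpu, &gpu, tol));
}
```

A check like this bounds the error of a single forward step only; it says nothing about whether generation terminates, which is why it could pass while the gate still failed.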
Auto-merge was automatically disabled on May 22, 2026 14:59 (pull request was closed).

noahgift added a commit that referenced this pull request on May 22, 2026:
* chore(fmt): cargo fmt --all (release v0.35.0 baseline)
* chore: README drift fix + apr serve syntax (#1873)
* feat(cublas-fp8): Stage A deterministic reproducer (#1884)
* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)
* fix(1864): Golden Output gate must set stop_tokens (#1890)
* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites + regenerate Cargo.lock.

CHANGELOG v0.35.0 entry captures the 81-commit release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line stop_tokens config gap, not a deep cuBLAS FP8 numerical bug; see feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).
TL;DR
Closes #1864. The "cuBLAS FP8 7B Q4K gibberish" was a missing-`stop_tokens` config in the QA gate, not a numerical bug. User instinct ("could this be a chat template or something simple?") was correct.

Falsification path
- `apr serve run <7B GGUF>` + curl `/v1/chat/completions` → `'2+2 equals 4.'` CORRECTLY
- gate `gen_config`: `..Default::default()` → `stop_tokens: Vec::new()`
- `apr serve` set? `cuda_chat_backend.rs:113` → `stop_tokens: vec![eos_token_id]`

Five-whys:

1. `<|im_start|>` repeats. Why?
2. `stop_tokens`, defaults to empty. Why?
3. `..Default::default()` without overriding it. Why?
4. `Default::stop_tokens = Vec::new()`. Root cause = config drift between gate and production path.

Live discharge (RTX 4090, this host)
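Pre-fix: `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish`

Post-fix: `✓ PASS Golden Output 2 golden test cases passed`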
Diff
Two 4-line changes in `crates/apr-cli/src/commands/golden_output.rs`: `bos` → `specials`, add `stop_tokens: vec![specials.eos_id]`

Total: 16 insertions, 4 deletions in 1 file. The "multi-day cuBLAS expert work" estimate was wrong; this was 5 lines.
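The shape of that change, and one way to guard against the same drift recurring, might look like the sketch below. It is an illustration only: `GenConfig`, `SpecialTokens`, `gate_config_before`, `gate_config_after`, and `check_terminates` are hypothetical names, not the actual contents of `golden_output.rs`. The only value taken from this PR is the EOS ID 151645.

```rust
// Hypothetical stand-ins; the real types live in the aprender crates.
#[derive(Debug, Clone, PartialEq)]
struct GenConfig {
    max_tokens: usize,
    stop_tokens: Vec<u32>,
}

impl Default for GenConfig {
    fn default() -> Self {
        Self { max_tokens: 32, stop_tokens: Vec::new() }
    }
}

struct SpecialTokens {
    eos_id: u32,
}

impl SpecialTokens {
    // EOS ID as quoted in the PR for Qwen2-style ChatML models.
    fn qwen2() -> Self {
        Self { eos_id: 151_645 }
    }
}

// Before: the struct-update syntax inherits the empty stop_tokens.
fn gate_config_before() -> GenConfig {
    GenConfig { max_tokens: 512, ..Default::default() }
}

// After: stop_tokens is set explicitly from the model's special tokens.
fn gate_config_after(specials: &SpecialTokens) -> GenConfig {
    GenConfig {
        max_tokens: 512,
        stop_tokens: vec![specials.eos_id],
        ..Default::default()
    }
}

// Assumed guard: refuse to run a gate whose config can never stop early.
fn check_terminates(cfg: &GenConfig) -> Result<(), String> {
    if cfg.stop_tokens.is_empty() {
        return Err("gate config has no stop tokens".to_string());
    }
    Ok(())
}

fn main() {
    let specials = SpecialTokens::qwen2();
    println!("before: {:?}", check_terminates(&gate_config_before()));
    println!("after:  {:?}", check_terminates(&gate_config_after(&specials)));
}
```

A guard of this kind turns a silent default into an explicit failure, so a gate that forgets the override fails at setup rather than by producing gibberish.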
What this revokes
What still holds
Methodology lesson
Worth saving to `memory/feedback_falsify_simple_before_deep.md`:

Test plan

- `apr qa <7B Q4K GGUF>` now PASSes Golden Output (live verified)
- `apr qa <1.5B Q4K APR>` continues to PASS (no regression)
- `apr serve run <7B> + curl` produces correct output (verified pre and post)
- `cargo test -p aprender-contracts --lib lint_passes_on_real_contracts`

🤖 Generated with Claude Code