Skip to content

fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS#1890

Closed
noahgift wants to merge 1 commit into
mainfrom
fix/1864-golden-output-stop-tokens
Closed

fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS#1890
noahgift wants to merge 1 commit into
mainfrom
fix/1864-golden-output-stop-tokens

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

TL;DR

Closes #1864. The "cuBLAS FP8 7B Q4K gibberish" was a missing-stop_tokens config in the QA gate, not a numerical bug. User instinct ("could this be a chat template or something simple?") was correct.

Falsification path

Hypothesis Falsifier Result
Deep cuBLAS FP8 numerical bug Run my Stage A reproducer Yes, single-step argmax mismatches by 1057 vs 75311, correlation 0.987
Bug surfaces in user-visible HTTP apr serve run <7B GGUF> + curl /v1/chat/completions ❌ User-visible path returns '2+2 equals 4.' CORRECTLY
Then why does Golden Output FAIL? Inspect gate's gen_config ..Default::default()stop_tokens: Vec::new()
What does apr serve set? Read cuda_chat_backend.rs:113 stop_tokens: vec![eos_token_id]

Five-whys:

  1. Golden Output FAILs with <|im_start|> repeats — why?
  2. Generation runs the full 512 max_tokens — why?
  3. No EOS in stop_tokens, defaults to empty — why?
  4. The gate's gen_config uses ..Default::default() without overriding it — why?
  5. The author of the gate didn't realize Default::stop_tokens = Vec::new(). Root cause = config drift between gate and production path.

Live discharge (RTX 4090, this host)

Pre-fix:

$ apr qa qwen2.5-coder-7b-instruct-q4_k_m.gguf
✗ FAIL Golden Output GPU output failed (CPU passed): gibberish (fragment "<|im_start|>" repeats 3+ times)

Post-fix:

$ apr qa qwen2.5-coder-7b-instruct-q4_k_m.gguf
✓ PASS Golden Output 2 golden test cases passed

Diff

Two 4-line changes in crates/apr-cli/src/commands/golden_output.rs:

  • CPU path (line 50-57): rename bosspecials, add stop_tokens: vec![specials.eos_id]
  • GPU path (line 153-170): same pattern

Total: 16 insertions, 4 deletions in 1 file. The "multi-day cuBLAS expert work" estimate was wrong — this was 5 lines.

What this revokes

  • v0.35.0 release HELD justification → UNBLOCKED
  • SPEC-CUBLAS-FP8-7B-FIX-001 Stages C-F → no longer needed; Stage A reproducer + Stage B per-layer instrumentation remain useful but not blocking. Spec status → SUPERSEDED.

What still holds

Methodology lesson

Worth saving to memory/feedback_falsify_simple_before_deep.md:

When a test gate FAILs with a complex-looking symptom (e.g. "FP8 numerical drift"), the first falsifier should be: does the user-visible code path that the test purports to verify ALSO fail, or just the test gate? If only the test fails, the bug is in the test config, not the implementation. ~2 minutes of apr serve + curl would have saved days of bisection.

Test plan

  • apr qa <7B Q4K GGUF> now PASSes Golden Output (live verified)
  • apr qa <1.5B Q4K APR> continues to PASS (no regression)
  • apr serve run <7B> + curl produces correct output (verified pre and post)
  • cargo test -p aprender-contracts --lib lint_passes_on_real_contracts
  • CI: workspace-test, fmt, contracts-lib

🤖 Generated with Claude Code

 cuBLAS

The "cuBLAS FP8 7B Q4K gibberish" surfaced by apr qa Golden Output is NOT a
numerical bug — it's a missing-stop-tokens config in the gate itself.

## Five-whys

1. **Why does `apr qa <7B Q4K>` Golden Output FAIL with `<|im_start|>` repeats?**
   Generation runs the full 512 max_tokens budget without ever stopping.
2. **Why does it never stop?** `QuantizedGenerateConfig::default()` initializes
   `stop_tokens: Vec::new()`, and the gate's config used `..Default::default()`
   without overriding it.
3. **Why is the output `<|im_start|>`?** With no EOS gate, after the assistant
   produces "4" (the real answer), the model continues generating from
   in-distribution tokens. ChatML training makes `<|im_start|>` highly probable
   in continuation contexts → degenerate repeat pattern.
4. **Why does `apr serve` produce correct output on the same model?**
   `cuda_chat_backend.rs:113` correctly sets `stop_tokens: vec![eos_token_id]`,
   so generation terminates at `<|im_end|>` and the user sees "4".
5. **Why didn't I catch this before chasing a deep numerical bug?**
   I assumed the contract gate was correctly configured and that its FAIL
   meant the model was broken. Falsification rule: ALWAYS test whether the
   simple hypothesis (config gap) explains the data before assuming
   complex causes (FP8 algorithm/scale).

## Live discharge on 7B Q4K GGUF (noah-Lambda-Vector RTX 4090)

Pre-fix:  `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish`
Post-fix: `✓ PASS Golden Output 2 golden test cases passed`

The model was always producing the correct first token "4" — the test gate
just kept asking it to generate beyond that, into noise.

## What this revokes

- v0.35.0 release HELD justification was based on a phantom — UNBLOCKS
- SPEC-CUBLAS-FP8-7B-FIX-001 Stages C-F are no longer needed; Stage A's
  reproducer + Stage B's per-layer instrumentation remain useful general
  diagnostics but not for THIS bug. Spec status → SUPERSEDED.

## What's still true

- The reproducer in Stage A correctly observes ~3e-3 abs-diff per element
  between CPU vs cuBLAS FP8 single-step forward. This is **expected FP8
  precision**, not a defect. Stage B documented it as "FP8 precision floor",
  which is correct — just not the cause of any user-visible bug.
- The wgpu side of #1864 (PR #1876 multi-step parity gate) remains a real
  improvement: it catches autoregressive drift that would otherwise produce
  silent gibberish on wgpu-only paths.

## Files

- `crates/apr-cli/src/commands/golden_output.rs`: add `stop_tokens: vec![specials.eos_id]`
  to both the CPU `golden_output_gguf_cpu` and GPU `validate_gpu_golden_output`
  gen_configs. Uses `SpecialTokens::qwen2().eos_id` (151645).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Copy link
Copy Markdown
Contributor Author

Subsumed by #1894 (release PR for v0.35.0). The squash-merge into release/v0.35.0 preserves the per-PR commit message and changes — see PR #1894 commit log. Closing as superseded.

@noahgift noahgift closed this May 22, 2026
auto-merge was automatically disabled May 22, 2026 14:59

Pull request was closed

noahgift added a commit that referenced this pull request May 22, 2026
* chore(fmt): cargo fmt --all (release v0.35.0 baseline)

* chore: README drift fix + apr serve syntax (#1873)

* feat(cublas-fp8): Stage A deterministic reproducer (#1884)

* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)

* fix(1864): Golden Output gate must set stop_tokens (#1890)

* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites
+ regenerate Cargo.lock. CHANGELOG v0.35.0 entry captures the 81-commit
release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line
   stop_tokens config gap, not a deep cuBLAS FP8 numerical bug — see
   feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559

1 participant