spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882
Closed
noahgift wants to merge 2 commits into
…rish

Authored 2026-05-22 after a 1.5-hour bisect + layer-trace investigation on noah-Lambda-Vector (RTX 4090) surfaced that the cuBLAS FP8 7B Q4K gibberish (`<|im_start|>` repeats) is NOT a recent regression but a pre-existing fragile path that's been broken across multiple releases (v0.31.2, v0.33.0, v0.34.0 all reproduce).

The initial `git bisect run` identified `8bd4ce5a` as the first bad commit, but a retest at v0.31.2 showed the same gibberish. The bisection was invalid because the test signal is non-deterministic (cuBLAS context poisoning recovery masks the symptom across some runs).

Per `feedback_release_only_after_bug_hunt`, the v0.35.0 release tag holds until this epic discharges. Individual fix PRs (#1867 #1868 #1870 #1872 #1873 #1875 #1876 #1878) continue to land on main as net-positive improvements.

## 6-stage falsifier cascade (per feedback_falsifier_cascade_decomposes_magnitude)

- **Stage A**: deterministic reproducer (`cublas_fp8_7b_reproducer.rs`): make the bug visible on 5/5 consecutive runs
- **Stage B**: per-layer parity instrumentation: `APR_PER_LAYER_PARITY_DUMP=1` writes 28 layer JSONs with CPU vs GPU checksums + cosine, so the first divergent layer becomes greppable
- **Stage C**: embed lookup parity (token_id → embedding vector)
- **Stage D**: pre-attention RMSNorm parity (eps mismatch is a known class, PMAT-698n)
- **Stage E**: Q/K/V FP8 matmul parity at shape (3584, batch, 3584), the most likely root cause space per the layer-0 trace
- **Stage F**: actual fix + contract amendment + v0.35.0 unblock

Each PR ships ~50-200 LOC. Each falsifier discharges with empirical evidence on the 7B Q4K teacher. Estimate 2-5 days of focused work.
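The Stage B idea (a checksum and a cosine per layer, then find the first layer that diverges) can be sketched in a few lines of Rust. This is a minimal illustration only: the `LayerParity` struct, its field names, the function names, and the `0.999` threshold are all assumptions made for the example, not the format that `APR_PER_LAYER_PARITY_DUMP=1` actually writes.

```rust
/// Hypothetical per-layer record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct LayerParity {
    layer: usize,
    cpu_checksum: f64,
    gpu_checksum: f64,
    cosine: f64,
}

/// Plain sum of the activations, accumulated in f64 to limit rounding error.
fn checksum(x: &[f32]) -> f64 {
    x.iter().map(|&v| v as f64).sum()
}

/// Cosine similarity between two activation vectors of equal length.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "activation lengths must match");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        // Convention for this sketch: two all-zero vectors agree, otherwise they do not.
        return if na == nb { 1.0 } else { 0.0 };
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Build one parity record per layer from paired CPU and GPU activations.
fn compare_layers(cpu: &[Vec<f32>], gpu: &[Vec<f32>]) -> Vec<LayerParity> {
    cpu.iter()
        .zip(gpu)
        .enumerate()
        .map(|(layer, (c, g))| LayerParity {
            layer,
            cpu_checksum: checksum(c),
            gpu_checksum: checksum(g),
            cosine: cosine(c, g),
        })
        .collect()
}

/// Index of the first layer whose cosine falls below `min_cosine`, if any.
fn first_divergent(report: &[LayerParity], min_cosine: f64) -> Option<usize> {
    report.iter().find(|p| p.cosine < min_cosine).map(|p| p.layer)
}

fn main() {
    // Toy activations: layer 0 matches, layer 1 does not.
    let cpu = vec![vec![1.0f32, 2.0, 3.0], vec![0.5, -1.0, 2.0]];
    let gpu = vec![vec![1.0f32, 2.0, 3.0], vec![2.0, 0.5, -1.0]];
    let report = compare_layers(&cpu, &gpu);
    for p in &report {
        println!("{:?}", p);
    }
    println!("first divergent layer: {:?}", first_divergent(&report, 0.999));
}
```

Cosine similarity is insensitive to a uniform positive scale factor, so a GPU output that is merely a rescaled copy of the CPU output would still score close to 1.0. Keeping a checksum next to the cosine is one way to catch that case.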
## Why this is an epic, not a one-PR fix

Tried in the initial investigation (all unsuccessful):

- `git bisect run` (invalid: bug is intermittent)
- Layer-by-layer trace (showed Layer 0 already diverges, but not which op)
- Per-file reverts: cublas.rs math mode, cublas_prefill/attention.rs, weights.rs, flash_decoding_graphed.rs, rms_norm.rs (backward)
- Dependency archaeology: trueno-gpu v0.4.36 + aprender-compute re-export shim (code itself didn't change)

The bug is pre-existing AND deep, likely cuBLASLt FP8 GEMM algorithm selection or FP8 scale calibration for hidden_dim=3584. Per the 5-whys for #1864, it's invisible until `apr qa` Golden Output is wired into CI (currently only fires on manual `/dogfood`).

## Out of scope

- 30B-MoE inference (#1583, separate epic)
- wgpu kernel-level fix (covered by #1864 sub-issue post-#1876)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing as SUPERSEDED by PR #1890. The 5-line stop_tokens fix in #1890 resolves #1864 — Stages C-F of this spec are unnecessary. Stage A (PR #1884) and Stage B (PR #1887) remain useful as general-purpose cuBLAS FP8 diagnostics and merge independently. See |
auto-merge was automatically disabled
May 22, 2026 14:24
Pull request was closed
Summary
After a 1.5-hour bisect + layer-trace investigation on 2026-05-22, the cuBLAS FP8 7B Q4K `<|im_start|>` gibberish is confirmed to be not a recent regression but a pre-existing fragile path broken across multiple releases (v0.31.2, v0.33.0, v0.34.0 all reproduce on the same RTX 4090).

This PR adds `docs/specifications/SPEC-CUBLAS-FP8-7B-FIX-001.md`, a 6-stage falsifier cascade epic to root-cause it, per `memory/feedback_falsifier_cascade_decomposes_magnitude.md`.

Per-user decision (2026-05-22)
v0.35.0 release tag + crates.io publish cascade are held until this epic discharges. The 8 individual fix PRs already in flight (#1867 #1868 #1870 #1872 #1873 #1875 #1876 #1878) continue to land on main as net-positive improvements.
Why bisection was invalid
`git bisect run` named `8bd4ce5a` (the monorepo consolidation commit). Re-running the oracle at v0.31.2 (the "good" baseline) showed the same `<|im_start|>` gibberish. The Golden Output gate signal is non-deterministic; CUDA context poisoning recovery masks the symptom across some runs. Per `memory/feedback_test_methodology_can_fake_bugs.md`.

What we know
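- `apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf`
- `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish`
- `GPU ≈ 0.96 × CPU + 0.12`
- `trueno-gpu v0.4.36`

The 6 stages

- Stage A: deterministic reproducer (`cublas_fp8_7b_reproducer.rs`)
- Stage B: per-layer parity instrumentation (`APR_PER_LAYER_PARITY_DUMP=1`)
- Stage C: embed lookup parity
- Stage D: pre-attention RMSNorm parity
- Stage E: Q/K/V FP8 matmul parity at shape `(3584, batch, 3584)`
- Stage F: actual fix + contract amendment + v0.35.0 unblock

Each PR ~50-200 LOC. Each falsifier LIVE-DISCHARGED with empirical evidence on the 7B Q4K teacher.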
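Stage A asks for a reproducer that fails on 5/5 consecutive runs, which is also the property the earlier bisect oracle lacked. A minimal sketch of that check follows, assuming the oracle can be wrapped as a closure returning `true` on pass. The `Signal` enum and `classify` function are hypothetical names for illustration; how the real Golden Output check is invoked and how it reports failure is not specified here.

```rust
/// Outcome of running the same oracle several times on the same commit.
#[derive(Debug, PartialEq, Eq)]
enum Signal {
    AlwaysPass,
    AlwaysFail,
    /// Mixed results: not safe to use as a bisect oracle.
    Flaky { passes: usize, fails: usize },
}

/// Run `oracle` `runs` times and classify how stable its verdict is.
fn classify<F: FnMut() -> bool>(runs: usize, mut oracle: F) -> Signal {
    assert!(runs > 0, "need at least one run");
    let passes = (0..runs).filter(|_| oracle()).count();
    match passes {
        p if p == runs => Signal::AlwaysPass,
        0 => Signal::AlwaysFail,
        p => Signal::Flaky { passes: p, fails: runs - p },
    }
}

fn main() {
    // Stand-in oracle that fails every time, like the 5/5 target for Stage A.
    println!("{:?}", classify(5, || false));

    // Stand-in oracle that alternates, like a signal masked on some runs.
    let mut n = 0;
    println!("{:?}", classify(5, || {
        n += 1;
        n % 2 == 0
    }));
}
```

Only an `AlwaysPass` or `AlwaysFail` verdict at both the "good" and "bad" endpoints makes a bisect meaningful; a `Flaky` result at either end means the first-bad-commit answer can be an artifact of the test rather than of the code.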
Test plan
🤖 Generated with Claude Code