feat(cublas-fp8-7b-stage-b): per-layer parity dumps + drift signature (SPEC Stage B)#1887
Closed
noahgift wants to merge 2 commits into
SPEC-CUBLAS-FP8-7B-FIX-001 Stage B: uncap the CPU per-layer debug stream so all 28 transformer layers emit hidden-state checksums, write a comparison driver that splits CPU and GPU streams from the same run, and lock the empirical finding: Layer 0 CPU vs cuBLAS Q values differ by ~3e-3 absolute (FP8 single-multiply precision floor), which is quantitative drift, not structural divergence.

## Empirical observation (live, RTX 4090)

Running `cublas_fp8_7b_reproducer` with `CPU_DEBUG_LAYERS=1 GPU_DEBUG_ALL_LAYERS=1`:

- CPU stream: 252 stage lines (28 layers × ~9 stages: RMSNorm, Q/K/V pre-RoPE, Q/K post-RoPE, attn output, FFN gate/up/down, residual)
- GPU stream: 0 `[GH-559]` lines on the cuBLAS-FP8 path. That path takes `run_indexed_layers`, which has no per-layer dump hooks. (Documented as a Stage B gap; Stages C-E sidestep it via direct embed/RMSNorm/QKV parity comparisons rather than per-layer trace.)
- Inferred from [PAR-058-ATTN] log + first-call probe: Layer 0 CPU Q-after-RoPE = [1.2373, 2.8071, -0.8762, ...] vs GPU Q-after-RoPE = [1.2341, 2.8006, -0.8828, ...]. Per-element abs diff ~3e-3, well within FP8 precision floor.

## Bug class diagnosis

The 0.987 logit correlation + linear-fit slope ~0.96 from Stage A is consistent with linear drift accumulation across 28 layers from a ~3e-3 per-step source. NOT a structural bug (no sign flip, no all-zero kernel, no wrong-shape matmul). Stage E will pin down the exact FP8 algorithm/scale source.

## Files

- `crates/aprender-serve/src/gguf/inference/forward/forward_fused_q4k.rs`: drop the `&& layer_idx < 2` and `&& (layer_idx < 2 || layer_idx == 4 || layer_idx == 5)` filters so CPU_DEBUG_LAYERS=1 emits for every layer (4 sites)
- `scripts/cublas_fp8_per_layer_diff.sh`: driver that runs the Stage A reproducer with both debug envs, splits the streams, prints a per-layer scan, writes raw streams + JSON result to `per-layer-trace/<run_id>/`
- `contracts/cublas-fp8-7b-per-layer-parity-v1.yaml` v1.0.0: 2 equations + 2 falsifiers (uncapped CPU stream; quantitative drift signature)
- `.gitignore`: `per-layer-trace/` (run artifacts)

## Depends on

PR #1884 (Stage A reproducer) must merge first; this PR's driver invokes the `cublas_fp8_7b_reproducer` example shipped there.

## Next

After this lands → **Stage C** (embed lookup parity) to confirm/refute the hypothesis that drift originates BEFORE Layer 0 (in the embedding table dequantization).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
Auto-merge was automatically disabled on May 22, 2026 14:59: pull request was closed.
noahgift added a commit that referenced this pull request May 22, 2026
* chore(fmt): cargo fmt --all (release v0.35.0 baseline)
* chore: README drift fix + apr serve syntax (#1873)
* feat(cublas-fp8): Stage A deterministic reproducer (#1884)
* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)
* fix(1864): Golden Output gate must set stop_tokens (#1890)
* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites + regenerate Cargo.lock.

CHANGELOG v0.35.0 entry captures the 81-commit release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line stop_tokens config gap, not a deep cuBLAS FP8 numerical bug; see feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).
SPEC-CUBLAS-FP8-7B-FIX-001 Stage B
Builds on #1884 (Stage A reproducer). Locks the bug class as quantitative drift accumulation, not structural divergence.
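To make "quantitative drift accumulation" concrete, here is a back-of-envelope sketch in Python. It takes the two figures quoted in the commit message (28 layers, ~3e-3 absolute error per step) and shows what they add up to under two simplifying assumptions: errors that all push the same way (linear growth) and errors that are independent (square-root growth). Both models are assumptions for illustration only; they ignore how each layer's residual, norm, and attention actually amplify or damp an upstream error, so treat the numbers as orders of magnitude and not as a prediction of the Stage A slope.

```python
import math

NUM_LAYERS = 28            # from the PR: 28 transformer layers
PER_LAYER_ABS_ERR = 3e-3   # from the PR: ~3e-3 abs diff observed at Layer 0


def accumulated_error(num_layers, per_layer_err):
    """Return (coherent, independent) accumulated error estimates.

    coherent:    every layer adds an error of the same sign (linear growth).
    independent: per-layer errors are uncorrelated (random-walk growth).
    Both are toy models, not a description of the real forward pass.
    """
    coherent = num_layers * per_layer_err
    independent = math.sqrt(num_layers) * per_layer_err
    return coherent, independent


coherent, independent = accumulated_error(NUM_LAYERS, PER_LAYER_ABS_ERR)
print(f"coherent (linear) estimate:   {coherent:.4f}")
print(f"independent (sqrt) estimate:  {independent:.4f}")
```

A systematic source (for example a biased scale) would behave more like the coherent case, which is why a small per-matmul error is still worth chasing across 28 layers.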
What this delivers
- `crates/aprender-serve/src/gguf/inference/forward/forward_fused_q4k.rs`: drop the `&& layer_idx < 2` filters so `CPU_DEBUG_LAYERS=1` emits for ALL layers (4 sites)
- `scripts/cublas_fp8_per_layer_diff.sh`: driver that writes raw streams + JSON result to `per-layer-trace/<run_id>/`
- `contracts/cublas-fp8-7b-per-layer-parity-v1.yaml`: v1.0.0
- `.gitignore`: `per-layer-trace/` (run artifacts)

Live result (RTX 4090, this host)
Drift signature (locked)
Per-element abs diff is ~3e-3 (well within FP8 single-multiply precision floor). Same sign, same order of magnitude — quantitative drift, not structural divergence.
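The drift-signature check can be reproduced by hand on the three Layer 0 Q-after-RoPE elements quoted in the commit message. The sketch below is a minimal Python illustration, not the PR's driver: the function name `drift_signature` and the `structural_tol` cutoff of 1e-2 are hypothetical, chosen only to separate "small same-sign difference" from "gross mismatch". On these three elements the absolute differences come out between 3.2e-3 and 6.6e-3, the same order of magnitude as the quoted ~3e-3.

```python
# Layer 0 Q-after-RoPE values as quoted in the PR (first three elements only).
cpu_q = [1.2373, 2.8071, -0.8762]
gpu_q = [1.2341, 2.8006, -0.8828]


def drift_signature(cpu, gpu, structural_tol=1e-2):
    """Classify a CPU/GPU pair as quantitative drift or structural divergence.

    structural_tol is a hypothetical cutoff, not a value taken from the PR.
    """
    abs_diffs = [abs(c - g) for c, g in zip(cpu, gpu)]
    max_diff = max(abs_diffs)
    # A sign flip on any element would point at a structural bug.
    same_sign = all((c >= 0) == (g >= 0) for c, g in zip(cpu, gpu))
    quantitative = same_sign and max_diff < structural_tol
    return abs_diffs, max_diff, same_sign, quantitative


abs_diffs, max_diff, same_sign, quantitative = drift_signature(cpu_q, gpu_q)
print("abs diffs:", [f"{d:.4f}" for d in abs_diffs])
print("max diff: ", f"{max_diff:.4f}")
print("same sign:", same_sign, "| quantitative drift:", quantitative)
```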
Implication
The 0.987 logit correlation + linear-fit slope ~0.96 (Stage A) is consistent with linear drift accumulation across 28 layers from a ~3e-3 per-step source. This rules out a structural bug (sign flip, all-zero kernel, wrong-shape matmul).
Stage E (Q/K/V FP8 matmul parity at `(3584, batch, 3584)`) is now the primary suspect: most likely the FP8 weight cache calibration scale or cuBLASLt algorithm selection introducing systematic ~3e-3 abs error per matmul.
Known gap (Stage B-G)
The cuBLAS-FP8 forward takes `run_indexed_layers` (no per-layer dumps), while `GPU_DEBUG_ALL_LAYERS=1` only fires on `run_workspace_layers`. Stages C-E sidestep this via direct embed/RMSNorm/QKV parity comparisons rather than relying on per-layer trace.
Test plan
- `[CPU-L<idx>]` blocks now emit for every layer (252 lines on 7B, was capped at ~7)

🤖 Generated with Claude Code
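The test-plan item (every layer emits, 252 lines in total) lends itself to an automated check on the captured CPU stream. The following Python sketch is one way such a check could look; it is not the PR's shell driver. It assumes each stage line carries a `[CPU-L<idx>]` tag, and the rest of the line format (`stage=... checksum=...`) is made up for the synthetic sample, since the real stream layout is not shown here.

```python
import re
from collections import Counter

# Tag format taken from the PR text; everything after the tag is hypothetical.
CPU_TAG = re.compile(r"\[CPU-L(\d+)\]")


def count_stage_lines(lines):
    """Count tagged stage lines per layer index."""
    counts = Counter()
    for line in lines:
        match = CPU_TAG.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def missing_layers(counts, num_layers=28):
    """Return layer indices that emitted no stage lines at all."""
    return [idx for idx in range(num_layers) if counts[idx] == 0]


# Synthetic stream: 28 layers x 9 stages, plus one unrelated line to skip.
sample = [
    f"[CPU-L{layer}] stage={stage} checksum=0.0"
    for layer in range(28)
    for stage in range(9)
]
sample.append("[GH-559] unrelated GPU line")

counts = count_stage_lines(sample)
print("total stage lines:", sum(counts.values()))
print("missing layers:   ", missing_layers(counts))
```

Feeding it a stream truncated to the first two layers would report layers 2 through 27 as missing, which is the kind of capped output the PR describes for the old behavior.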