feat(cublas-fp8-7b-stage-b): per-layer parity dumps + drift signature (SPEC Stage B) by noahgift · Pull Request #1887 · paiml/aprender

noahgift · 2026-05-22T13:56:35Z

SPEC-CUBLAS-FP8-7B-FIX-001 Stage B

Builds on #1884 (Stage A reproducer). Locks the bug class as quantitative drift accumulation, not structural divergence.

What this delivers

| Artifact | Purpose |
|---|---|
| `crates/aprender-serve/src/gguf/inference/forward/forward_fused_q4k.rs` | Drop `&& layer_idx < 2` filters so `CPU_DEBUG_LAYERS=1` emits for ALL layers (4 sites) |
| `scripts/cublas_fp8_per_layer_diff.sh` | Driver: runs Stage A reproducer with both debug envs, splits CPU and GPU streams, writes raw + JSON to `per-layer-trace/<run_id>/` |
| `contracts/cublas-fp8-7b-per-layer-parity-v1.yaml` v1.0.0 | 2 equations + 2 falsifiers |
| `.gitignore` | `per-layer-trace/` (run artifacts) |

Live result (RTX 4090, this host)

$ MODEL=/home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    bash scripts/cublas_fp8_per_layer_diff.sh
== Layer-stream sizes ==
  CPU lines: 252      (28 layers × ~9 stages)
  GPU lines: 0        (cuBLAS-FP8 takes `run_indexed_layers`, no GH-559 hooks)
== Stage B verdict ==
{"cpu_argmax_idx":75311, ..., "gpu_argmax_idx":1057, ...,
 "correlation":0.986986, ..., "agrees_with_cpu":false}

Drift signature (locked)

Stage 0: Layer 0 Q after RoPE, first 5 elements.

| Element | CPU value | cuBLAS-FP8 GPU value | abs diff |
|---|---|---|---|
| [0] | 1.2373 | 1.2341 | ~3e-3 |
| [1] | 2.8071 | 2.8006 | ~6e-3 |
| [2] | -0.8762 | -0.8828 | ~7e-3 |
| [3] | 1.3840 | 1.3830 | ~1e-3 |
| [4] | 1.1321 | 1.1327 | ~6e-4 |

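Per-element abs diff ranges from ~6e-4 to ~7e-3 across these five elements (mean ~3.6e-3, quoted below as ~3e-3), well within the FP8 single-multiply precision floor. Signs match and every pair agrees to under 1% relative error, which indicates quantitative drift, not structural divergence.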
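The figures in the table can be re-derived directly from the five CPU/GPU pairs. The following is a minimal sketch that uses only the values quoted above; the variable names are illustrative and not taken from the driver script.

```python
# Layer 0 Q-after-RoPE values copied from the drift-signature table.
cpu = [1.2373, 2.8071, -0.8762, 1.3840, 1.1321]
gpu = [1.2341, 2.8006, -0.8828, 1.3830, 1.1327]

# Per-element absolute and relative differences.
abs_diff = [abs(c - g) for c, g in zip(cpu, gpu)]
rel_diff = [d / abs(c) for d, c in zip(abs_diff, cpu)]

max_diff = max(abs_diff)
mean_diff = sum(abs_diff) / len(abs_diff)

# "Same sign" check: a sign flip would point to a structural bug instead.
same_sign = all((c > 0) == (g > 0) for c, g in zip(cpu, gpu))

for i, (d, r) in enumerate(zip(abs_diff, rel_diff)):
    print(f"[{i}] abs diff {d:.1e}  rel diff {r:.2%}")
print(f"max {max_diff:.1e}  mean {mean_diff:.1e}  same_sign={same_sign}")
```

Five elements are a small sample, so the mean is only a rough indicator of the per-element error on the full Q vector.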

Implication

The 0.987 logit correlation + linear-fit slope ~0.96 (Stage A) is consistent with linear drift accumulation across 28 layers from a ~3e-3 per-step source. This rules out:

- Sign-flip kernel bugs
- Wrong-shape matmul
- All-zero weight cells

The Q/K/V FP8 matmul at (3584, batch, 3584), which Stage E will check for parity, is now the primary suspect. The most likely cause is the FP8 weight cache calibration scale or cuBLASLt algorithm selection introducing a systematic ~3e-3 abs error per matmul.
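A high logit correlation and a differing argmax can coexist, which is the combination the Stage B verdict reports (`correlation` 0.986986, `agrees_with_cpu` false). The sketch below illustrates this with hypothetical toy logits that are not taken from the run. It assumes that `agrees_with_cpu` means argmax equality, which the PR does not state explicitly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

# Hypothetical logits: the two top candidates are close, so a small
# per-element drift is enough to swap their order.
cpu_logits = [-4.0, -2.0, 0.0, 2.0, 5.0, 5.2]
gpu_logits = [-3.9, -2.1, 0.1, 2.1, 5.3, 5.0]

r = pearson(cpu_logits, gpu_logits)
agrees = argmax(cpu_logits) == argmax(gpu_logits)
print(f"correlation={r:.4f} agrees_with_cpu={agrees}")
```

Correlation summarizes the whole vector, while greedy decoding depends only on the ordering of the top entries, so a correlation near 1 does not by itself guarantee matching tokens.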

Known gap (Stage B-G)

The cuBLAS-FP8 forward takes `run_indexed_layers` (no per-layer dumps), while `GPU_DEBUG_ALL_LAYERS=1` only fires on `run_workspace_layers`. Stages C-E sidestep this via direct embed/RMSNorm/QKV parity comparisons rather than relying on per-layer trace.
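The stream split that exposes this gap can be sketched as follows. The tags `[CPU-L<idx>]` and `[GH-559]` come from this PR; everything else is an assumption for illustration, including the rest of each log line, the tag appearing at the start of CPU lines, and the function names. The real driver is a shell script whose internals are not shown here.

```python
import re

CPU_TAG = re.compile(r"^\[CPU-L(\d+)\]")  # assumed to start each CPU line
GPU_TAG = "[GH-559]"

def split_streams(lines):
    """Partition combined debug output into CPU and GPU per-layer lines."""
    cpu, gpu = [], []
    for line in lines:
        if CPU_TAG.match(line):
            cpu.append(line)
        elif GPU_TAG in line:
            gpu.append(line)
    return cpu, gpu

def layers_seen(cpu_lines):
    """Sorted layer indices that emitted at least one CPU line."""
    return sorted({int(CPU_TAG.match(l).group(1)) for l in cpu_lines})

# Hypothetical log excerpt; payload text after each tag is made up.
sample = [
    "loading model",
    "[CPU-L0] q_post_rope ...",
    "[CPU-L1] q_post_rope ...",
    "[CPU-L27] residual ...",
]

cpu, gpu = split_streams(sample)
print(f"CPU lines: {len(cpu)}  GPU lines: {len(gpu)}  layers: {layers_seen(cpu)}")
```

An empty GPU list in such a split reflects missing dump hooks on the path taken, not an absence of GPU computation.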

Test plan

- CPU `[CPU-L<idx>]` blocks now emit for every layer (252 lines on 7B, was capped at ~7)
- Comparison script runs end-to-end and writes structured output
- Contract YAML lint-passes
- Stage A determinism preserved (same FNV-1a + argmax across runs)
- CI: workspace-test, fmt, contracts-lib
- Depends on PR #1884 (Stage A: deterministic reproducer for #1864) landing first
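The determinism check in the test plan compares an FNV-1a hash and an argmax across runs. Below is a minimal sketch of that kind of comparison using the standard 64-bit FNV-1a constants. The hash width, the bytes being hashed, and the helper `is_deterministic` are assumptions, since the PR does not say what the Stage A reproducer hashes.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3        # standard FNV 64-bit prime

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def is_deterministic(runs):
    """True if every (output_bytes, argmax) run matches the first one."""
    first = (fnv1a_64(runs[0][0]), runs[0][1])
    return all((fnv1a_64(out), am) == first for out, am in runs)

# Hypothetical run outputs: identical bytes and argmax across two runs.
runs = [(b"logits-bytes", 75311), (b"logits-bytes", 75311)]
print(hex(fnv1a_64(b"a")), is_deterministic(runs))
```

Hashing the raw output bytes makes the check sensitive to any bit-level change, which is stricter than comparing the argmax alone.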

🤖 Generated with Claude Code

SPEC-CUBLAS-FP8-7B-FIX-001 Stage B: uncap the CPU per-layer debug stream so all 28 transformer layers emit hidden-state checksums, write a comparison driver that splits CPU and GPU streams from the same run, and lock the empirical finding: Layer 0 CPU vs cuBLAS Q values differ by ~3e-3 absolute (FP8 single-multiply precision floor), which is quantitative drift, not structural divergence.

## Empirical observation (live, RTX 4090)

Running `cublas_fp8_7b_reproducer` with `CPU_DEBUG_LAYERS=1 GPU_DEBUG_ALL_LAYERS=1`:

- CPU stream: 252 stage lines (28 layers × ~9 stages: RMSNorm, Q/K/V pre-RoPE, Q/K post-RoPE, attn output, FFN gate/up/down, residual)
- GPU stream: 0 `[GH-559]` lines on the cuBLAS-FP8 path. That path takes `run_indexed_layers`, which has no per-layer dump hooks. (Documented as a Stage B gap; Stages C-E sidestep it via direct embed/RMSNorm/QKV parity comparisons rather than per-layer trace.)
- Inferred from [PAR-058-ATTN] log + first-call probe: Layer 0 CPU Q-after-RoPE = [1.2373, 2.8071, -0.8762, ...] vs GPU Q-after-RoPE = [1.2341, 2.8006, -0.8828, ...]. Per-element abs diff ~3e-3, well within FP8 precision floor.

## Bug class diagnosis

The 0.987 logit correlation + linear-fit slope ~0.96 from Stage A is consistent with linear drift accumulation across 28 layers from a ~3e-3 per-step source. NOT a structural bug (no sign flip, no all-zero kernel, no wrong-shape matmul). Stage E will pin down the exact FP8 algorithm/scale source.

## Files

- `crates/aprender-serve/src/gguf/inference/forward/forward_fused_q4k.rs`: drop the `&& layer_idx < 2` and `&& (layer_idx < 2 || layer_idx == 4 || layer_idx == 5)` filters so `CPU_DEBUG_LAYERS=1` emits for every layer (4 sites)
- `scripts/cublas_fp8_per_layer_diff.sh`: driver that runs the Stage A reproducer with both debug envs, splits the streams, prints a per-layer scan, writes raw streams + JSON result to `per-layer-trace/<run_id>/`
- `contracts/cublas-fp8-7b-per-layer-parity-v1.yaml` v1.0.0: 2 equations + 2 falsifiers (uncapped CPU stream; quantitative drift signature)
- `.gitignore`: `per-layer-trace/` (run artifacts)

## Depends on

PR #1884 (Stage A reproducer) must merge first; this PR's driver invokes the `cublas_fp8_7b_reproducer` example shipped there.

## Next

After this lands → **Stage C** (embed lookup parity) to confirm/refute the hypothesis that drift originates BEFORE Layer 0 (in the embedding table dequantization).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-22T14:59:04Z

Subsumed by #1894 (release PR for v0.35.0). The squash-merge into release/v0.35.0 preserves the per-PR commit message and changes — see PR #1894 commit log. Closing as superseded.

* chore(fmt): cargo fmt --all (release v0.35.0 baseline)
* chore: README drift fix + apr serve syntax (#1873)
* feat(cublas-fp8): Stage A deterministic reproducer (#1884)
* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)
* fix(1864): Golden Output gate must set stop_tokens (#1890)
* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites + regenerate Cargo.lock.

CHANGELOG v0.35.0 entry captures the 81-commit release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line stop_tokens config gap, not a deep cuBLAS FP8 numerical bug; see feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).

noahgift enabled auto-merge (squash) May 22, 2026 13:56

This was referenced May 22, 2026

fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS #1890

Closed

spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882

Closed

Merge branch 'main' into feat/cublas-fp8-7b-stage-b-per-layer-trace

fdb577c

noahgift mentioned this pull request May 22, 2026

release: v0.35.0 (subsumes #1873, #1884, #1887, #1890) #1894

Merged

5 tasks

noahgift closed this May 22, 2026

auto-merge was automatically disabled May 22, 2026 14:59
Pull request was closed

noahgift wants to merge 2 commits into main from feat/cublas-fp8-7b-stage-b-per-layer-trace
