
feat(cublas-fp8-7b-stage-b): per-layer parity dumps + drift signature (SPEC Stage B)#1887

Closed
noahgift wants to merge 2 commits into main from feat/cublas-fp8-7b-stage-b-per-layer-trace


@noahgift
Contributor

SPEC-CUBLAS-FP8-7B-FIX-001 Stage B

Builds on #1884 (Stage A reproducer). Locks the bug class as quantitative drift accumulation, not structural divergence.

What this delivers

| Artifact | Purpose |
| --- | --- |
| `crates/aprender-serve/src/gguf/inference/forward/forward_fused_q4k.rs` | Drop `&& layer_idx < 2` filters so `CPU_DEBUG_LAYERS=1` emits for ALL layers (4 sites) |
| `scripts/cublas_fp8_per_layer_diff.sh` | Driver: runs Stage A reproducer with both debug envs, splits CPU and GPU streams, writes raw + JSON to `per-layer-trace/<run_id>/` |
| `contracts/cublas-fp8-7b-per-layer-parity-v1.yaml` v1.0.0 | 2 equations + 2 falsifiers |
| `.gitignore` | `per-layer-trace/` (run artifacts) |

Live result (RTX 4090, this host)

$ MODEL=/home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    bash scripts/cublas_fp8_per_layer_diff.sh
== Layer-stream sizes ==
  CPU lines: 252      (28 layers × ~9 stages)
  GPU lines: 0        (cuBLAS-FP8 takes `run_indexed_layers`, no GH-559 hooks)
== Stage B verdict ==
{"cpu_argmax_idx":75311, ..., "gpu_argmax_idx":1057, ...,
 "correlation":0.986986, ..., "agrees_with_cpu":false}

Drift signature (locked)

| Stage 0 (Layer 0 Q after RoPE, first 5 elements) | CPU value | cuBLAS-FP8 GPU value | abs diff |
| --- | --- | --- | --- |
| [0] | 1.2373 | 1.2341 | ~3e-3 |
| [1] | 2.8071 | 2.8006 | ~6e-3 |
| [2] | -0.8762 | -0.8828 | ~7e-3 |
| [3] | 1.3840 | 1.3830 | ~1e-3 |
| [4] | 1.1321 | 1.1327 | ~6e-4 |

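Per-element abs diff ranges from ~6e-4 to ~7e-3 across these five elements (mean ~3.6e-3, quoted elsewhere in this PR as ~3e-3), well within the FP8 single-multiply precision floor. All five pairs agree in sign and order of magnitude: quantitative drift, not structural divergence.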
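
The table's last column can be re-derived from the two value columns. The following is a minimal sketch of that arithmetic, using only the five CPU/GPU pairs listed above; it is not part of the PR's driver script.

```python
# Values copied from the Layer 0 Q-after-RoPE table in this PR.
cpu = [1.2373, 2.8071, -0.8762, 1.3840, 1.1321]
gpu = [1.2341, 2.8006, -0.8828, 1.3830, 1.1327]

# Element-wise absolute difference between the CPU and GPU values.
abs_diff = [abs(c - g) for c, g in zip(cpu, gpu)]

# A structural bug such as a sign flip would show up as a sign mismatch.
same_sign = all((c > 0) == (g > 0) for c, g in zip(cpu, gpu))

max_diff = max(abs_diff)
mean_diff = sum(abs_diff) / len(abs_diff)

for i, d in enumerate(abs_diff):
    print(f"[{i}] abs diff = {d:.1e}")
print(f"same sign: {same_sign}, max = {max_diff:.1e}, mean = {mean_diff:.1e}")
```

Five elements are a small sample, so the mean here should be read as an indication rather than a measured per-layer error.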

Implication

The 0.987 logit correlation + linear-fit slope ~0.96 (Stage A) is consistent with linear drift accumulation across 28 layers from a ~3e-3 per-step source. This rules out:

  • Sign-flip kernel bugs
  • Wrong-shape matmul
  • All-zero weight cells

Stage E (Q/K/V FP8 matmul parity at (3584, batch, 3584)) is now the primary suspect — most likely the FP8 weight cache calibration scale or cuBLASLt algorithm selection introducing systematic ~3e-3 abs error per matmul.
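
A high logit correlation can coexist with a different argmax, which is the combination the Stage B verdict reports (`correlation` 0.987 with `agrees_with_cpu` false). The sketch below shows how such a correlation and linear-fit slope might be computed. The logit vectors are synthetic and chosen only to illustrate the effect; they are not taken from the reproducer, and the helper names are hypothetical.

```python
import math

def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y regressed on x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy), sxy / sxx

def argmax(v):
    """Index of the largest element."""
    return max(range(len(v)), key=v.__getitem__)

# Synthetic logits (NOT from the PR): the top two candidates are nearly tied.
cpu_logits = [10.0, 9.9, 2.0, -1.0, 0.5, -3.0]
# A slightly compressed, slightly perturbed copy that swaps the top two.
gpu_logits = [9.5, 9.6, 1.9, -0.9, 0.6, -2.9]

r, slope = pearson_and_slope(cpu_logits, gpu_logits)
agrees = argmax(cpu_logits) == argmax(gpu_logits)
print(f"r = {r:.4f}, slope = {slope:.4f}, argmax agrees: {agrees}")
```

Correlation measures global agreement across the whole vocabulary, while argmax depends only on the ordering of the top candidates, so a small uniform compression can flip the prediction when the leaders are close.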

Known gap (Stage B-G)

The cuBLAS-FP8 forward takes run_indexed_layers (no per-layer dumps), while GPU_DEBUG_ALL_LAYERS=1 only fires on run_workspace_layers. Stages C-E sidestep this via direct embed/RMSNorm/QKV parity comparisons rather than relying on per-layer trace.

Test plan

🤖 Generated with Claude Code

SPEC-CUBLAS-FP8-7B-FIX-001 Stage B: uncap the CPU per-layer debug stream
so all 28 transformer layers emit hidden-state checksums, write a
comparison driver that splits CPU and GPU streams from the same run,
and lock the empirical finding: Layer 0 CPU vs cuBLAS Q values differ
by ~3e-3 absolute (FP8 single-multiply precision floor) — quantitative
drift, not structural divergence.

## Empirical observation (live, RTX 4090)

Running `cublas_fp8_7b_reproducer` with `CPU_DEBUG_LAYERS=1 GPU_DEBUG_ALL_LAYERS=1`:

- CPU stream: 252 stage lines (28 layers × ~9 stages — RMSNorm, Q/K/V pre-RoPE,
  Q/K post-RoPE, attn output, FFN gate/up/down, residual)
- GPU stream: 0 `[GH-559]` lines on the cuBLAS-FP8 path — that path takes
  `run_indexed_layers` which has no per-layer dump hooks. (Documented as
  a Stage B gap; Stages C-E sidestep it via direct embed/RMSNorm/QKV
  parity comparisons rather than per-layer trace.)
- Inferred from [PAR-058-ATTN] log + first-call probe: Layer 0
  CPU Q-after-RoPE = [1.2373, 2.8071, -0.8762, ...] vs
  GPU Q-after-RoPE = [1.2341, 2.8006, -0.8828, ...].
  Per-element abs diff ~3e-3, well within FP8 precision floor.
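
For scale, the sketch below computes the grid spacing of an 8-bit float near the Q values quoted above. It assumes the E4M3 layout (4 exponent bits, 3 mantissa bits); the PR does not say which FP8 format the cuBLAS path uses, so this is an assumption. The helpers are hypothetical and ignore subnormals, saturation, and NaN.

```python
import math

def e4m3_step(x: float) -> float:
    """Spacing between adjacent E4M3 values around x (normal range only)."""
    _, e = math.frexp(abs(x))   # abs(x) = m * 2**e with 0.5 <= m < 1
    return 2.0 ** (e - 1 - 3)   # 3 explicit mantissa bits

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest point on the E4M3 grid (ties to even)."""
    step = e4m3_step(x)
    return round(x / step) * step

for v in (1.2373, 2.8071, -0.8762):
    q = quantize_e4m3(v)
    print(f"{v:+.4f} -> {q:+.4f} (step {e4m3_step(v)}, error {abs(v - q):.4f})")
```

Under that assumption the grid step near 1.2 is 0.125, so rounding a single value can already move it by more than the ~3e-3 gap between the CPU and GPU outputs. A real FP8 matmul applies scale factors and accumulates at higher precision, so this is only an order-of-magnitude comparison.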

## Bug class diagnosis

The 0.987 logit correlation + linear-fit slope ~0.96 from Stage A is
consistent with linear drift accumulation across 28 layers from a
~3e-3 per-step source. NOT a structural bug (no sign flip, no all-zero
kernel, no wrong-shape matmul). Stage E will pin down the exact FP8
algorithm/scale source.
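
As a rough illustration of what accumulating a ~3e-3 per-step error over 28 layers could mean, the toy calculation below compares two simple bounds. It treats the final error as a plain sum of per-layer contributions, which is a simplifying assumption and not a model of the actual transformer forward pass.

```python
import math

layers = 28      # from this PR: 28 transformer layers
per_step = 3e-3  # from this PR: approximate per-element abs diff at Layer 0

# Every step errs in the same direction (systematic bias): grows like N.
linear_bound = layers * per_step

# Independent zero-mean errors (random walk): grows like sqrt(N).
random_walk = math.sqrt(layers) * per_step

print(f"linear: {linear_bound:.3f}, random walk: {random_walk:.4f}")
```

A systematic source such as a miscalibrated scale would sit closer to the linear case, while unbiased rounding noise would sit closer to the square-root case.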

## Files

- `crates/aprender-serve/src/gguf/inference/forward/forward_fused_q4k.rs`
  — drop the `&& layer_idx < 2` and `&& (layer_idx < 2 || layer_idx == 4 || layer_idx == 5)`
    filters so CPU_DEBUG_LAYERS=1 emits for every layer (4 sites)
- `scripts/cublas_fp8_per_layer_diff.sh` — driver that runs the Stage A
  reproducer with both debug envs, splits the streams, prints a
  per-layer scan, writes raw streams + JSON result to `per-layer-trace/<run_id>/`
- `contracts/cublas-fp8-7b-per-layer-parity-v1.yaml` v1.0.0 — 2 equations
  + 2 falsifiers (uncapped CPU stream; quantitative drift signature)
- `.gitignore` — `per-layer-trace/` (run artifacts)
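
The stream-splitting step performed by the driver could look roughly like the sketch below. This is not the script's actual implementation (the driver is a bash script whose contents are not shown here). The `[GH-559]` marker is taken from this PR's text; the `[CPU-DEBUG]` marker, the sample log lines, and the function name are hypothetical placeholders.

```python
import json

def split_streams(lines, cpu_tag="[CPU-DEBUG]", gpu_tag="[GH-559]"):
    """Partition log lines into (cpu, gpu) lists by marker substring.

    gpu_tag comes from the PR text; cpu_tag is a hypothetical placeholder.
    Lines carrying neither marker are dropped.
    """
    cpu, gpu = [], []
    for line in lines:
        if gpu_tag in line:
            gpu.append(line)
        elif cpu_tag in line:
            cpu.append(line)
    return cpu, gpu

# Hypothetical log excerpt; the real line format is not shown in the PR.
sample_log = [
    "[CPU-DEBUG] layer=0 stage=rmsnorm",
    "[CPU-DEBUG] layer=0 stage=q_post_rope",
    "loading model ...",
]

cpu_lines, gpu_lines = split_streams(sample_log)
summary = json.dumps({"cpu_lines": len(cpu_lines), "gpu_lines": len(gpu_lines)})
print(summary)
```

An empty GPU list mirrors the `GPU lines: 0` case reported for the cuBLAS-FP8 path, where no per-layer hooks fire.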

## Depends on

PR #1884 (Stage A reproducer) must merge first; this PR's driver invokes
the `cublas_fp8_7b_reproducer` example shipped there.

## Next

After this lands → **Stage C** (embed lookup parity) to confirm/refute
the hypothesis that drift originates BEFORE Layer 0 (in the embedding
table dequantization).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Contributor Author

Subsumed by #1894 (release PR for v0.35.0). The squash-merge into release/v0.35.0 preserves the per-PR commit message and changes — see PR #1894 commit log. Closing as superseded.

@noahgift noahgift closed this May 22, 2026
auto-merge was automatically disabled May 22, 2026 14:59


noahgift added a commit that referenced this pull request May 22, 2026
* chore(fmt): cargo fmt --all (release v0.35.0 baseline)

* chore: README drift fix + apr serve syntax (#1873)

* feat(cublas-fp8): Stage A deterministic reproducer (#1884)

* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)

* fix(1864): Golden Output gate must set stop_tokens (#1890)

* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites
+ regenerate Cargo.lock. CHANGELOG v0.35.0 entry captures the 81-commit
release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line
   stop_tokens config gap, not a deep cuBLAS FP8 numerical bug — see
   feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).