135 changes: 135 additions & 0 deletions docs/specifications/SPEC-CUBLAS-FP8-7B-FIX-001.md
@@ -0,0 +1,135 @@
# SPEC-CUBLAS-FP8-7B-FIX-001 — Root-cause + fix the cuBLAS FP8 7B Q4K gibberish

| Field | Value |
|---|---|
| Status | **PROPOSED** |
| Owner | (assigned at kickoff) |
| Created | 2026-05-22 |
| Last updated | 2026-05-22 |
| Tracks | [#1864](https://github.com/paiml/aprender/issues/1864) |
| Blocks | v0.35.0 release tag + crates.io publish cascade |
| Estimate | 2-5 days of focused work (6-stage falsifier cascade) |

## Why this is an epic, not a one-PR fix

The 2026-05-22 dogfood session surfaced `<|im_start|>` gibberish from the `apr qa` Golden Output gate on `qwen2.5-coder-7b-instruct-q4_k_m.gguf` via the cuBLAS FP8 path. The initial investigation (1.5 hours, same day) tried four approaches:

1. **`git bisect run`** with `apr qa` Golden Output as the oracle — identified commit `8bd4ce5a` (monorepo consolidation, 17,830 files / 13.7M insertions). Re-running the oracle on v0.31.2 (the "good" baseline from the bisect) **also produced the same gibberish** — the bisect was invalid because the test signal was non-deterministic.
2. **Layer-by-layer trace** via `cargo run --example layer_by_layer_trace --release -p aprender-serve --features cuda` — showed Layer 0 Q/K inputs already differ between CPU and cuBLAS GPU. Logit correlation 0.987 (high); linear fit `GPU ≈ 0.96 × CPU + 0.12`. Top residuals concentrated at vocab positions 15-22.
3. **Targeted-file reverts** — `cublas.rs` math mode (PEDANTIC ↔ DEFAULT ↔ TF32), `cublas_prefill/attention.rs`, `weights.rs`, `flash_decoding_graphed.rs`, `rms_norm.rs` (backward, training-only). None of these alone restored Golden Output.
4. **Dependency archaeology** — `trueno-gpu v0.4.36` was bit-identical at v0.31.2 and 8bd4ce5a (same crates.io checksum). `aprender-compute` re-export shim points are also consistent. The cuBLAS code itself didn't change.
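The correlation and linear fit quoted in item 2 can be recomputed from any pair of logit dumps. The following is a minimal sketch, assuming both backends' logits are available as equal-length `f32` slices; `linear_fit` is a hypothetical helper name and is not part of `layer_by_layer_trace`.

```rust
/// Least-squares fit `gpu ≈ slope * cpu + intercept` plus Pearson correlation.
/// Returns `(slope, intercept, correlation)`, or `None` when the inputs are
/// empty, differ in length, or either side is constant.
pub fn linear_fit(cpu: &[f32], gpu: &[f32]) -> Option<(f64, f64, f64)> {
    if cpu.is_empty() || cpu.len() != gpu.len() {
        return None;
    }
    let n = cpu.len() as f64;
    let mean_x = cpu.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let mean_y = gpu.iter().map(|&v| f64::from(v)).sum::<f64>() / n;

    // Accumulate centered second moments in f64 to limit rounding error.
    let (mut sxx, mut syy, mut sxy) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in cpu.iter().zip(gpu) {
        let (dx, dy) = (f64::from(x) - mean_x, f64::from(y) - mean_y);
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    let slope = sxy / sxx;
    let intercept = mean_y - slope * mean_x;
    let correlation = sxy / (sxx.sqrt() * syy.sqrt());
    Some((slope, intercept, correlation))
}
```

A slope below 1 combined with a high correlation, as in the trace, points to a systematic scale or offset error rather than random noise.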

**Conclusion**: this is not a simple regression. The cuBLAS FP8 path for `hidden_dim=3584` (Qwen2.5-Coder-7B) has likely been broken across multiple releases — including v0.34.0 currently on crates.io — and only surfaced now because:

- The wgpu path silently fell back to CPU (without #1876's multi-step gate, it shipped its own different gibberish)
- The cuBLAS path's CUDA context poisoning intermittently masks the symptom across runs
- The `apr qa` Golden Output gate isn't wired into CI for the 7B teacher (per the 5-whys for #1864: "provable contracts only verify what's *executed*")

## Falsifier cascade (per `feedback_falsifier_cascade_decomposes_magnitude`)

Six PRs, ~50-200 LOC each, each one a falsifiable hypothesis test. Each landing is a §N spec amendment with empirical evidence.

### Stage A — Deterministic reproducer

**Hypothesis**: The gibberish is intermittent; bisection requires it to be deterministic first.

**Falsifier**: `cargo run --example cublas_fp8_7b_reproducer --release -p aprender-serve --features cuda` produces bit-identical output on 5 consecutive runs (either all-gibberish or all-correct).

**Deliverable**:
- `crates/aprender-serve/examples/cublas_fp8_7b_reproducer.rs` — minimal harness: load 7B Q4K, run single forward step on cuBLAS, dump logits to JSON, compare against a golden reference.
- Environment audit: identify all sources of non-determinism (CUDA stream order, JIT cache warmup, FP8 weight cache LRU). Document and pin them.
- Contract: `contracts/cublas-fp8-7b-determinism-v1.yaml` with `FALSIFY-CUBLAS-FP8-DET-001` (5 consecutive runs bit-identical).
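One way to express the "5 consecutive runs bit-identical" check is to compare raw `f32` bit patterns rather than float values. This is a sketch under the assumption that each run's logits are collected into a `Vec<f32>`; the function names are hypothetical, and the FNV-1a fingerprint is just a convenient stand-in for whatever checksum the harness ends up using.

```rust
/// 64-bit FNV-1a hash over the little-endian bytes of each logit's bit pattern.
/// Handy for logging one short fingerprint per run.
pub fn logits_fingerprint(logits: &[f32]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for value in logits {
        for byte in value.to_bits().to_le_bytes() {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(PRIME);
        }
    }
    hash
}

/// True when every run matches the next one bit for bit.
/// Bitwise comparison treats `0.0` and `-0.0` as different, and treats NaNs
/// with identical payloads as equal, which is what a determinism check wants.
pub fn runs_bit_identical(runs: &[Vec<f32>]) -> bool {
    runs.windows(2).all(|pair| {
        pair[0].len() == pair[1].len()
            && pair[0]
                .iter()
                .zip(&pair[1])
                .all(|(a, b)| a.to_bits() == b.to_bits())
    })
}
```

Comparing with `==` on floats would hide sign-of-zero differences and reject matching NaNs, so the bit-level comparison is the stricter and more useful signal here.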

**Out-of-scope**: actually fixing the gibberish. This stage only makes the bug visible.

### Stage B — Per-layer parity instrumentation

**Hypothesis**: The existing `GPU_DEBUG_ALL_LAYERS=1` only dumps Layer 0 input/output. To find the first divergent layer we need every layer's `(rms_norm_out, q, k, v, attn_out, ffn_out, residual)` checksummed.

**Falsifier**: `APR_PER_LAYER_PARITY_DUMP=1 cublas_fp8_7b_reproducer` writes one JSON per layer with `(layer_idx, stage, cpu_checksum, gpu_checksum, cosine, max_abs_diff)`. The 28 outputs make the divergence point trivially greppable.

**Deliverable**:
- New env-var-gated instrumentation in `crates/aprender-serve/src/cuda/executor/layers/cublas_prefill/` and the matching CPU forward in `crates/aprender-serve/src/gguf/runtime.rs`.
- Per-layer dump emits to `$APR_PER_LAYER_PARITY_DIR` (default `./trace-tensors/<run_id>/`).
- Contract: extend `apr-cpu-vs-gpu-output-parity-v1` with `per_layer_parity_dump` equation.
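A possible shape for one dump record is sketched below. It assumes the CPU and GPU tensors for a given `(layer_idx, stage)` have been copied to host `f32` slices; `ParityRecord` and `parity_record` are hypothetical names, the checksum fields are left out for brevity, and the JSON is hand-formatted on the assumption that stage names are plain identifiers needing no escaping.

```rust
/// One row of a per-layer parity dump (checksum fields omitted in this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct ParityRecord {
    pub layer_idx: usize,
    pub stage: String,
    pub cosine: f64,
    pub max_abs_diff: f64,
}

impl ParityRecord {
    /// Serialize as a single JSON object on one line, so the dump is greppable.
    pub fn to_json_line(&self) -> String {
        format!(
            "{{\"layer_idx\":{},\"stage\":\"{}\",\"cosine\":{},\"max_abs_diff\":{}}}",
            self.layer_idx, self.stage, self.cosine, self.max_abs_diff
        )
    }
}

/// Compare one CPU tensor against its GPU counterpart.
/// Returns `None` when the two slices differ in length.
pub fn parity_record(
    layer_idx: usize,
    stage: &str,
    cpu: &[f32],
    gpu: &[f32],
) -> Option<ParityRecord> {
    if cpu.len() != gpu.len() {
        return None;
    }
    let (mut dot, mut cpu_sq, mut gpu_sq, mut max_abs_diff) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    for (&c, &g) in cpu.iter().zip(gpu) {
        let (c, g) = (f64::from(c), f64::from(g));
        dot += c * g;
        cpu_sq += c * c;
        gpu_sq += g * g;
        max_abs_diff = max_abs_diff.max((c - g).abs());
    }
    let denom = cpu_sq.sqrt() * gpu_sq.sqrt();
    // Two all-zero tensors count as identical; exactly one all-zero tensor counts as divergent.
    let cosine = if denom > 0.0 {
        dot / denom
    } else if cpu_sq == gpu_sq {
        1.0
    } else {
        0.0
    };
    Some(ParityRecord { layer_idx, stage: stage.to_string(), cosine, max_abs_diff })
}
```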

**Output**: a single sentence — "First layer where cosine drops below 0.99: layer K, stage S".
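Producing that sentence is a linear scan over the records in execution order. The sketch below assumes the records are already sorted by layer and stage and are reduced to `(layer_idx, stage, cosine)` tuples; `divergence_summary` is a hypothetical helper, and a NaN cosine is treated as divergent.

```rust
/// Report the first `(layer, stage)` whose cosine falls below `threshold`.
/// `records` must be in execution order; NaN cosines count as divergent.
pub fn divergence_summary(records: &[(usize, &str, f64)], threshold: f64) -> String {
    let first = records
        .iter()
        .find(|(_, _, cosine)| cosine.is_nan() || *cosine < threshold);
    match first {
        Some((layer, stage, _)) => format!(
            "First layer where cosine drops below {}: layer {}, stage {}",
            threshold, layer, stage
        ),
        None => format!("No stage dropped below cosine {}", threshold),
    }
}
```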

### Stage C — Embed lookup parity

**Hypothesis**: The Layer 0 Q/K inputs already differ (per the 2026-05-22 trace), so the divergence may originate in the embed lookup itself (token_id → embedding vector). 7B has `embed_tokens.weight = [152064, 3584]` in Q4_K, which dequantizes to about 2180 MB in F32 (152064 × 3584 × 4 bytes). That is at the wgpu max-binding limit but well under cuBLAS limits.
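The size figure follows directly from the tensor shape. The sketch below reproduces the arithmetic; the Q4_K figure assumes ggml's usual `block_q4_K` layout of 144 bytes per 256 weights, which is an assumption about the file format rather than something this spec states.

```rust
pub const VOCAB: u64 = 152_064;
pub const HIDDEN: u64 = 3_584;

/// Bytes needed for the embedding table once dequantized to F32.
pub fn embed_f32_bytes() -> u64 {
    VOCAB * HIDDEN * 4
}

/// Bytes for the same table stored as Q4_K
/// (assumed layout: 144-byte super-blocks covering 256 weights each).
pub fn embed_q4k_bytes() -> u64 {
    VOCAB * HIDDEN / 256 * 144
}
```

`embed_f32_bytes()` evaluates to 2,179,989,504 bytes, which is about 2180 MB in decimal units or exactly 2079 MiB.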

**Falsifier**: For `token_id = 791` (the example's probe), `cpu_embed[..16] == gpu_embed[..16]` bit-exactly (or within FP8 quantization error if the path uses FP8 embed).

**Deliverable**:
- Add a sub-target to the Stage A reproducer: `--check-embed` mode that compares the first 16 elements of the embedding lookup on both backends.
- If embed differs → fix here (likely a Q4_K dequant offset or layout issue).
- If embed matches → cascade continues to Stage D.
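The comparison behind a `--check-embed` mode could look like the following sketch. It assumes both embedding rows have been copied to host memory as `f32`; `first_embed_mismatch` is a hypothetical helper, and `abs_tol = 0.0` selects the bit-exact comparison while a positive value stands in for an FP8 quantization tolerance that would still need to be chosen.

```rust
/// Index of the first element in `[0, n)` where the CPU and GPU embeddings disagree,
/// or `None` if the whole probe window matches.
/// With `abs_tol == 0.0` the comparison is bit-exact; otherwise it is `|a - b| <= abs_tol`.
pub fn first_embed_mismatch(cpu: &[f32], gpu: &[f32], n: usize, abs_tol: f32) -> Option<usize> {
    for i in 0..n {
        match (cpu.get(i), gpu.get(i)) {
            (Some(a), Some(b)) => {
                let matches = if abs_tol == 0.0 {
                    a.to_bits() == b.to_bits()
                } else {
                    // A NaN on either side fails this test and is reported as a mismatch.
                    (a - b).abs() <= abs_tol
                };
                if !matches {
                    return Some(i);
                }
            }
            // One side is shorter than the probe window.
            _ => return Some(i),
        }
    }
    None
}
```

Reporting the index rather than a boolean helps distinguish an offset or layout error (mismatch at index 0) from a block-boundary error (mismatch starting partway through).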

### Stage D — Pre-attention RMSNorm parity

**Hypothesis**: After embed, the first operation is RMSNorm. If the CPU uses `eps = 1e-6` and the GPU uses `eps = 1e-5` (the PMAT-698n bug class from prior cascades), the eps mismatch could explain the Q/K divergence.

**Falsifier**: `cpu_rms_norm(embed[token_id]) == gpu_rms_norm(embed[token_id])` within 1 ULP for all 3584 hidden dims.

**Deliverable**:
- Test in `crates/aprender-serve/tests/cublas_fp8_rms_norm_parity.rs`.
- Verify both backends use the same `rms_norm_eps` from the GGUF metadata.
- If RMSNorm differs → fix the eps/algorithm.
- If RMSNorm matches → cascade continues to Stage E.
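To illustrate why an eps mismatch matters, here is a reference RMSNorm and a ULP-distance helper. This is a sketch of the textbook formula `x * w / sqrt(mean(x^2) + eps)`, not the project's CPU or GPU kernel, and `ulp_distance` assumes finite inputs.

```rust
/// Reference RMSNorm: `y[i] = x[i] * weight[i] / sqrt(mean(x^2) + eps)`.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Number of representable f32 values between `a` and `b` (finite inputs only).
/// `0.0` and `-0.0` are treated as the same point.
pub fn ulp_distance(a: f32, b: f32) -> u64 {
    // Map the sign-magnitude bit pattern onto a monotonic integer line.
    fn key(v: f32) -> i64 {
        let bits = v.to_bits();
        let magnitude = i64::from(bits & 0x7fff_ffff);
        if bits & 0x8000_0000 != 0 { -magnitude } else { magnitude }
    }
    (key(a) - key(b)).unsigned_abs()
}
```

The effect of eps grows as activations shrink: for a row whose mean square is around `1e-6`, switching eps from `1e-6` to `1e-5` changes the normalization factor by more than a factor of two, far outside a 1 ULP budget.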

### Stage E — Q/K/V FP8 matmul parity

**Hypothesis**: The cuBLASLt FP8 GEMM at shape `(3584, batch, 3584)` produces wrong results for Qwen2's specific weight layout. This is the most likely candidate (per the layer-0 trace showing Q/K divergence after RMSNorm + projection).

**Falsifier**: For a fixed `rms_normed_input`, `Q_cpu = X @ Wq` (F32 matmul) and `Q_cublaslt_fp8 = X @ Wq` (FP8 matmul) agree within 1% relative error across all 3584 output dims.

**Deliverable**:
- Test in `crates/aprender-serve/tests/cublas_fp8_qkv_parity.rs`.
- Compare against `cublasLtMatmul` reference invocation with known inputs.
- Likely root cause space: FP8 scale calibration, accumulator precision, algorithm selection, JIT-compile cache miss reusing wrong kernel.
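The CPU side of this falsifier can be sketched as a naive F32 projection plus a relative-error metric. Assumptions here: a single input row (batch of 1), a row-major weight with one row per output dim, and a `floor` on the denominator so near-zero reference values do not blow up the ratio. The names `matvec` and `max_rel_error` are hypothetical, and the FP8 result is taken as given rather than simulated.

```rust
/// Naive F32 reference projection: `out[r] = sum_c w[r * cols + c] * x[c]`.
/// `w` is row-major with `rows` output dims and `cols` input dims (`cols > 0`).
pub fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    w.chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Worst per-element relative error `|ref - cand| / max(|ref|, floor)`.
/// Returns infinity if any element comparison yields NaN, so a NaN output
/// can never pass a threshold check.
pub fn max_rel_error(reference: &[f32], candidate: &[f32], floor: f32) -> f32 {
    assert_eq!(reference.len(), candidate.len());
    let mut worst = 0.0f32;
    for (r, c) in reference.iter().zip(candidate) {
        let err = (r - c).abs() / r.abs().max(floor);
        if err.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(err);
    }
    worst
}
```

With these helpers the 1% criterion reads `max_rel_error(&q_cpu, &q_fp8, floor) < 0.01`; the choice of `floor` is itself a judgment call that the parity contract would need to pin down.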

### Stage F — Root-cause fix + contract amendment

**Hypothesis**: Whatever Stages C/D/E surface as the divergence source has a known fix in one of:
- `aprender-gpu/src/kernels/quantize/fp8/` (FP8 quantization)
- `aprender-serve/src/cuda/executor/layers/cublas_prefill/` (call-site setup)
- `trueno-gpu` (upstream cuBLAS driver — needs vendor PR if so)

**Falsifier**: `apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf` reports `✓ PASS Golden Output` on **5 consecutive runs** post-fix (per Stage A's determinism contract).

**Deliverable**:
- The actual fix (commit + tests).
- Contract amendment to `apr-cpu-vs-gpu-output-parity-v1` (probably v1.7.0 → v1.8.0):
  - new equation `cublas_fp8_7b_correctness`
  - new falsifier `FALSIFY-CPU-GPU-007` LIVE-DISCHARGED with 5-run evidence
- Release v0.35.0 once F lands.

## Why v0.35.0 holds

Per `feedback_release_only_after_bug_hunt`: don't ship a release with a major-bug-just-found, even if the bug pre-existed. Users running 7B Q4K on cuBLAS today get gibberish from `apr serve` HTTP and `apr code` (which uses serve). That's load-bearing functionality that needs a real fix, not a known-issue note.

v0.34.0 already has this bug, so holding v0.35.0 doesn't make things worse. The 8 PRs (#1867, #1868, #1870, #1872, #1873, #1875, #1876, #1878) continue to merge into main as individual fixes; they're net positive regardless of when v0.35.0 cuts.

## Open questions for the kickoff

1. Does the bug reproduce on Blackwell GB10 (sm_121) too, or is it sm_89-specific?
2. Does the bug reproduce on Qwen2.5-7B variants with a *different* hidden_dim (e.g., a fictional 3712-dim variant)? — would test whether `hidden=3584` is the trigger or whether it's Qwen2 7B specifically.
3. Does the bug reproduce on `apr finetune` (training) too? — would test whether cuBLAS FP8 training is also broken, not just inference.
4. What's the relationship to closed bugs #374 and #559 (both fixed for sm_121, this reproducer is sm_89)?

## Out of scope

- 30B-MoE inference (#1583, separate epic)
- wgpu kernel-level fix (covered by #1864 sub-issue once #1876 lands)
- Apollo/jetson cross-platform validation (separate hardware-availability work)

## References

- [#1864](https://github.com/paiml/aprender/issues/1864) — tracking issue
- PR [#1876](https://github.com/paiml/aprender/pull/1876) — wgpu side fix (multi-step parity gate)
- `contracts/apr-cpu-vs-gpu-output-parity-v1.yaml` — current parity contract
- `crates/aprender-serve/examples/layer_by_layer_trace.rs` — existing diagnostic
- `memory/feedback_falsifier_cascade_decomposes_magnitude.md` — 6-stage cascade pattern
- `memory/feedback_release_only_after_bug_hunt.md` — release-hold justification
- `memory/feedback_test_methodology_can_fake_bugs.md` — why the initial bisect was invalid