fix(wgpu-parity): multi-step gate catches 7B Q4K autoregressive drift (#1864 wgpu) by noahgift · Pull Request #1876 · paiml/aprender

noahgift · 2026-05-22T08:41:06Z

Summary

Closes the wgpu side of #1864 (the cuBLAS FP8 side stays open pending bisect).

apr run on Qwen2.5-7B Q4K via wgpu produced ampiezza gibberish with exit 0. Two loopholes — both now closed:

| Loophole | Where | Fix |
|---|---|---|
| GGUF wgpu path had NO parity gate at all | `try_wgpu_generate:61` | Add multi-step gate (matches APR path) |
| Single-step gate missed autoregressive drift | `try_apr_wgpu_inference:455` | Extend to multi-step (3 by default) |

Five-whys

1. `apr run` on 7B Q4K wgpu produces "ampiezza" with exit 0.
2. Init-time parity gate (FALSIFY-CPU-GPU-005) compares only step 0; cosine ≥ 0.99 there.
3. Subsequent autoregressive steps drift because the wgpu KV cache accumulates error.
4. The single-step probe couldn't observe drift because it never advanced past step 0.
5. Root cause: the gate's domain was too narrow (one step instead of N).

Fix (FALSIFY-CPU-GPU-006 / `multi_step_parity_gate`)

For N=3 steps (default; override via APR_WGPU_PARITY_STEPS in [1, 16]):

- Run CPU `forward_single_with_cache` + wgpu `fwd.forward_layer`, advancing both through the same deterministic token sequence (CPU argmax on `cpu_logits` steers both).
- Cosine-compare full vocab logits at every step.
- ANY cosine < 0.99 emits `WGPU_FALLBACK_LOG_PREFIX` tagged `at step <k>/<N>` + returns Err → caller falls back to CPU.
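The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the closures `cpu_step` and `gpu_step` are hypothetical stand-ins for `forward_single_with_cache` and the wgpu forward pass, and the helper names `cosine` and `argmax` are assumptions made for the example. Steps are 0-indexed to match the `at step 1/3` tag in the live trace.

```rust
/// Cosine similarity between two logit vectors.
/// Returns 0.0 (a guaranteed gate failure) on length mismatch or zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        na += x as f64 * x as f64;
        nb += y as f64 * y as f64;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    (dot / (na.sqrt() * nb.sqrt())) as f32
}

/// Greedy pick: index of the largest logit (first maximum wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Hypothetical multi-step gate. Each closure takes (token, step) and
/// returns full-vocab logits; both are fed the same token at every step.
fn multi_step_parity_gate<C, G>(
    mut cpu_step: C,
    mut gpu_step: G,
    first_token: u32,
    steps: usize,
    threshold: f32,
) -> Result<(), String>
where
    C: FnMut(u32, usize) -> Vec<f32>,
    G: FnMut(u32, usize) -> Vec<f32>,
{
    let mut token = first_token;
    for k in 0..steps {
        let cpu_logits = cpu_step(token, k);
        let gpu_logits = gpu_step(token, k);
        let cos = cosine(&cpu_logits, &gpu_logits);
        if cos < threshold {
            // The caller is expected to log this reason and fall back to CPU.
            return Err(format!(
                "cosine vs CPU = {cos:.6} (< {threshold}) at step {k}/{steps}"
            ));
        }
        // The CPU argmax steers both backends so the sequences stay identical.
        token = argmax(&cpu_logits);
    }
    Ok(())
}
```

Driving both backends from the CPU argmax keeps the comparison meaningful: if each backend followed its own argmax, a single divergent token would make every later step incomparable rather than merely drifted.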

Single-step (N=1) is backward-compatible by construction (the v1.3.0 gate).
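The `APR_WGPU_PARITY_STEPS` override mentioned above could be resolved along these lines. This is a sketch under assumptions: the PR text only states the default (3) and the range [1, 16], so clamping out-of-range values and falling back to the default on unparseable input are guesses, and the function and constant names are hypothetical.

```rust
const DEFAULT_PARITY_STEPS: usize = 3;
const MIN_PARITY_STEPS: usize = 1;
const MAX_PARITY_STEPS: usize = 16;

/// Resolve the step count from the raw environment value.
/// Assumption: out-of-range values are clamped into [1, 16] and
/// missing or unparseable values fall back to the default of 3.
fn parity_steps_from(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .map(|n| n.clamp(MIN_PARITY_STEPS, MAX_PARITY_STEPS))
        .unwrap_or(DEFAULT_PARITY_STEPS)
}

/// Read APR_WGPU_PARITY_STEPS from the process environment.
#[allow(dead_code)]
fn parity_steps() -> usize {
    parity_steps_from(std::env::var("APR_WGPU_PARITY_STEPS").ok().as_deref())
}
```

Splitting the pure parser from the environment read keeps the range logic testable without mutating process-wide state. Setting the variable to `1` would reproduce the single-step behavior described above.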

Live discharge (RTX 4090, this host)

$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    "What is 2+2?" --max-tokens 16

Backend: wgpu (Vulkan)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[apr-cpu-vs-gpu-output-parity-v1] wgpu path rejected,
  attempting fallback: cosine vs CPU = 0.722249 (< 0.99) at step 1/3
Output:
2 + 2 equals 4.    ← correct CPU output (was: "ampiezza")

Pre-fix on the same host/model: ampiezza\nampiezza (per #1864).
Post-fix: correct CPU answer + clear tagged rejection log.
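The caller-side behavior seen in the trace (gate rejects, a tagged line is logged, CPU produces the answer) might look like the sketch below. The value given to `WGPU_FALLBACK_LOG_PREFIX` is inferred from the log line above and may not match the real constant; `Backend` and `select_backend` are hypothetical names used only for illustration.

```rust
/// Assumed value, reconstructed from the trace; the real constant may differ.
const WGPU_FALLBACK_LOG_PREFIX: &str =
    "[apr-cpu-vs-gpu-output-parity-v1] wgpu path rejected, attempting fallback:";

#[derive(Debug, PartialEq)]
enum Backend {
    Wgpu,
    Cpu,
}

/// Run the parity gate once at init. On rejection, emit the tagged log
/// line to stderr and select the CPU backend instead of failing the run.
fn select_backend(gate: impl FnOnce() -> Result<(), String>) -> (Backend, Option<String>) {
    match gate() {
        Ok(()) => (Backend::Wgpu, None),
        Err(reason) => {
            let line = format!("{} {}", WGPU_FALLBACK_LOG_PREFIX, reason);
            eprintln!("{}", line);
            (Backend::Cpu, Some(line))
        }
    }
}
```

Returning the log line alongside the chosen backend is a convenience for tests; a real caller would more likely just log and proceed with CPU inference.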

Contract

apr-cpu-vs-gpu-output-parity-v1.yaml → v1.6.0:

- new equation `multi_step_parity_gate`
- new falsifier `FALSIFY-CPU-GPU-006` (LIVE_DISCHARGED 2026-05-22)
- algorithm_evidence includes the full smoke trace above

Scope

This closes the wgpu visibility + autoregressive-drift loophole — no more silent gibberish reaches users. The underlying wgpu kernel bug that produces drift on hidden_dim=3584 (the actual numerical correctness issue in attention or FFN dispatch) remains open as a #1864 wgpu-kernel sub-issue. Until that root-cause lands, 7B Q4K wgpu correctly falls back to CPU at init.

The cuBLAS FP8 path (<|im_start|> gibberish on apr qa Golden Output) is a SEPARATE regression and stays open in #1864 pending bisection of v0.34.0..HEAD on cuda_chat_backend.rs.

Test plan

- `cargo check -p aprender-serve --features cuda`
- `cargo test fallback_log_prefix`: 3/3 PASS (backward compat)
- Live: 7B Q4K wgpu → CPU fallback → "2 + 2 equals 4." (was gibberish)
- FALSIFY-CPU-GPU-{001..005} backward compat (single-step reduces to v1.5.0 design)
- CI: workspace-test, contracts-lib, fmt

🤖 Generated with Claude Code

…#1864 wgpu)

`apr run /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf` produced "ampiezza" gibberish via the wgpu (Vulkan) backend with exit 0. Two distinct loopholes:

1. **GGUF wgpu path had NO parity gate at all.** `try_wgpu_generate` (gguf_gpu_generate.rs:61) loaded weights and entered the autoregressive loop without any CPU-vs-wgpu correctness check. The .apr wgpu path had the v1.3.0 single-step gate (FALSIFY-CPU-GPU-005); the .gguf path had nothing.
2. **Single-step gate was insufficient for autoregressive drift.** Even on the .apr path, the gate only compared step 0 logits. Qwen2.5-7B Q4K passes step 0 (cosine ≥ 0.99) but the wgpu KV cache accumulates error each step. By step 2-3 the output is "ampiezza"-class gibberish, but the init-time probe never advances past step 0 so it can't see the drift.

## Fix (FALSIFY-CPU-GPU-006 / multi_step_parity_gate)

Multi-step gate added to BOTH wgpu paths (try_wgpu_generate and try_apr_wgpu_inference). For N=3 steps (default; override via APR_WGPU_PARITY_STEPS in [1, 16]):

- Run CPU forward via `forward_single_with_cache` AND wgpu forward via `fwd.forward_layer`, advancing both through the SAME deterministic token sequence (CPU argmax on cpu_logits feeds both).
- Cosine-compare full vocab logits at every step.
- ANY cosine < 0.99 emits `WGPU_FALLBACK_LOG_PREFIX` tagged with `at step <k>/<N>` + returns Err → caller falls back to CPU.

Single-step (N=1) is backward-compatible by construction.

## Five-whys

1. `apr run` on 7B Q4K wgpu produces 'ampiezza' with exit 0.
2. Init-time gate compares only step 0; cos ≥ 0.99 there.
3. Subsequent autoregressive steps drift as the wgpu KV cache accumulates error.
4. The single-step probe couldn't observe drift because it never advanced past step 0.
5. Root cause: gate's domain was too narrow (one step instead of N).

## Live discharge

```
$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    "What is 2+2?" --max-tokens 16

Backend: wgpu (Vulkan)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[wgpu] Skipping weight 'lm_head' (...) — CPU fallback    ← known, benign
[apr-cpu-vs-gpu-output-parity-v1] wgpu path rejected,
  attempting fallback: cosine vs CPU = 0.722249 (< 0.99) at step 1/3
  ↑ NEW: step 1 cosine drift now caught
Output:
2 + 2 equals 4.    ← correct CPU output (was: "ampiezza")
Completed in 32.12s (cached)
```

Pre-fix output on the same host/model: "ampiezza\nampiezza" (per #1864 issue body).
Post-fix: correct CPU answer + clear tagged rejection log.

## Contract

`apr-cpu-vs-gpu-output-parity-v1.yaml` bumped to v1.6.0:

- new equation `multi_step_parity_gate`
- new falsifier `FALSIFY-CPU-GPU-006` (LIVE_DISCHARGED 2026-05-22)
- algorithm_evidence includes the full smoke trace above

## Scope

This closes the wgpu **visibility + autoregressive-drift** loophole: no more silent gibberish reaches users. The underlying wgpu kernel bug that produces drift on hidden_dim=3584 (the actual numerical correctness issue in attention or FFN dispatch) remains open as a #1864 wgpu-kernel sub-issue. Until that root-cause lands, 7B Q4K wgpu correctly falls back to CPU at init.

The cuBLAS FP8 path (`<|im_start|>` gibberish on `apr qa` Golden Output) is a SEPARATE regression and stays open in #1864 pending bisection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…itstream-io) (#1878)

`cargo deny check advisories` started failing on every PR (and on main) 2026-05-22 with:

    error[unmaintained]: core2 is unmaintained, all versions yanked
    ├ ID: RUSTSEC-2026-0105
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0105

The dep is pulled in transitively via `bitstream-io` (image/media decoding stack; `cargo tree` shows `bitstream-io v4.9.0 → core2 v0.4.0`). No first-party use; no drop-in replacement until upstream `bitstream-io` migrates off core2.

This commit unblocks the in-flight PR cascade (#1867 #1868 #1870 #1873 #1875 #1876) which all failed CI's `ci / lint` step on this advisory.

The deny entry is structured per the existing pattern in this file (id + human reason mentioning the transitive path) so revisiting the ignore in 6-12 months is straightforward.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

The first revision of this PR placed the new FALSIFY-CPU-GPU-006 block AFTER `proof_obligations:` instead of inside the `falsification_tests:` list. YAML parsed it as a 5th proof_obligation with no `property` field, which `aprender-contracts` lint correctly flagged:

    [ERROR] SCHEMA-005: proof_obligations[4].property must not be empty (apr-cpu-vs-gpu-output-parity-v1)

This commit moves the block back into the falsification_tests list (after F-CPU-GPU-005) and adds a 5th invariant proof_obligation that names the multi-step gate explicitly, so `total_obligations: 6` in verification_summary reconciles.

Also changes the F-006 status from `LIVE_DISCHARGED` to the existing enum value `DISCHARGED` (the live-discharge evidence remains in the algorithm_evidence block).

Runs `cargo test -p aprender-contracts --lib lint_passes_on_real_contracts` locally: PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 22, 2026 08:41

noahgift mentioned this pull request May 22, 2026

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Open

Merge branch 'main' into fix/1864-wgpu-no-silent-partial-fallback

0ef0895

noahgift mentioned this pull request May 22, 2026

chore(deny): ignore RUSTSEC-2026-0105 (core2 yanked, transitive via bitstream-io) #1878

Merged

3 tasks

noahgift and others added 5 commits May 22, 2026 11:47

Merge branch 'main' into fix/1864-wgpu-no-silent-partial-fallback

d29cfee

Merge branch 'main' into fix/1864-wgpu-no-silent-partial-fallback

e699b4f

Merge branch 'main' into fix/1864-wgpu-no-silent-partial-fallback

b7af79c

Merge branch 'main' into fix/1864-wgpu-no-silent-partial-fallback

5bf5c7e

noahgift merged commit bec4a2d into main May 22, 2026
10 checks passed

noahgift deleted the fix/1864-wgpu-no-silent-partial-fallback branch May 22, 2026 11:53

This was referenced May 22, 2026

spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882

Closed

fix(qa): add EOS stop_tokens to Golden Output gate — closes phantom #1864 cuBLAS #1890

Closed

noahgift merged 7 commits into main from fix/1864-wgpu-no-silent-partial-fallback
