fix(wgpu-parity): multi-step gate catches 7B Q4K autoregressive drift (#1864 wgpu) #1876
Merged
…#1864 wgpu)

`apr run /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf` produced "ampiezza" gibberish via the wgpu (Vulkan) backend with exit 0. Two distinct loopholes:

1. **GGUF wgpu path had NO parity gate at all.** `try_wgpu_generate` (gguf_gpu_generate.rs:61) loaded weights and entered the autoregressive loop without any CPU-vs-wgpu correctness check. The .apr wgpu path had the v1.3.0 single-step gate (FALSIFY-CPU-GPU-005); the .gguf path had nothing.
2. **Single-step gate was insufficient for autoregressive drift.** Even on the .apr path, the gate only compared step 0 logits. Qwen2.5-7B Q4K passes step 0 (cosine ≥ 0.99) but the wgpu KV cache accumulates error each step. By step 2-3 the output is "ampiezza"-class gibberish, but the init-time probe never advances past step 0 so it can't see the drift.

## Fix (FALSIFY-CPU-GPU-006 / multi_step_parity_gate)

Multi-step gate added to BOTH wgpu paths (try_wgpu_generate and try_apr_wgpu_inference). For N=3 steps (default; override via APR_WGPU_PARITY_STEPS in [1, 16]):

- Run CPU forward via `forward_single_with_cache` AND wgpu forward via `fwd.forward_layer`, advancing both through the SAME deterministic token sequence (CPU argmax on cpu_logits feeds both).
- Cosine-compare full vocab logits at every step.
- ANY cosine < 0.99 emits `WGPU_FALLBACK_LOG_PREFIX` tagged with `at step <k>/<N>` + returns Err → caller falls back to CPU.

Single-step (N=1) is backward-compatible by construction.

## Five-whys

1. `apr run` on 7B Q4K wgpu produces 'ampiezza' with exit 0.
2. Init-time gate compares only step 0; cos ≥ 0.99 there.
3. Subsequent autoregressive steps drift as the wgpu KV cache accumulates error.
4. The single-step probe couldn't observe drift because it never advanced past step 0.
5. Root cause: gate's domain was too narrow (one step instead of N).

## Live discharge

```
$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    "What is 2+2?" --max-tokens 16
Backend: wgpu (Vulkan)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[wgpu] Skipping weight 'lm_head' (...) — CPU fallback    ← known, benign
[apr-cpu-vs-gpu-output-parity-v1] wgpu path rejected, attempting fallback: cosine vs CPU = 0.722249 (< 0.99) at step 1/3
    ↑ NEW — step 1 cosine drift now caught
Output: 2 + 2 equals 4.    ← correct CPU output (was: "ampiezza")
Completed in 32.12s (cached)
```

Pre-fix output on the same host/model: "ampiezza\nampiezza" (per #1864 issue body). Post-fix: correct CPU answer + clear tagged rejection log.

## Contract

`apr-cpu-vs-gpu-output-parity-v1.yaml` bumped to v1.6.0:

- new equation `multi_step_parity_gate`
- new falsifier `FALSIFY-CPU-GPU-006` (LIVE_DISCHARGED 2026-05-22)
- algorithm_evidence includes the full smoke trace above

## Scope

This closes the wgpu **visibility + autoregressive-drift** loophole: no more silent gibberish reaches users. The underlying wgpu kernel bug that produces drift on hidden_dim=3584 (the actual numerical correctness issue in attention or FFN dispatch) remains open as a #1864 wgpu-kernel sub-issue. Until that root-cause lands, 7B Q4K wgpu correctly falls back to CPU at init.

The cuBLAS FP8 path (`<|im_start|>` gibberish on `apr qa` Golden Output) is a SEPARATE regression and stays open in #1864 pending bisection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
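The `APR_WGPU_PARITY_STEPS` override mentioned in the commit message could be read along the following lines. This is only a sketch: the helper names `parse_parity_steps` and `parity_steps` are hypothetical, and the choice to fall back to the default of 3 when the value is unparsable or outside [1, 16] is an assumption (the source does not say whether out-of-range values are rejected or clamped).

```rust
use std::env;

/// Default number of parity steps, as stated in the commit message.
const DEFAULT_STEPS: usize = 3;
/// Upper bound of the documented [1, 16] range.
const MAX_STEPS: usize = 16;

/// Hypothetical helper: turn the raw env value into a step count.
/// ASSUMPTION: unparsable or out-of-range values fall back to the default
/// rather than being clamped; the real behaviour is not shown in the PR.
fn parse_parity_steps(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|n| (1..=MAX_STEPS).contains(n))
        .unwrap_or(DEFAULT_STEPS)
}

/// Hypothetical wrapper that reads the documented environment variable.
#[allow(dead_code)]
fn parity_steps() -> usize {
    parse_parity_steps(env::var("APR_WGPU_PARITY_STEPS").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the range handling testable without mutating process-wide state.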
noahgift added a commit that referenced this pull request May 22, 2026
…itstream-io) (#1878)

`cargo deny check advisories` started failing on every PR (and on main) 2026-05-22 with:

    error[unmaintained]: core2 is unmaintained, all versions yanked
    ├ ID: RUSTSEC-2026-0105
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0105

The dep is pulled in transitively via `bitstream-io` (image/media decoding stack; `cargo tree` shows `bitstream-io v4.9.0 → core2 v0.4.0`). No first-party use; no drop-in replacement until upstream `bitstream-io` migrates off core2.

This commit unblocks the in-flight PR cascade (#1867 #1868 #1870 #1873 #1875 #1876), all of which failed CI's `ci / lint` step on this advisory. The deny entry is structured per the existing pattern in this file (id + human reason mentioning the transitive path) so revisiting the ignore in 6-12 months is straightforward.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
The first revision of this PR placed the new FALSIFY-CPU-GPU-006 block
AFTER `proof_obligations:` instead of inside the `falsification_tests:`
list. YAML parsed it as a 5th proof_obligation with no `property` field,
which `aprender-contracts` lint correctly flagged:
[ERROR] SCHEMA-005: proof_obligations[4].property must not be empty
(apr-cpu-vs-gpu-output-parity-v1)
This commit moves the block back into the falsification_tests list (after
F-CPU-GPU-005) and adds a 5th invariant proof_obligation that names the
multi-step gate explicitly, so `total_obligations: 6` in verification_summary
reconciles.
Also changes the F-006 status from `LIVE_DISCHARGED` to the existing enum
value `DISCHARGED` (the live-discharge evidence remains in the
algorithm_evidence block).
Ran `cargo test -p aprender-contracts --lib lint_passes_on_real_contracts`
locally: PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
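## Summary

Closes the wgpu side of #1864 (the cuBLAS FP8 side stays open pending bisect).

`apr run` on Qwen2.5-7B Q4K via wgpu produced `ampiezza` gibberish with exit 0. Two loopholes, both now closed:

- `try_wgpu_generate:61` (GGUF path): had no parity gate at all.
- `try_apr_wgpu_inference:455` (.apr path): had only the single-step gate, which cannot see autoregressive drift.

## Five-whys

1. `apr run` on 7B Q4K wgpu produces `ampiezza` with exit 0.
2. Init-time gate compares only step 0; cos ≥ 0.99 there.
3. Subsequent autoregressive steps drift as the wgpu KV cache accumulates error.
4. The single-step probe couldn't observe drift because it never advanced past step 0.
5. Root cause: gate's domain was too narrow (one step instead of N).

## Fix (FALSIFY-CPU-GPU-006 / `multi_step_parity_gate`)

For N=3 steps (default; override via `APR_WGPU_PARITY_STEPS` in [1, 16]):

- Run CPU `forward_single_with_cache` + wgpu `fwd.forward_layer`, advancing both through the same deterministic token sequence (CPU argmax on cpu_logits steers both).
- Cosine-compare full vocab logits at every step.
- Any cosine < 0.99 emits `WGPU_FALLBACK_LOG_PREFIX` tagged `at step <k>/<N>` + returns Err → caller falls back to CPU.

Single-step (N=1) is backward-compatible by construction (the v1.3.0 gate).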
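The gate logic described in the Fix section can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the closures `cpu_forward` and `gpu_forward` are hypothetical stand-ins for `forward_single_with_cache` and `fwd.forward_layer` (whose real signatures are not shown here), the helper names `cosine` and `argmax` are invented for the example, and 0-based step numbering is an assumption inferred from the `at step 1/3` log tag.

```rust
/// Cosine similarity between two logit vectors (0.0 if either is all zeros).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na * nb)
}

/// Index of the largest logit (greedy token choice).
fn argmax(v: &[f32]) -> u32 {
    v.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}

/// Hypothetical multi-step gate: both backends consume the same token at
/// every step, and the next token is always chosen from the CPU logits so
/// the two token sequences cannot diverge.
fn multi_step_parity_gate(
    first_token: u32,
    steps: usize,
    threshold: f32,
    mut cpu_forward: impl FnMut(u32, usize) -> Vec<f32>,
    mut gpu_forward: impl FnMut(u32, usize) -> Vec<f32>,
) -> Result<(), String> {
    let mut token = first_token;
    for k in 0..steps {
        let cpu_logits = cpu_forward(token, k);
        let gpu_logits = gpu_forward(token, k);
        let cos = cosine(&cpu_logits, &gpu_logits);
        if cos < threshold {
            // Caller is expected to treat Err as "fall back to CPU".
            return Err(format!(
                "cosine vs CPU = {cos:.6} (< {threshold}) at step {k}/{steps}"
            ));
        }
        token = argmax(&cpu_logits); // CPU argmax steers both backends
    }
    Ok(())
}
```

With `steps = 1` the loop degenerates to a single step-0 comparison, which is why the single-step gate is a special case of this shape.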
## Live discharge (RTX 4090, this host)

Pre-fix on the same host/model: `ampiezza\nampiezza` (per #1864).

Post-fix: correct CPU answer + clear tagged rejection log.

## Contract

`apr-cpu-vs-gpu-output-parity-v1.yaml` → v1.6.0:

- `multi_step_parity_gate`
- `algorithm_evidence` includes the full smoke trace above

## Scope

This closes the wgpu visibility + autoregressive-drift loophole: no more silent gibberish reaches users. The underlying wgpu kernel bug that produces drift on `hidden_dim=3584` (the actual numerical correctness issue in attention or FFN dispatch) remains open as a #1864 wgpu-kernel sub-issue. Until that root-cause lands, 7B Q4K wgpu correctly falls back to CPU at init.

The cuBLAS FP8 path (`<|im_start|>` gibberish on `apr qa` Golden Output) is a SEPARATE regression and stays open in #1864 pending bisection of v0.34.0..HEAD on `cuda_chat_backend.rs`.

## Test plan
🤖 Generated with Claude Code