fix(wgpu-parity): multi-step gate catches 7B Q4K autoregressive drift (#1864 wgpu) #1876

Merged
noahgift merged 7 commits into main from fix/1864-wgpu-no-silent-partial-fallback
May 22, 2026

@noahgift
Contributor

Summary

Closes the wgpu side of #1864 (the cuBLAS FP8 side stays open pending bisect).

apr run on Qwen2.5-7B Q4K via wgpu produced "ampiezza" gibberish with exit 0. Two loopholes allowed this; both are now closed:

| Loophole | Where | Fix |
|---|---|---|
| GGUF wgpu path had NO parity gate at all | try_wgpu_generate:61 | Add multi-step gate (matches APR path) |
| Single-step gate missed autoregressive drift | try_apr_wgpu_inference:455 | Extend to multi-step (3 by default) |

Five-whys

  1. apr run on 7B Q4K wgpu produces ampiezza with exit 0.
  2. Init-time parity gate (FALSIFY-CPU-GPU-005) compares only step 0; cosine ≥ 0.99 there.
  3. Subsequent autoregressive steps drift because the wgpu KV cache accumulates error.
  4. The single-step probe couldn't observe drift because it never advanced past step 0.
  5. Root cause: the gate's domain was too narrow (one step instead of N).

Fix (FALSIFY-CPU-GPU-006 / multi_step_parity_gate)

For N=3 steps (default; override via APR_WGPU_PARITY_STEPS in [1, 16]):

  • Run CPU forward_single_with_cache + wgpu fwd.forward_layer, advancing both through the same deterministic token sequence (CPU argmax on cpu_logits steers both).
  • Cosine-compare full vocab logits at every step.
  • ANY cosine < 0.99 emits WGPU_FALLBACK_LOG_PREFIX tagged at step <k>/<N> + returns Err → caller falls back to CPU.

Single-step (N=1) is backward-compatible by construction (the v1.3.0 gate).
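
The gate loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `multi_step_parity_gate`, `cosine`, `argmax`, and the closure-based `cpu_step`/`gpu_step` parameters are hypothetical stand-ins for `forward_single_with_cache` and the wgpu forward pass, and the 0-based step index in the error message is an assumption.

```rust
/// Cosine similarity between two equal-length logit vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Index of the largest logit (greedy token choice).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, v) in logits.iter().enumerate() {
        if *v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Hypothetical gate. `cpu_step` and `gpu_step` each take a token id,
/// advance their own KV cache, and return full-vocab logits.
fn multi_step_parity_gate(
    first_token: u32,
    steps: usize,
    threshold: f32,
    mut cpu_step: impl FnMut(u32) -> Vec<f32>,
    mut gpu_step: impl FnMut(u32) -> Vec<f32>,
) -> Result<(), String> {
    let mut token = first_token;
    for k in 0..steps {
        let cpu_logits = cpu_step(token);
        let gpu_logits = gpu_step(token);
        let cos = cosine(&cpu_logits, &gpu_logits);
        // Treat NaN as a failure too, so a broken GPU path cannot slip through.
        if cos.is_nan() || cos < threshold {
            // Step numbering (0-based here) is an assumption of this sketch.
            return Err(format!(
                "cosine vs CPU = {cos:.6} (< {threshold}) at step {k}/{steps}"
            ));
        }
        // The CPU argmax steers BOTH backends, keeping the sequence deterministic.
        token = argmax(&cpu_logits);
    }
    Ok(())
}
```

With `steps = 1` the loop compares only the first pair of logits, which is why the single-step case reduces to the earlier gate. Feeding both backends the CPU's argmax means any divergence measured at a later step comes from accumulated backend state, not from the two sides having sampled different tokens.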

Live discharge (RTX 4090, this host)

$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    "What is 2+2?" --max-tokens 16

Backend: wgpu (Vulkan)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[apr-cpu-vs-gpu-output-parity-v1] wgpu path rejected,
  attempting fallback: cosine vs CPU = 0.722249 (< 0.99) at step 1/3
Output:
2 + 2 equals 4.    ← correct CPU output (was: "ampiezza")

Pre-fix on the same host/model: ampiezza\nampiezza (per #1864).
Post-fix: correct CPU answer + clear tagged rejection log.

Contract

apr-cpu-vs-gpu-output-parity-v1.yaml bumped to v1.6.0:

  • new equation multi_step_parity_gate
  • new falsifier FALSIFY-CPU-GPU-006 (LIVE_DISCHARGED 2026-05-22)
  • algorithm_evidence includes the full smoke trace above

Scope

This closes the wgpu visibility + autoregressive-drift loophole: silent gibberish no longer reaches users. The underlying wgpu kernel bug that produces drift on hidden_dim=3584 (the actual numerical correctness issue in attention or FFN dispatch) remains open as a #1864 wgpu-kernel sub-issue. Until that root-cause fix lands, 7B Q4K wgpu correctly falls back to CPU at init.

The cuBLAS FP8 path (<|im_start|> gibberish on apr qa Golden Output) is a SEPARATE regression and stays open in #1864 pending bisection of v0.34.0..HEAD on cuda_chat_backend.rs.

Test plan

  • cargo check -p aprender-serve --features cuda
  • cargo test fallback_log_prefix — 3/3 PASS (backward compat)
  • Live: 7B Q4K wgpu → CPU fallback → "2 + 2 equals 4." (was gibberish)
  • FALSIFY-CPU-GPU-{001..005} backward compat (single-step reduces to v1.5.0 design)
  • CI: workspace-test, contracts-lib, fmt

🤖 Generated with Claude Code

…#1864 wgpu)

`apr run /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf` produced
"ampiezza" gibberish via the wgpu (Vulkan) backend with exit 0. Two
distinct loopholes:

1. **GGUF wgpu path had NO parity gate at all.** `try_wgpu_generate`
   (gguf_gpu_generate.rs:61) loaded weights and entered the autoregressive
   loop without any CPU-vs-wgpu correctness check. The .apr wgpu path had
   the v1.3.0 single-step gate (FALSIFY-CPU-GPU-005); the .gguf path had
   nothing.

2. **Single-step gate was insufficient for autoregressive drift.** Even on
   the .apr path, the gate only compared step 0 logits. Qwen2.5-7B Q4K
   passes step 0 (cosine ≥ 0.99) but the wgpu KV cache accumulates error
   each step. By step 2-3 the output is "ampiezza"-class gibberish, but
   the init-time probe never advances past step 0 so it can't see the drift.

## Fix (FALSIFY-CPU-GPU-006 / multi_step_parity_gate)

Multi-step gate added to BOTH wgpu paths (try_wgpu_generate and
try_apr_wgpu_inference). For N=3 steps (default; override via
APR_WGPU_PARITY_STEPS in [1, 16]):

- Run CPU forward via `forward_single_with_cache` AND wgpu forward via
  `fwd.forward_layer`, advancing both through the SAME deterministic
  token sequence (CPU argmax on cpu_logits feeds both).
- Cosine-compare full vocab logits at every step.
- ANY cosine < 0.99 emits `WGPU_FALLBACK_LOG_PREFIX` tagged with
  `at step <k>/<N>` + returns Err → caller falls back to CPU.

Single-step (N=1) is backward-compatible by construction.
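
One way the `APR_WGPU_PARITY_STEPS` override could be read is sketched below. The helper names `parity_steps_from` and `parity_steps` are hypothetical, and two behaviours are assumptions not stated in this commit: out-of-range values are clamped into [1, 16], and unparsable values fall back to the default of 3.

```rust
/// Hypothetical helper: turn a raw override string into a step count.
/// Assumption: out-of-range values are clamped, bad input uses the default.
fn parity_steps_from(raw: Option<&str>) -> usize {
    const DEFAULT_STEPS: usize = 3;
    match raw.and_then(|s| s.trim().parse::<usize>().ok()) {
        Some(n) => n.clamp(1, 16),
        None => DEFAULT_STEPS,
    }
}

/// Hypothetical wrapper that reads the environment variable named above.
fn parity_steps() -> usize {
    parity_steps_from(std::env::var("APR_WGPU_PARITY_STEPS").ok().as_deref())
}
```

Splitting the parsing from the environment read keeps the range logic testable without mutating process-wide state.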

## Five-whys

1. `apr run` on 7B Q4K wgpu produces 'ampiezza' with exit 0.
2. Init-time gate compares only step 0; cos ≥ 0.99 there.
3. Subsequent autoregressive steps drift as the wgpu KV cache
   accumulates error.
4. The single-step probe couldn't observe drift because it never
   advanced past step 0.
5. Root cause: gate's domain was too narrow (one step instead of N).

## Live discharge

```
$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    "What is 2+2?" --max-tokens 16

Backend: wgpu (Vulkan)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[wgpu] Skipping weight 'lm_head' (...) — CPU fallback   ← known, benign
[apr-cpu-vs-gpu-output-parity-v1] wgpu path rejected,
  attempting fallback: cosine vs CPU = 0.722249 (< 0.99) at step 1/3
                                                   ↑
                          NEW — step 1 cosine drift now caught
Output:
2 + 2 equals 4.    ← correct CPU output (was: "ampiezza")

Completed in 32.12s (cached)
```

Pre-fix output on the same host/model: "ampiezza\nampiezza" (per #1864
issue body). Post-fix: correct CPU answer + clear tagged rejection log.

## Contract

`apr-cpu-vs-gpu-output-parity-v1.yaml` bumped to v1.6.0:
- new equation `multi_step_parity_gate`
- new falsifier `FALSIFY-CPU-GPU-006` (LIVE_DISCHARGED 2026-05-22)
- algorithm_evidence includes the full smoke trace above

## Scope

This closes the wgpu **visibility + autoregressive-drift** loophole — no
more silent gibberish reaches users. The underlying wgpu kernel bug that
produces drift on hidden_dim=3584 (the actual numerical correctness issue
in attention or FFN dispatch) remains open as a #1864 wgpu-kernel
sub-issue. Until that root-cause fix lands, 7B Q4K wgpu correctly falls
back to CPU at init.

The cuBLAS FP8 path (`<|im_start|>` gibberish on `apr qa` Golden Output)
is a SEPARATE regression and stays open in #1864 pending bisection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 22, 2026
…itstream-io) (#1878)

`cargo deny check advisories` started failing on every PR (and on main)
2026-05-22 with:

    error[unmaintained]: core2 is unmaintained, all versions yanked
    ├ ID: RUSTSEC-2026-0105
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0105

The dep is pulled in transitively via `bitstream-io` (image/media decoding
stack — `cargo tree` shows `bitstream-io v4.9.0 → core2 v0.4.0`). No
first-party use; no drop-in replacement until upstream `bitstream-io`
migrates off core2.

This commit unblocks the in-flight PR cascade (#1867 #1868 #1870 #1873
#1875 #1876) which all failed CI's `ci / lint` step on this advisory.
The deny entry is structured per the existing pattern in this file (id +
human reason mentioning the transitive path) so revisiting the ignore in
6-12 months is straightforward.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift and others added 5 commits May 22, 2026 11:47
The first revision of this PR placed the new FALSIFY-CPU-GPU-006 block
AFTER `proof_obligations:` instead of inside the `falsification_tests:`
list. YAML parsed it as a 5th proof_obligation with no `property` field,
which `aprender-contracts` lint correctly flagged:

    [ERROR] SCHEMA-005: proof_obligations[4].property must not be empty
    (apr-cpu-vs-gpu-output-parity-v1)

This commit moves the block back into the falsification_tests list (after
F-CPU-GPU-005) and adds a 5th invariant proof_obligation that names the
multi-step gate explicitly, so `total_obligations: 6` in verification_summary
reconciles.

Also changes the F-006 status from `LIVE_DISCHARGED` to the existing enum
value `DISCHARGED` (the live-discharge evidence remains in the
algorithm_evidence block).

Ran `cargo test -p aprender-contracts --lib lint_passes_on_real_contracts`
locally: PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit bec4a2d into main May 22, 2026
10 checks passed
@noahgift noahgift deleted the fix/1864-wgpu-no-silent-partial-fallback branch May 22, 2026 11:53