feat(cublas-fp8-7b-stage-a): deterministic reproducer for #1864 (SPEC Stage A) #1884
Closed
noahgift wants to merge 2 commits into
…S gibberish
SPEC-CUBLAS-FP8-7B-FIX-001 Stage A: a minimal standalone reproducer that
isolates the cuBLAS FP8 7B Q4K forward step and produces bit-identical
JSON output across consecutive runs. This unblocks Stages B-F, which need a
deterministic oracle.
## Live discharge (2026-05-22, noah-Lambda-Vector RTX 4090)
5 consecutive invocations produced byte-identical JSON:
```
$ for i in 1 2 3 4 5; do
MODEL_PATH=/home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
target/release/examples/cublas_fp8_7b_reproducer
done
{"cpu_argmax_idx":75311,"cpu_argmax_val":11.554419,
"gpu_argmax_idx":1057,"gpu_argmax_val":11.132793,
"correlation":0.986986,
"cpu_logits_fnv1a":"dd5d13626cb48c40",
"gpu_logits_fnv1a":"6748eb76f78f8683",
"agrees_with_cpu":false}
(× 5)
```
FNV-1a fingerprints are identical across runs → logit bytes are bit-identical
(barring a hash collision).
The 2026-05-22 bisect failed because `apr qa`'s multi-gate sequence
shared a CUDA context that was intermittently poisoned between gates. This
reproducer exercises ONLY the cuBLAS FP8 forward path with controlled
state, eliminating the cross-gate non-determinism.
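To make the fingerprinting step concrete, here is a minimal sketch of a 64-bit FNV-1a hash over logit bytes. The function names (`fnv1a_64`, `logits_fingerprint`) and the choice to hash each `f32` in little-endian byte order are assumptions for illustration; the reproducer's actual byte layout is not shown in this PR.

```rust
/// 64-bit FNV-1a over a byte slice, using the standard offset basis and prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b); // FNV-1a: XOR the byte first...
        hash = hash.wrapping_mul(PRIME); // ...then multiply (mod 2^64)
    }
    hash
}

/// Hypothetical helper: fingerprint a logit vector by hashing the
/// little-endian bytes of every f32 (assumed layout), rendered as
/// 16 lowercase hex digits like the JSON fields above.
fn logits_fingerprint(logits: &[f32]) -> String {
    let mut bytes = Vec::with_capacity(logits.len() * 4);
    for v in logits {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    format!("{:016x}", fnv1a_64(&bytes))
}
```

Because the hash is taken over raw bytes rather than rounded values, two runs only share a fingerprint when every logit matches bit for bit (up to the negligible chance of a 64-bit collision).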
## What this unlocks
- **Stage B** (per-layer parity) can trust its own outputs because the
full-forward signature is locked.
- **git bisect** can resume with a deterministic oracle (the reproducer's
exit code: 0 = agrees with CPU = bug fixed; 1 = disagrees = bug present).
- **Stage F** acceptance is now objective: gpu_argmax_idx changes from
1057 → 75311 AND agrees_with_cpu flips from false → true, on 5
consecutive runs.
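The exit-code oracle and the Stage F criterion above can be sketched as two small predicates. The `Signature` struct and both function names are hypothetical; only the field names and the 0/1 exit-code convention come from the text.

```rust
/// Subset of the reproducer's JSON fields needed for the checks
/// (hypothetical struct; field names follow the JSON output).
#[derive(Debug, Clone, PartialEq)]
struct Signature {
    cpu_argmax_idx: usize,
    gpu_argmax_idx: usize,
    gpu_logits_fnv1a: String,
    agrees_with_cpu: bool,
}

/// Exit code convention from the text: 0 = agrees with CPU (bug fixed),
/// 1 = disagrees (bug present). This is the shape `git bisect run` expects.
fn exit_code(sig: &Signature) -> i32 {
    if sig.agrees_with_cpu { 0 } else { 1 }
}

/// Stage F style acceptance (sketch): at least `required_runs` runs, every
/// run agrees with the CPU on `expected_idx`, and all runs are identical.
fn stage_f_accepts(runs: &[Signature], expected_idx: usize, required_runs: usize) -> bool {
    runs.len() >= required_runs
        && runs.iter().all(|r| {
            r.agrees_with_cpu
                && r.cpu_argmax_idx == expected_idx
                && r.gpu_argmax_idx == expected_idx
        })
        && runs.windows(2).all(|w| w[0] == w[1])
}
```

With the signature recorded above (`gpu_argmax_idx` 1057, `agrees_with_cpu` false), `exit_code` returns 1 and `stage_f_accepts(&runs, 75311, 5)` is false; both flip only once the GPU path picks token 75311 on all five runs.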
## Files
- `crates/aprender-serve/examples/cublas_fp8_7b_reproducer.rs` (147 LOC)
— minimal main(): load 7B Q4K GGUF, CPU forward, cuBLAS forward,
cosine + linear-fit, FNV-1a fingerprint of logit bytes, single JSON
line on stdout, exit code reflects argmax agreement
- `contracts/cublas-fp8-7b-determinism-v1.yaml` (v1.0.0)
— equation `reproducer_bit_identity` + `signature_locks_the_bug`
— falsifiers FALSIFY-CUBLAS-FP8-DET-{001,002} LIVE-DISCHARGED
— 3 proof obligations
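As a rough illustration of the comparison step listed for the reproducer, the sketch below computes an argmax and a cosine similarity over two logit vectors. Both functions are hypothetical stand-ins, not the code in `cublas_fp8_7b_reproducer.rs`; the linear-fit step is omitted, and whether the reported `correlation` field is exactly this cosine is not stated in the PR.

```rust
/// Index and value of the largest logit, or None for an empty slice.
/// NaN entries are skipped; ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best
}

/// Cosine similarity accumulated in f64; None when the lengths differ,
/// the inputs are empty, or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```

Reporting both numbers matters because they can disagree: in the signature above the vectors are highly similar (`correlation` 0.986986) while the argmax indices still differ (75311 vs 1057).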
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
Auto-merge was automatically disabled May 22, 2026 14:59 (pull request was closed).
noahgift added a commit that referenced this pull request May 22, 2026:
* chore(fmt): cargo fmt --all (release v0.35.0 baseline)
* chore: README drift fix + apr serve syntax (#1873)
* feat(cublas-fp8): Stage A deterministic reproducer (#1884)
* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)
* fix(1864): Golden Output gate must set stop_tokens (#1890)
* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites + regenerate Cargo.lock.

CHANGELOG v0.35.0 entry captures the 81-commit release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line stop_tokens config gap, not a deep cuBLAS FP8 numerical bug; see feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).
## SPEC-CUBLAS-FP8-7B-FIX-001 Stage A
First of 6 stages toward closing #1864.
The 2026-05-22 bisect attempt was invalid because `apr qa` Golden Output is non-deterministic (CUDA context poisoning between gates). This Stage A delivers a minimal standalone reproducer that produces bit-identical output across consecutive runs, so Stages B-F have a deterministic oracle.
### Live discharge (RTX 4090, this host)
5 consecutive invocations of the new `cublas_fp8_7b_reproducer` produced byte-identical JSON. The FNV-1a logit fingerprint is identical across all 5 runs → bytes are bit-identical.
### What this delivers
- `crates/aprender-serve/examples/cublas_fp8_7b_reproducer.rs` (147 LOC)
- `contracts/cublas-fp8-7b-determinism-v1.yaml` (v1.0.0)
### Bug signature (locked at v1.0.0)
| Field | Value |
| --- | --- |
| `cpu_argmax_idx` | 75311 |
| `cpu_argmax_val` | 11.554419 |
| `gpu_argmax_idx` | 1057 |
| `gpu_argmax_val` | 11.132793 |
| `correlation` | 0.986986 |
| `cpu_logits_fnv1a` | `dd5d13626cb48c40` |
| `gpu_logits_fnv1a` | `6748eb76f78f8683` |
| `agrees_with_cpu` | false |

Stage F success criterion: `gpu_argmax_idx` flips to 75311, `agrees_with_cpu` flips to true, on 5 consecutive runs.
### Test plan
- `cargo test -p aprender-contracts --lib lint_passes_on_real_contracts`
- `cargo build --example cublas_fp8_7b_reproducer --release -p aprender-serve --features cuda`
### Next
After this lands → Stage B (per-layer parity instrumentation, `APR_PER_LAYER_PARITY_DUMP=1`) to find the first divergent layer.

🤖 Generated with Claude Code