feat(cublas-fp8-7b-stage-a): deterministic reproducer for #1864 (SPEC Stage A) #1884

Closed
noahgift wants to merge 2 commits into main from feat/cublas-fp8-7b-stage-a-reproducer

@noahgift

SPEC-CUBLAS-FP8-7B-FIX-001 Stage A

First of 6 stages toward closing #1864.

The 2026-05-22 bisect attempt was invalid because the apr qa Golden Output gate is non-deterministic (CUDA context poisoning between gates). Stage A delivers a minimal standalone reproducer that produces bit-identical output across consecutive runs, giving Stages B-F a deterministic oracle.

Live discharge (RTX 4090, this host)

5 consecutive invocations of the new cublas_fp8_7b_reproducer produced byte-identical JSON:

$ for i in 1 2 3 4 5; do
    MODEL_PATH=/home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
        target/release/examples/cublas_fp8_7b_reproducer
  done
{"cpu_argmax_idx":75311,"cpu_argmax_val":11.554419,
 "gpu_argmax_idx":1057,"gpu_argmax_val":11.132793,
 "correlation":0.986986,
 "cpu_logits_fnv1a":"dd5d13626cb48c40",
 "gpu_logits_fnv1a":"6748eb76f78f8683",
 "agrees_with_cpu":false}
(× 5)

The FNV-1a logit fingerprints are identical across all 5 runs, which indicates that the logit bytes are bit-identical.
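As a rough illustration of how such a fingerprint could be computed, here is a minimal sketch of 64-bit FNV-1a over the raw bytes of a logit vector. The helper name `logits_fingerprint` and the little-endian `f32` serialization are assumptions made for this example; the PR does not show how the reproducer lays out the bytes before hashing.

```rust
/// 64-bit FNV-1a over a byte slice (standard offset basis and prime).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b); // FNV-1a: XOR the byte in first...
        hash = hash.wrapping_mul(PRIME); // ...then multiply by the prime.
    }
    hash
}

/// Hypothetical helper: fingerprint a logit vector as 16 lowercase hex digits.
/// ASSUMPTION: each f32 is serialized little-endian before hashing.
fn logits_fingerprint(logits: &[f32]) -> String {
    let mut bytes = Vec::with_capacity(logits.len() * 4);
    for v in logits {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    format!("{:016x}", fnv1a_64(&bytes))
}
```

Because the hash runs over the bit patterns rather than the numeric values, a single flipped bit in any logit changes the fingerprint; for example, `0.0` and `-0.0` hash differently.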

What this delivers

| Artifact | Purpose |
| --- | --- |
| `crates/aprender-serve/examples/cublas_fp8_7b_reproducer.rs` (147 LOC) | Minimal reproducer: load model, CPU + cuBLAS FP8 forward, dump JSON. Exit 0 when GPU agrees with CPU (bug fixed), exit 1 when it disagrees (current state). |
| `contracts/cublas-fp8-7b-determinism-v1.yaml` (v1.0.0) | 2 equations + 2 LIVE_DISCHARGED falsifiers + 3 proof obligations |

Bug signature (locked at v1.0.0)

| Field | Value |
| --- | --- |
| cpu_argmax_idx | 75311 |
| cpu_argmax_val | 11.554419 |
| gpu_argmax_idx | 1057 (wrong, bug present) |
| gpu_argmax_val | 11.132793 |
| correlation | 0.986986 |
| cpu_logits_fnv1a | dd5d13626cb48c40 |
| gpu_logits_fnv1a | 6748eb76f78f8683 |
| agrees_with_cpu | false |

Stage F success criterion: gpu_argmax_idx flips to 75311, agrees_with_cpu flips to true, on 5 consecutive runs.
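The agreement check behind `agrees_with_cpu` and the exit code can be sketched as follows. This is an illustrative sketch, not the reproducer's actual code: the function names `argmax` and `exit_code`, the tie-breaking rule, and the NaN handling are all assumptions.

```rust
/// Index and value of the largest logit, or None for an empty slice.
/// ASSUMPTIONS: NaN entries are skipped and ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        // Strict `>` keeps the earliest index when values tie.
        if best.map_or(true, |(_, bv)| v > bv) {
            best = Some((i, v));
        }
    }
    best
}

/// Hypothetical exit-code mapping: 0 when the CPU and GPU argmax indices
/// agree (bug fixed), 1 otherwise (bug present or no valid logits).
fn exit_code(cpu_logits: &[f32], gpu_logits: &[f32]) -> i32 {
    match (argmax(cpu_logits), argmax(gpu_logits)) {
        (Some((c, _)), Some((g, _))) if c == g => 0,
        _ => 1,
    }
}
```

With the locked signature above, the CPU index 75311 and GPU index 1057 differ, so a check of this shape would return 1.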

Test plan

  • 5 consecutive runs are bit-identical (DET-001 PASS, live)
  • Signature matches the spec lock (DET-002 PASS, live)
  • Contract YAML lint-passes (cargo test -p aprender-contracts --lib lint_passes_on_real_contracts)
  • Example builds clean (cargo build --example cublas_fp8_7b_reproducer --release -p aprender-serve --features cuda)
  • CI: workspace-test, fmt, contracts-lib
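The first item of the test plan (consecutive runs are bit-identical) amounts to comparing every run's output against the first. A minimal sketch, assuming a hypothetical `is_deterministic` helper that takes a closure standing in for one reproducer invocation; in practice that closure might spawn the binary and return its stdout, which is not shown here.

```rust
/// Calls `produce` `runs` times and returns true only if every output is
/// byte-identical to the first one. Zero runs is treated as vacuously true.
/// `produce` is a stand-in for one invocation of the reproducer (assumption).
fn is_deterministic<F>(runs: usize, mut produce: F) -> bool
where
    F: FnMut() -> Vec<u8>,
{
    if runs == 0 {
        return true;
    }
    let first = produce();
    // `all` short-circuits on the first mismatch, so later runs are skipped.
    (1..runs).all(|_| produce() == first)
}
```

Comparing raw bytes rather than parsed JSON keeps the check strict: any change in field order or float formatting counts as a mismatch.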

Next

After this lands → Stage B (per-layer parity instrumentation, APR_PER_LAYER_PARITY_DUMP=1) to find the first divergent layer.

🤖 Generated with Claude Code

…S gibberish

SPEC-CUBLAS-FP8-7B-FIX-001 Stage A: a minimal standalone reproducer that
isolates the cuBLAS FP8 7B Q4K forward step and produces bit-identical
JSON output across consecutive runs. Unblocks Stages B-F, which need a
deterministic oracle.

## Live discharge (2026-05-22, noah-Lambda-Vector RTX 4090)

5 consecutive invocations produced byte-identical JSON:

```
$ for i in 1 2 3 4 5; do
    MODEL_PATH=/home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
        target/release/examples/cublas_fp8_7b_reproducer
  done
{"cpu_argmax_idx":75311,"cpu_argmax_val":11.554419,
 "gpu_argmax_idx":1057,"gpu_argmax_val":11.132793,
 "correlation":0.986986,
 "cpu_logits_fnv1a":"dd5d13626cb48c40",
 "gpu_logits_fnv1a":"6748eb76f78f8683",
 "agrees_with_cpu":false}
(× 5)
```

FNV-1a fingerprints identical across runs → logit bytes are bit-identical.
The 2026-05-22 bisect failed because `apr qa`'s multi-gate sequence
shared a CUDA context that intermittently poisoned between gates. This
reproducer exercises ONLY the cuBLAS FP8 forward path with controlled
state — eliminating the cross-gate non-determinism.

## What this unlocks

- **Stage B** (per-layer parity) can trust its own outputs because the
  full-forward signature is locked.
- **git bisect** can resume with a deterministic oracle (the reproducer's
  exit code: 0 = agrees with CPU = bug fixed; 1 = disagrees = bug present).
- **Stage F** acceptance is now objective: gpu_argmax_idx changes from
  1057 → 75311 AND agrees_with_cpu flips from false → true, on 5
  consecutive runs.

## Files

- `crates/aprender-serve/examples/cublas_fp8_7b_reproducer.rs` (147 LOC)
  — minimal main(): load 7B Q4K GGUF, CPU forward, cuBLAS forward,
    cosine + linear-fit, FNV-1a fingerprint of logit bytes, single JSON
    line on stdout, exit code reflects argmax agreement
- `contracts/cublas-fp8-7b-determinism-v1.yaml` (v1.0.0)
  — equation `reproducer_bit_identity` + `signature_locks_the_bug`
  — falsifiers FALSIFY-CUBLAS-FP8-DET-{001,002} LIVE-DISCHARGED
  — 3 proof obligations
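For reference, the similarity metrics mentioned in the file list can be sketched as below. Which metric backs the reported `correlation` field is not stated here, so both cosine similarity and Pearson correlation are shown; the function names and the `f64` accumulation are assumptions for illustration.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None on length mismatch, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Pearson correlation: cosine similarity of the mean-centered vectors.
/// Returns None when either vector has zero variance.
fn pearson(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let n = a.len() as f64;
    let mean_a = a.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
    let mean_b = b.iter().map(|&y| f64::from(y)).sum::<f64>() / n;
    let (mut dot, mut va, mut vb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (dx, dy) = (f64::from(x) - mean_a, f64::from(y) - mean_b);
        dot += dx * dy;
        va += dx * dx;
        vb += dy * dy;
    }
    if va == 0.0 || vb == 0.0 {
        return None;
    }
    Some(dot / (va.sqrt() * vb.sqrt()))
}
```

A high correlation alongside a different argmax, as in the locked signature, is possible because these metrics summarize the whole logit vector while the argmax depends only on which single entry is largest.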

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Subsumed by #1894 (release PR for v0.35.0). The squash-merge into release/v0.35.0 preserves the per-PR commit message and changes — see PR #1894 commit log. Closing as superseded.

@noahgift noahgift closed this May 22, 2026
auto-merge was automatically disabled May 22, 2026 14:59

Pull request was closed

noahgift added a commit that referenced this pull request May 22, 2026
* chore(fmt): cargo fmt --all (release v0.35.0 baseline)

* chore: README drift fix + apr serve syntax (#1873)

* feat(cublas-fp8): Stage A deterministic reproducer (#1884)

* feat(cublas-fp8): Stage B per-layer parity instrumentation (#1887)

* fix(1864): Golden Output gate must set stop_tokens (#1890)

* chore(release): bump to v0.35.0 + CHANGELOG + README contract count

Workspace 0.34.0 → 0.35.0 across root Cargo.toml + all path-dep callsites
+ regenerate Cargo.lock. CHANGELOG v0.35.0 entry captures the 81-commit
release scope:

1. Distill Phase 1-3 working end-to-end on NVIDIA GB10 Blackwell sm_121
2. MoE (Qwen3) KV cache + streaming SSE + sampling
3. 2026-05-22 dogfood pass: 8 bugs surfaced, 7 fixed. #1864 was a 5-line
   stop_tokens config gap, not a deep cuBLAS FP8 numerical bug — see
   feedback_falsify_simple_before_deep.md

README contract count 1151 → 1153 (post Stage A + Stage B contracts).