Summary
32B Q4K_M batch inference (--batch-jsonl) crashes on NVIDIA Blackwell GB10 (sm_121, CUDA 13.0). The FP8 cache warmup poisons the CUDA context, causing all subsequent GPU operations to fail with CUDA_ERROR_ILLEGAL_ADDRESS.
Five-Whys
- Why does 32B GPU batch fail? generate_gpu_resident FAILED: Prefill workspace init failed: CUDA_ERROR_ILLEGAL_ADDRESS
- Why ILLEGAL_ADDRESS? FP8 cache warmup writes to invalid memory on sm_121
- Why does FP8 fail on sm_121? FP8 E4M3 kernels not compatible with Blackwell architecture
- Why is FP8 tried? It is on by default: FP8_PREFILL/FP8_DECODE are not disabled for sm_121
- Root cause: Missing architecture detection — sm_121 should auto-disable FP8
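The root cause above can be sketched as a small resolution step that runs before any FP8 warmup. This is only an illustration, not the existing trueno-gpu code: `ComputeCapability`, `Fp8Config`, `FP8_DENYLIST`, and `resolve_fp8` are hypothetical names. It assumes FP8 is enabled by default on other architectures (as the Five-Whys implies) and that `FP8_PREFILL=1` / `FP8_DECODE=1` mean "enable"; only the `=0` form appears in this report.

```rust
/// Compute capability as (major, minor); sm_121 is (12, 1).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ComputeCapability {
    pub major: u32,
    pub minor: u32,
}

/// Which FP8 paths the executor may use (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Fp8Config {
    pub prefill: bool,
    pub decode: bool,
}

/// Architectures where FP8 warmup is known to poison the context.
/// Only sm_121 is listed because it is the only one this report covers.
pub const FP8_DENYLIST: &[ComputeCapability] = &[ComputeCapability { major: 12, minor: 1 }];

/// Parse an env-style flag: "0" disables, "1" enables, anything else is unset.
fn env_flag(value: Option<&str>) -> Option<bool> {
    match value.map(str::trim) {
        Some("0") => Some(false),
        Some("1") => Some(true),
        _ => None,
    }
}

/// Decide FP8 usage from the device architecture and optional env overrides.
/// On a deny-listed architecture FP8 stays off even if the env asks for it,
/// because a failed warmup leaves the whole CUDA context unusable.
pub fn resolve_fp8(
    cc: ComputeCapability,
    prefill_env: Option<&str>,
    decode_env: Option<&str>,
) -> Fp8Config {
    if FP8_DENYLIST.contains(&cc) {
        return Fp8Config { prefill: false, decode: false };
    }
    Fp8Config {
        // Assumed default: FP8 on unless explicitly disabled.
        prefill: env_flag(prefill_env).unwrap_or(true),
        decode: env_flag(decode_env).unwrap_or(true),
    }
}
```

In a real executor the capability would come from the device query (for example the CUDA driver's compute-capability major/minor device attributes), and the env values from `std::env::var("FP8_PREFILL").ok().as_deref()`. Letting the deny-list win over the env override is a design choice of this sketch, not something the report specifies.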
Reproduction
# On NVIDIA GB10 (sm_121):
export SKIP_PARITY_GATE=1
apr run checkpoints/qwen2.5-coder-32b-instruct-q4km.apr --prompt "hello" --max-tokens 5 --json
# Output:
# [PMAT-053] FP8 cache warmup failed (non-fatal): CUDA_ERROR_ILLEGAL_ADDRESS (code: 700)
# [GH-480] generate_gpu_resident FAILED: Prefill workspace init failed: CUDA_ERROR_ILLEGAL_ADDRESS
# [CUDA-FAILFAST] Context poisoned during executor lifetime
Workaround
export SKIP_PARITY_GATE=1 FP8_PREFILL=0 FP8_DECODE=0
This avoids the context poisoning, but the 32B model still does not run: PTX JIT compilation takes 120s+ (64 layers × multiple kernel types), and process managers terminate the run before it completes.
Expected Fix
- Auto-detect sm_121 and disable FP8 in CudaExecutor::new() (no env var needed)
- Add kernel pre-warming phase that survives long JIT compilation times
- Add provable contract gpu_context_health: verify CUDA context is not poisoned after FP8 warmup
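One possible shape for the gpu_context_health contract is a postcondition checked right after warmup. The sketch below is hypothetical throughout: `ContextProbe`, `WarmupOutcome`, `ContextPoisoned`, `FakeContext`, and `warmup_with_health_check` are illustrative names, and `synchronize` stands in for whatever context-synchronize call the executor really uses. It relies on CUDA_ERROR_ILLEGAL_ADDRESS (code 700) being a sticky error, so a later synchronize on the same context keeps returning it. That is why the "non-fatal" label in the PMAT-053 log line is misleading.

```rust
/// Raw CUDA driver error code (700 is CUDA_ERROR_ILLEGAL_ADDRESS).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError(pub u32);

pub const CUDA_ERROR_ILLEGAL_ADDRESS: CudaError = CudaError(700);

/// Minimal view of a CUDA context for health checking (hypothetical trait).
pub trait ContextProbe {
    /// Stand-in for a context synchronize; returns any sticky error.
    fn synchronize(&mut self) -> Result<(), CudaError>;
}

/// Warmup finished and the context is still usable.
#[derive(Debug, PartialEq, Eq)]
pub enum WarmupOutcome {
    /// FP8 caches are warm.
    Fp8Ready,
    /// Warmup failed, but the context is healthy; continue without FP8.
    Fp8Skipped(CudaError),
}

/// The context no longer synchronizes cleanly and must not be reused.
#[derive(Debug, PartialEq, Eq)]
pub struct ContextPoisoned(pub CudaError);

/// Run `warmup`, then enforce the contract: the context must still
/// synchronize without error, whatever the warmup itself returned.
pub fn warmup_with_health_check<P: ContextProbe>(
    probe: &mut P,
    warmup: impl FnOnce(&mut P) -> Result<(), CudaError>,
) -> Result<WarmupOutcome, ContextPoisoned> {
    let warmup_result = warmup(probe);
    // Postcondition: a sticky error here means every later call would fail too.
    probe.synchronize().map_err(ContextPoisoned)?;
    Ok(match warmup_result {
        Ok(()) => WarmupOutcome::Fp8Ready,
        Err(e) => WarmupOutcome::Fp8Skipped(e),
    })
}

/// In-memory fake used to exercise the contract without a GPU.
pub struct FakeContext {
    pub sticky: Option<CudaError>,
}

impl ContextProbe for FakeContext {
    fn synchronize(&mut self) -> Result<(), CudaError> {
        match self.sticky {
            Some(e) => Err(e),
            None => Ok(()),
        }
    }
}
```

With this shape, a warmup that fails cleanly degrades to `Fp8Skipped`, while a warmup that poisons the context surfaces as `ContextPoisoned` at the warmup site instead of later as a prefill workspace failure. The caller can then rebuild the context or abort with a clear message.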
Impact
- 7B GPU batch works (fewer kernels, faster JIT)
- 32B GPU batch fails (64 layers, too many kernels, FP8 poisoning)
- Blocks 32B MBPP eval on GPU (the current 74.40% score includes 18 GPU errors)
Hardware
- NVIDIA GB10, sm_121, CUDA 13.0, 119 GB unified memory
- Driver 580.126.09
- trueno-gpu 0.4.35
Related