32B batch inference crashes on Blackwell sm_121 — FP8 poisons CUDA context

## Summary

Batch inference (`--batch-jsonl`) with the 32B Q4K_M model crashes on NVIDIA Blackwell GB10 (sm_121, CUDA 13.0). The FP8 cache warmup fails with `CUDA_ERROR_ILLEGAL_ADDRESS` (code 700). Although that failure is logged as non-fatal, it poisons the CUDA context, so all subsequent GPU operations fail with the same error.

## Five-Whys

1. **Why does 32B GPU batch fail?** `generate_gpu_resident FAILED: Prefill workspace init failed: CUDA_ERROR_ILLEGAL_ADDRESS`
2. **Why ILLEGAL_ADDRESS?** The FP8 cache warmup writes to invalid memory on sm_121, and the error persists for the rest of the context's lifetime.
3. **Why does FP8 fail on sm_121?** The FP8 E4M3 kernels are not compatible with this Blackwell target (sm_121).
4. **Why is FP8 tried?** `FP8_PREFILL` and `FP8_DECODE` are enabled by default and are not disabled for sm_121.
5. **Root cause:** Missing architecture detection. sm_121 should auto-disable FP8.
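
The root cause in step 5 could be addressed with a small architecture gate. The following is a minimal sketch, not the project's actual code: `ComputeCapability`, `FP8_DENYLIST`, and `fp8_enabled` are hypothetical names, and it assumes that FP8 is on by default and that the value `0` in `FP8_PREFILL`/`FP8_DECODE` turns it off, as the workaround in this issue suggests.

```rust
/// Compute capability as (major, minor), e.g. (12, 1) for sm_121.
/// Hypothetical type for illustration; not part of trueno-gpu.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ComputeCapability {
    pub major: u32,
    pub minor: u32,
}

/// Hypothetical deny-list of architectures where FP8 warmup is known to fault.
/// Only sm_121 is listed because it is the only target this issue reports.
const FP8_DENYLIST: &[ComputeCapability] = &[ComputeCapability { major: 12, minor: 1 }];

/// Decide whether an FP8 path (prefill or decode) should be enabled.
///
/// `env_value` is the raw value of `FP8_PREFILL` or `FP8_DECODE`, if set.
/// Assumption: unset means "enabled", and "0" means "disabled".
pub fn fp8_enabled(cc: ComputeCapability, env_value: Option<&str>) -> bool {
    // A deny-listed architecture always wins, even over an explicit opt-in,
    // because a faulting warmup poisons the whole CUDA context.
    if FP8_DENYLIST.contains(&cc) {
        return false;
    }
    // Otherwise honor the environment: "0" disables, anything else enables.
    env_value.map(|v| v.trim() != "0").unwrap_or(true)
}
```

In this sketch the deny-list takes precedence over the environment variable. Letting an explicit `FP8_PREFILL=1` override it would be an equally valid choice for debugging, at the cost of reintroducing the crash.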

## Reproduction

```bash
# On NVIDIA GB10 (sm_121):
export SKIP_PARITY_GATE=1
apr run checkpoints/qwen2.5-coder-32b-instruct-q4km.apr --prompt "hello" --max-tokens 5 --json

# Output:
# [PMAT-053] FP8 cache warmup failed (non-fatal): CUDA_ERROR_ILLEGAL_ADDRESS (code: 700)
# [GH-480] generate_gpu_resident FAILED: Prefill workspace init failed: CUDA_ERROR_ILLEGAL_ADDRESS
# [CUDA-FAILFAST] Context poisoned during executor lifetime
```

## Workaround

```bash
export SKIP_PARITY_GATE=1 FP8_PREFILL=0 FP8_DECODE=0
```

Even with FP8 disabled, the 32B run still does not complete: PTX JIT compilation takes 120 s or more (64 layers × multiple kernel types), and process managers terminate the run before it finishes.

## Expected Fix

1. Auto-detect sm_121 and disable FP8 in `CudaExecutor::new()` (no env var needed)
2. Add a kernel pre-warming phase that survives long JIT compilation times
3. Add a provable contract, `gpu_context_health`, that verifies the CUDA context is not poisoned after FP8 warmup
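
One possible shape for the `gpu_context_health` check in item 3 is sketched below. It assumes that a poisoned context reports its error again on the next synchronization, which matches the log sequence in the reproduction. The `GpuContext` trait, `WarmupOutcome`, `warmup_with_health_check`, and `FakeContext` are all hypothetical names introduced for illustration; a real implementation would back the trait with the driver calls the executor already makes.

```rust
/// Hypothetical abstraction over the driver calls made during warmup.
pub trait GpuContext {
    /// Run the FP8 cache warmup; `Err` carries the driver error code.
    fn fp8_warmup(&mut self) -> Result<(), u32>;
    /// Wait for queued work to finish, surfacing any persistent error.
    fn synchronize(&mut self) -> Result<(), u32>;
}

/// What the executor should do after attempting FP8 warmup.
#[derive(Debug, PartialEq, Eq)]
pub enum WarmupOutcome {
    /// Warmup succeeded and the context is healthy.
    Fp8Ready,
    /// Warmup failed, but the context still works: continue without FP8.
    Fp8DisabledContextHealthy,
    /// The context reports an error after warmup: it must be recreated.
    ContextPoisoned(u32),
}

/// Attempt FP8 warmup, then probe the context before any further GPU work.
pub fn warmup_with_health_check<C: GpuContext>(ctx: &mut C) -> WarmupOutcome {
    let warmup = ctx.fp8_warmup();
    // Probe regardless of the warmup result, so a "non-fatal" warmup failure
    // cannot silently leave a broken context behind.
    match (warmup, ctx.synchronize()) {
        (Ok(()), Ok(())) => WarmupOutcome::Fp8Ready,
        (Err(_), Ok(())) => WarmupOutcome::Fp8DisabledContextHealthy,
        (_, Err(code)) => WarmupOutcome::ContextPoisoned(code),
    }
}

/// Test double that replays canned results in place of a real device.
pub struct FakeContext {
    pub warmup: Result<(), u32>,
    pub sync: Result<(), u32>,
}

impl GpuContext for FakeContext {
    fn fp8_warmup(&mut self) -> Result<(), u32> {
        self.warmup
    }
    fn synchronize(&mut self) -> Result<(), u32> {
        self.sync
    }
}
```

With this structure, the failure reported here would surface as `ContextPoisoned(700)` immediately after warmup rather than later as a prefill workspace error, and the caller could recreate the context before continuing without FP8.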

## Impact

- 7B GPU batch works (fewer kernels, faster JIT)
- 32B GPU batch fails (64 layers, too many kernels, FP8 poisoning)
- Blocks the 32B MBPP eval on GPU (the current score of 74.40% includes 18 GPU errors)

## Hardware

- NVIDIA GB10, sm_121, CUDA 13.0, 119 GB unified memory
- Driver 580.126.09
- trueno-gpu 0.4.35

## Related

- GH-480: Blackwell sm_121 PTX JIT bug (backward branch patching)
- SKIP_PARITY_GATE: FP rounding parity check bypass for sm_121
