
Add SDPA attention implementation#512

Open
jlamypoirier wants to merge 3 commits into main from jlp_sdpa-attention

Conversation

jlamypoirier (Collaborator) commented on May 7, 2026

Summary

Flash-attn caps at head_size = 256; head_size = 512 models (e.g. Gemma 4's full-attention layers) currently force the backup path, which materializes the full O(S²) attention matrix and OOMs above ~8K context on H100. Add AttentionImplementation.sdpa so those models can train.

The implementation has two CUDA-aware paths sharing the rest of the layer:

  • CUDA, no sliding window: torch.nested.nested_tensor_from_jagged(values, cu_seqlens) + is_causal=True under EFFICIENT. Each document becomes its own batch element, so cross-document attention is excluded by structure rather than by mask. This is the typical training path.
  • CUDA + window and CPU: dense (1, H, total, D) + attn_mask, reusing backup's preprocessed causal+document mask. MATH does not accept nested + is_causal=True, so the mask path is the only viable form on CPU; on CUDA-with-window the mask is needed because nested + is_causal cannot express sliding window. Per a cluster probe, EFFICIENT also engages on CUDA with explicit attn_mask — only ~4 MiB extra over is_causal for the mask itself.

Both paths manually repeat_interleave K/V across query heads because the fused kernels reject broadcasted GQA inputs.
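A minimal sketch of the dispatch and the GQA repeat described above, assuming a packed (total_tokens, heads, head_size) layout and a recent PyTorch with `torch.nn.attention.sdpa_kernel`; names and shapes are illustrative, not the PR's exact code:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def sdpa_varlen(query, key, value, cu_seqlens, attn_mask=None, n_rep=1):
    # query: (total_tokens, n_heads, head_size); key/value: (total_tokens, n_kv_heads, head_size).
    # cu_seqlens: (n_docs + 1,) int64 document offsets into the packed sequence.
    # The fused kernels reject broadcasted GQA inputs, so K/V are repeated explicitly.
    if n_rep > 1:
        key = key.repeat_interleave(n_rep, dim=1)
        value = value.repeat_interleave(n_rep, dim=1)

    if query.is_cuda and attn_mask is None:
        # CUDA, no sliding window: one jagged batch element per document, so
        # cross-document attention is excluded by structure, not by mask.
        q, k, v = (
            torch.nested.nested_tensor_from_jagged(t, offsets=cu_seqlens).transpose(1, 2)
            for t in (query, key, value)
        )
        with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return out.transpose(1, 2).values()  # back to packed (total_tokens, n_heads, head_size)

    # CUDA with a sliding window, and CPU: dense (1, H, total, D) plus the
    # preprocessed causal+document(+window) mask shared with the backup path.
    q, k, v = (t.transpose(0, 1).unsqueeze(0) for t in (query, key, value))
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
    return out.squeeze(0).transpose(0, 1)
```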

Auto-fallback simplifies to flash for bf16/fp16 + head_size ≤ 256 + flash available, otherwise sdpa. SDPA now covers every previously-backup case (CPU, windowed without flash, head_size > 256); backup remains available as an explicit `implementation: backup` option, but the auto path no longer reaches it.
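For illustration, the simplified selection reads roughly like this; the function name and string returns are placeholders for the actual AttentionImplementation members:

```python
import torch

def resolve_auto_implementation(dtype, head_size, flash_available):
    # Placeholder for the simplified auto-fallback: flash only when the dtype
    # and head_size allow it and flash-attn is importable; sdpa for everything
    # else (CPU, sliding window without flash, head_size > 256, fp32, ...).
    if flash_available and dtype in (torch.bfloat16, torch.float16) and head_size <= 256:
        return "flash"
    return "sdpa"
```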

get_preprocessing_config branches by impl: flash needs cu_seqlens + max_seqlens; sdpa-CUDA-no-window needs only cu_seqlens; sdpa-windowed / sdpa-CPU / backup all need document_index (mask is then built in preprocess and shared).
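Roughly, and with illustrative names (the real branching lives in get_preprocessing_config):

```python
def preprocessing_requirements(implementation, cuda, windowed):
    # Illustrative summary of which preprocessed inputs each path consumes.
    if implementation == "flash":
        return {"cu_seqlens", "max_seqlens"}
    if implementation == "sdpa" and cuda and not windowed:
        return {"cu_seqlens"}
    # sdpa with a window, sdpa on CPU, and backup all build the combined
    # causal+document(+window) mask once in preprocess from document_index.
    return {"document_index"}
```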

Tests: SDPA equivalence check parallel to flash via a small _check_packed closure (CUDA bf16); two head_size=320 cases that exercise the SDPA-only regime; windowed cases now exercise SDPA too. Parametrization refactored from _build_test_cases + single-use variant lists into inline for-loops at module level.
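The equivalence closure is roughly of this shape (hedged sketch; the helper name, tolerances, and per-document fp32 SDPA reference are placeholders, not the test file's exact code):

```python
import torch
import torch.nn.functional as F

def _check_packed(attention_fn, q, k, v, cu_seqlens, atol=5e-3, rtol=5e-3):
    # attention_fn: the packed-varlen implementation under test.
    # q/k/v: (total_tokens, n_heads, head_size), K/V already expanded for GQA.
    out = attention_fn(q, k, v, cu_seqlens)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        # Plain causal SDPA on each document in fp32 as the reference.
        ref = F.scaled_dot_product_attention(
            q[start:end].transpose(0, 1).float(),
            k[start:end].transpose(0, 1).float(),
            v[start:end].transpose(0, 1).float(),
            is_causal=True,
        ).transpose(0, 1)
        torch.testing.assert_close(out[start:end].float(), ref, atol=atol, rtol=rtol)
```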

Benchmark — H100 bf16, 20 iterations after 10 warmup; each cell is fwd+bwd wall time (ms) / peak memory (GiB)

Llama-7B-shape (32 heads MHA, head_size=128):

| seq | docs | window | backup | sdpa-mask | sdpa-nested |
|-----|------|--------|--------|-----------|-------------|
| 4K  | 1    | none   | 18.6 / 8.2 | 3.5 / 0.42  | 7.7 / 0.39  |
| 8K  | 1    | none   | 74 / 32.4  | 12.6 / 0.88 | 18.4 / 0.75 |
| 16K | 1    | none   | OOM        | 50 / 2.07   | 60 / 1.57   |
| 16K | 4×4K | none   | OOM        | 50 / 2.07   | 18.6 / 1.57 |
| 16K | 1    | 4K     | OOM        | 50 / 2.07   | n/a         |

Gemma-4 full-attn (16/8 GQA, head_size=512):

| seq | docs | window | backup | sdpa-mask | sdpa-nested |
|-----|------|--------|--------|-----------|-------------|
| 4K  | 1    | none   | 11 / 4.4  | 46 / 0.84  | 31 / 0.81  |
| 8K  | 1    | none   | 42 / 16.7 | 161 / 1.67 | 93 / 1.55  |
| 16K | 1    | none   | OOM       | 615 / 3.61 | 331 / 3.11 |
| 16K | 4×4K | none   | OOM       | 616 / 3.61 | 88 / 3.11  |
| 16K | 1    | 4K     | OOM       | 612 / 3.61 | n/a        |

Multi-document varlen — the typical training case — is where nested+is_causal pulls ahead of mask by 2.6×–7×: nested processes each doc as its own batch element (4 × 4K² of attention work) while mask materializes the full 16K² matrix even though cross-document attention is then masked out. Backup OOMs above ~8K at these widths.

CPU fp32 sdpa-mask vs backup at small shapes (head_size=64, 4 docs of 1024 = 4K total tokens): backup ~567 ms wall, sdpa-mask ~302 ms — ~1.9× faster on the previously-backup CPU path. Not a use case we ship to, but confirms sdpa-mask doesn't regress compared to backup.

Test plan

  • Local pytest -v -n 4 tests/layers/test_attention.py (CPU): 56 passed
  • Cluster pytest -v -n 8 tests/layers/test_attention.py (CUDA): 56 passed; all SDPA equivalence checks run, including windowed cases

🤖 Generated with Claude Code

jlamypoirier and others added 3 commits May 7, 2026 19:12
Flash-attn errors out at head_size > 256, so head_size=512 models
cannot train without materializing the full O(S²) attention matrix
via the backup path.

Add `AttentionImplementation.sdpa` using `torch.nested` to bridge the
packed-varlen layout to SDPA's batched signature, pinning the EFFICIENT
backend. K/V are manually repeat_interleaved to match Q heads because
the fused kernels reject broadcasted GQA inputs.

Auto-fallback: flash when bf16/fp16 + head_size <= 256 + flash is
available; backup for windowed attention (the sdpa path does not
support sliding window); sdpa otherwise.

Tests: SDPA equivalence check parallel to flash, gated on CUDA + bf16;
two head_size=320 cases exercising the SDPA-only regime; refactored
parametrization from `_build_test_cases` plus single-use variant lists
into a few inline for-loops at module level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SDPA path uses `nested_tensor_from_jagged + is_causal=True` which
has no viable backend on CPU (math rejects nested + is_causal; the
fused EFFICIENT/Flash backends are CUDA-only). Auto previously routed
CPU runs through SDPA and they would crash; route them to backup.

Also widens the SDPA branch to fp32 explicitly: the EFFICIENT backend
engages on CUDA across bf16/fp16/fp32, and benchmarking confirms it
beats backup on memory at every length and matches it on time at
seq_len >= 4096 (backup grows quadratically; SDPA stays near constant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt routed CPU and windowed configurations to backup
because the nested + is_causal=True form has no viable backend on CPU
and cannot express sliding window. SDPA actually works fine in those
cases when given an explicit attn_mask: backup's preprocessing already
builds the combined causal+document mask (and threads sliding window
into it), so the SDPA path can reuse it as-is.

CUDA without a window keeps the nested + is_causal path so EFFICIENT
runs without materializing the mask. CUDA with a window and CPU runs
both fall through to dense + attn_mask, which lets MATH engage on CPU
and reuses the windowed mask on CUDA.

Auto-fallback simplifies to flash-or-sdpa: SDPA now covers every case
backup used to (CPU, windowed without flash, head_size > 256).

Verified on H100 bf16 head_size=512 that the dense + attn_mask form
also engages EFFICIENT (peak 323 MiB vs 319 MiB for is_causal — the
4 MiB delta is the mask itself).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jlamypoirier changed the title from "Add SDPA attention for head_size > 256" to "Add SDPA attention implementation" on May 8, 2026