Add SDPA attention implementation #512
Open
jlamypoirier wants to merge 3 commits into main from
Conversation
Flash-attn errors out at head_size > 256, so head_size=512 models cannot train without materializing the full O(S²) attention matrix via the backup path. Add `AttentionImplementation.sdpa` using `torch.nested` to bridge the packed-varlen layout to SDPA's batched signature, pinning the EFFICIENT backend. K/V are manually repeat_interleaved to match Q heads because the fused kernels reject broadcasted GQA inputs.

Auto-fallback: flash when bf16/fp16 + head_size <= 256 + flash is available; backup for windowed attention (the sdpa path does not support sliding window); sdpa otherwise.

Tests: SDPA equivalence check parallel to flash, gated on CUDA + bf16; two head_size=320 cases exercising the SDPA-only regime; refactored parametrization from `_build_test_cases` plus single-use variant lists into a few inline for-loops at module level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SDPA path uses `nested_tensor_from_jagged + is_causal=True`, which has no viable backend on CPU (math rejects nested + is_causal; the fused EFFICIENT/Flash backends are CUDA-only). Auto previously routed CPU runs through SDPA and they would crash; route them to backup.

Also widens the SDPA branch to fp32 explicitly: the EFFICIENT backend engages on CUDA across bf16/fp16/fp32, and benchmarking confirms it beats backup on memory at every length and matches it on time at seq_len >= 4096 (backup grows quadratically; SDPA stays near constant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt routed CPU and windowed configurations to backup because the nested + is_causal=True form has no viable backend on CPU and cannot express sliding window. SDPA actually works fine in those cases when given an explicit attn_mask: backup's preprocessing already builds the combined causal+document mask (and threads sliding window into it), so the SDPA path can reuse it as-is.

CUDA without a window keeps the nested + is_causal path so EFFICIENT runs without materializing the mask. CUDA with a window and CPU runs both fall through to dense + attn_mask, which lets MATH engage on CPU and reuses the windowed mask on CUDA. Auto-fallback simplifies to flash-or-sdpa: SDPA now covers every case backup used to (CPU, windowed without flash, head_size > 256).

Verified on H100 bf16 head_size=512 that the dense + attn_mask form also engages EFFICIENT (peak 323 MiB vs 319 MiB for is_causal — the 4 MiB delta is the mask itself).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
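For reference, the peak-memory comparison above can be reproduced with a small helper along these lines (illustrative only; the measurement script is not part of this PR, and `fn` stands in for a forward pass through either SDPA form):

```python
import torch


def peak_mib(fn, *args, **kwargs):
    # Measure peak CUDA allocation (in MiB) of a single call, e.g. one SDPA
    # forward with is_causal=True vs one with an explicit attn_mask.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20
```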
Summary
Flash-attn caps at `head_size = 256`; `head_size = 512` models (e.g. Gemma 4's full-attention layers) currently force the `backup` path, which materializes the full O(S²) attention matrix and OOMs above ~8K context on H100. Add `AttentionImplementation.sdpa` so those models can train.

The implementation has two CUDA-aware paths sharing the rest of the layer (both forms are sketched after the list):
- `torch.nested.nested_tensor_from_jagged(values, cu_seqlens)` + `is_causal=True` under EFFICIENT. Each document becomes its own batch element, so cross-document attention is excluded by structure rather than by mask. This is the typical training path.
- Dense `(1, H, total, D)` + `attn_mask`, reusing backup's preprocessed causal+document mask. MATH does not accept nested + `is_causal=True`, so the mask path is the only viable form on CPU; on CUDA-with-window the mask is needed because nested + `is_causal` cannot express sliding window. Per a cluster probe, EFFICIENT also engages on CUDA with explicit `attn_mask` — only ~4 MiB extra over `is_causal` for the mask itself.
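A minimal sketch of the two call forms, assuming packed `(total_tokens, heads, head_size)` tensors and a `cu_seqlens` offsets vector; the names and helper structure are illustrative, not the exact code in this PR:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.nn.functional import scaled_dot_product_attention as sdpa


def expand_kv(key, value, heads_per_group):
    # Both paths expand K/V to the query head count, since the fused kernels
    # reject broadcasted GQA inputs.
    return (key.repeat_interleave(heads_per_group, dim=1),
            value.repeat_interleave(heads_per_group, dim=1))


def sdpa_nested(query, key, value, cu_seqlens):
    # CUDA, no window: one jagged batch element per document, so is_causal=True
    # alone gives causal masking plus document isolation.
    q, k, v = (
        torch.nested.nested_tensor_from_jagged(t, offsets=cu_seqlens).transpose(1, 2)
        for t in (query, key, value)
    )
    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
        out = sdpa(q, k, v, is_causal=True)
    return out.transpose(1, 2).values()  # back to (total_tokens, heads, head_size)


def sdpa_dense_mask(query, key, value, attn_mask):
    # CPU, or CUDA with a sliding window: one dense (1, H, total, D) batch plus
    # the precomputed causal+document(+window) mask reused from the backup path.
    q, k, v = (t.transpose(0, 1).unsqueeze(0) for t in (query, key, value))
    out = sdpa(q, k, v, attn_mask=attn_mask)
    return out.squeeze(0).transpose(0, 1)
```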
Both paths manually `repeat_interleave` K/V across query heads because the fused kernels reject broadcasted GQA inputs.

Auto-fallback simplifies to `flash` for `bf16`/`fp16` + `head_size ≤ 256` + flash available, otherwise `sdpa`. SDPA now covers every previously-`backup` case (CPU, windowed without flash, `head_size > 256`); `backup` remains as an explicit `implementation: backup` option but the auto path no longer reaches it. `get_preprocessing_config` branches by impl: flash needs `cu_seqlens` + `max_seqlens`; sdpa-CUDA-no-window needs only `cu_seqlens`; sdpa-windowed / sdpa-CPU / backup all need `document_index` (the mask is then built in `preprocess` and shared).
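The fallback and preprocessing branching roughly take this shape (the enum names follow the PR text, but the function signatures and return format are placeholders, not the actual implementation):

```python
import enum
import torch


class AttentionImplementation(enum.Enum):
    flash = "flash"
    sdpa = "sdpa"
    backup = "backup"


def resolve_implementation(dtype, head_size, flash_available):
    if flash_available and dtype in (torch.bfloat16, torch.float16) and head_size <= 256:
        return AttentionImplementation.flash
    # SDPA covers everything backup used to: CPU, windowed, head_size > 256.
    return AttentionImplementation.sdpa


def get_preprocessing_config(impl, use_cuda, window_size):
    # Which preprocessed inputs each implementation asks for.
    if impl == AttentionImplementation.flash:
        return {"cu_seqlens", "max_seqlens"}
    if impl == AttentionImplementation.sdpa and use_cuda and window_size is None:
        return {"cu_seqlens"}        # nested + is_causal path, no mask needed
    return {"document_index"}        # mask built in preprocess and shared
```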
Tests: SDPA equivalence check parallel to flash via a small `_check_packed` closure (CUDA bf16); two `head_size=320` cases that exercise the SDPA-only regime; windowed cases now exercise SDPA too. Parametrization refactored from `_build_test_cases` + single-use variant lists into inline for-loops at module level.
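The module-level for-loop parametrization looks roughly like the following (illustrative only; the real case fields and ids in `tests/layers/test_attention.py` differ):

```python
import pytest

PACKED_CASES = []
for head_size in (64, 128, 320):       # 320 exercises the SDPA-only regime
    for window_size in (None, 512):
        PACKED_CASES.append(
            pytest.param(head_size, window_size, id=f"hs{head_size}-win{window_size}")
        )


@pytest.mark.parametrize(("head_size", "window_size"), PACKED_CASES)
def test_attention_equivalence(head_size, window_size):
    ...  # compare sdpa/flash outputs against the backup reference
```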
Benchmark — H100 bf16, 20 iters after 10 warmup, fwd+bwd wall:

Llama-7B-shape (32 heads MHA, head_size=128):

Gemma-4 full-attn (16/8 GQA, head_size=512):
Multi-document varlen — the typical training case — is where nested+is_causal pulls ahead of mask by 2.6×–7×: nested processes each doc as its own batch element (4×4K² attention work) while mask materializes the full 16K² matrix even though same-doc cross-attention is then masked out. Backup OOMs above ~8K at these widths.
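A back-of-envelope count of attention-score entries for the 4-documents-of-4K case (a count of matrix entries, not a measurement; the observed 2.6×–7× wall-clock gap also folds in kernel and masking overheads):

```python
docs, doc_len = 4, 4096
nested_scores = docs * doc_len ** 2     # each doc attends only within itself
dense_scores = (docs * doc_len) ** 2    # mask path scores the full 16K x 16K matrix
print(nested_scores, dense_scores, dense_scores // nested_scores)
# 67108864 268435456 4  -> d equal docs means d x less score work for nested
```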
CPU fp32 sdpa-mask vs backup at small shapes (head_size=64, 4 docs of 1024 = 4K total tokens): backup ~567 ms wall, sdpa-mask ~302 ms — ~1.9× faster on the previously-backup CPU path. Not a use case we ship to, but confirms sdpa-mask doesn't regress compared to backup.
Test plan
- `pytest -v -n 4 tests/layers/test_attention.py` (CPU): 56 passed
- `pytest -v -n 8 tests/layers/test_attention.py` (CUDA): 56 passed; all SDPA equivalence checks run, including windowed cases

🤖 Generated with Claude Code