Add SDPA attention implementation #512
Open
jlamypoirier wants to merge 3 commits into main from
Conversation
Flash-attn errors out at head_size > 256, so head_size=512 models cannot train without materializing the full O(S²) attention matrix via the backup path. Add `AttentionImplementation.sdpa` using `torch.nested` to bridge the packed-varlen layout to SDPA's batched signature, pinning the EFFICIENT backend. K/V are manually repeat_interleaved to match Q heads because the fused kernels reject broadcasted GQA inputs.

Auto-fallback: flash when bf16/fp16 + head_size <= 256 + flash is available; backup for windowed attention (the sdpa path does not support sliding window); sdpa otherwise.

Tests: SDPA equivalence check parallel to flash, gated on CUDA + bf16; two head_size=320 cases exercising the SDPA-only regime; refactored parametrization from `_build_test_cases` plus single-use variant lists into a few inline for-loops at module level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SDPA path uses `nested_tensor_from_jagged + is_causal=True`, which has no viable backend on CPU (math rejects nested + is_causal; the fused EFFICIENT/Flash backends are CUDA-only). Auto previously routed CPU runs through SDPA and they would crash; route them to backup.

Also widens the SDPA branch to fp32 explicitly: the EFFICIENT backend engages on CUDA across bf16/fp16/fp32, and benchmarking confirms it beats backup on memory at every length and matches it on time at seq_len >= 4096 (backup grows quadratically; SDPA stays near constant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt routed CPU and windowed configurations to backup because the nested + is_causal=True form has no viable backend on CPU and cannot express sliding window. SDPA actually works fine in those cases when given an explicit attn_mask: backup's preprocessing already builds the combined causal+document mask (and threads sliding window into it), so the SDPA path can reuse it as-is.

CUDA without a window keeps the nested + is_causal path so EFFICIENT runs without materializing the mask. CUDA with a window and CPU runs both fall through to dense + attn_mask, which lets MATH engage on CPU and reuses the windowed mask on CUDA. Auto-fallback simplifies to flash-or-sdpa: SDPA now covers every case backup used to (CPU, windowed without flash, head_size > 256).

Verified on H100 bf16 head_size=512 that the dense + attn_mask form also engages EFFICIENT (peak 323 MiB vs 319 MiB for is_causal — the 4 MiB delta is the mask itself).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
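For reference, the peak-memory comparison above can be reproduced with a small helper along these lines (illustrative only; the measurement script is not part of this PR, and `fn` stands in for a forward pass through either SDPA form):

```python
import torch


def peak_mib(fn, *args, **kwargs):
    # Measure peak CUDA allocation (in MiB) of a single call, e.g. one SDPA
    # forward with is_causal=True vs one with an explicit attn_mask.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20
```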
Summary
Flash-attn caps at `head_size = 256`; `head_size = 512` models (e.g. Gemma 4's full-attention layers) currently force the `backup` path, which materializes the full O(S²) attention matrix and OOMs above ~8K context on H100. Add `AttentionImplementation.sdpa` so those models can train.

The implementation has two CUDA-aware paths sharing the rest of the layer (both forms are sketched after the list):
- `torch.nested.nested_tensor_from_jagged(values, cu_seqlens)` + `is_causal=True` under EFFICIENT. Each document becomes its own batch element, so cross-document attention is excluded by structure rather than by mask. This is the typical training path.
- Dense `(1, H, total, D)` + `attn_mask`, reusing backup's preprocessed causal+document mask. MATH does not accept nested + `is_causal=True`, so the mask path is the only viable form on CPU; on CUDA-with-window the mask is needed because nested + `is_causal` cannot express sliding window. Per a cluster probe, EFFICIENT also engages on CUDA with explicit `attn_mask` — only ~4 MiB extra over `is_causal` for the mask itself.
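A minimal sketch of the two call forms, assuming packed `(total_tokens, heads, head_size)` tensors and a `cu_seqlens` offsets vector; the names and helper structure are illustrative, not the exact code in this PR:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.nn.functional import scaled_dot_product_attention as sdpa


def expand_kv(key, value, heads_per_group):
    # Both paths expand K/V to the query head count, since the fused kernels
    # reject broadcasted GQA inputs.
    return (key.repeat_interleave(heads_per_group, dim=1),
            value.repeat_interleave(heads_per_group, dim=1))


def sdpa_nested(query, key, value, cu_seqlens):
    # CUDA, no window: one jagged batch element per document, so is_causal=True
    # alone gives causal masking plus document isolation.
    q, k, v = (
        torch.nested.nested_tensor_from_jagged(t, offsets=cu_seqlens).transpose(1, 2)
        for t in (query, key, value)
    )
    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
        out = sdpa(q, k, v, is_causal=True)
    return out.transpose(1, 2).values()  # back to (total_tokens, heads, head_size)


def sdpa_dense_mask(query, key, value, attn_mask):
    # CPU, or CUDA with a sliding window: one dense (1, H, total, D) batch plus
    # the precomputed causal+document(+window) mask reused from the backup path.
    q, k, v = (t.transpose(0, 1).unsqueeze(0) for t in (query, key, value))
    out = sdpa(q, k, v, attn_mask=attn_mask)
    return out.squeeze(0).transpose(0, 1)
```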
Both paths manually `repeat_interleave` K/V across query heads because the fused kernels reject broadcasted GQA inputs.

Auto-fallback simplifies to `flash` for `bf16`/`fp16` + `head_size ≤ 256` + flash available, otherwise `sdpa`. SDPA now covers every previously-`backup` case (CPU, windowed without flash, `head_size > 256`); `backup` remains as an explicit `implementation: backup` option but the auto path no longer reaches it. `get_preprocessing_config` branches by impl: flash needs `cu_seqlens` + `max_seqlens`; sdpa-CUDA-no-window needs only `cu_seqlens`; sdpa-windowed / sdpa-CPU / backup all need `document_index` (the mask is then built in `preprocess` and shared).
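The fallback and preprocessing branching roughly take this shape (the enum names follow the PR text, but the function signatures and return format are placeholders, not the actual implementation):

```python
import enum
import torch


class AttentionImplementation(enum.Enum):
    flash = "flash"
    sdpa = "sdpa"
    backup = "backup"


def resolve_implementation(dtype, head_size, flash_available):
    if flash_available and dtype in (torch.bfloat16, torch.float16) and head_size <= 256:
        return AttentionImplementation.flash
    # SDPA covers everything backup used to: CPU, windowed, head_size > 256.
    return AttentionImplementation.sdpa


def get_preprocessing_config(impl, use_cuda, window_size):
    # Which preprocessed inputs each implementation asks for.
    if impl == AttentionImplementation.flash:
        return {"cu_seqlens", "max_seqlens"}
    if impl == AttentionImplementation.sdpa and use_cuda and window_size is None:
        return {"cu_seqlens"}        # nested + is_causal path, no mask needed
    return {"document_index"}        # mask built in preprocess and shared
```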
Tests: SDPA equivalence check parallel to flash via a small `_check_packed` closure (CUDA bf16); two `head_size=320` cases that exercise the SDPA-only regime; windowed cases now exercise SDPA too. Parametrization refactored from `_build_test_cases` + single-use variant lists into inline for-loops at module level.
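The module-level for-loop parametrization looks roughly like the following (illustrative only; the real case fields and ids in `tests/layers/test_attention.py` differ):

```python
import pytest

PACKED_CASES = []
for head_size in (64, 128, 320):       # 320 exercises the SDPA-only regime
    for window_size in (None, 512):
        PACKED_CASES.append(
            pytest.param(head_size, window_size, id=f"hs{head_size}-win{window_size}")
        )


@pytest.mark.parametrize(("head_size", "window_size"), PACKED_CASES)
def test_attention_equivalence(head_size, window_size):
    ...  # compare sdpa/flash outputs against the backup reference
```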
Benchmark — H100 bf16, 20 iters after 10 warmup, fwd+bwd wall:

Llama-7B-shape (32 heads MHA, head_size=128):

Gemma-4 full-attn (16/8 GQA, head_size=512):
Multi-document varlen — the typical training case — is where nested+is_causal pulls ahead of mask by 2.6×–7×: nested processes each doc as its own batch element (4×4K² attention work) while mask materializes the full 16K² matrix even though same-doc cross-attention is then masked out. Backup OOMs above ~8K at these widths.
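A back-of-envelope count of attention-score entries for the 4-documents-of-4K case (a count of matrix entries, not a measurement; the observed 2.6×–7× wall-clock gap also folds in kernel and masking overheads):

```python
docs, doc_len = 4, 4096
nested_scores = docs * doc_len ** 2     # each doc attends only within itself
dense_scores = (docs * doc_len) ** 2    # mask path scores the full 16K x 16K matrix
print(nested_scores, dense_scores, dense_scores // nested_scores)
# 67108864 268435456 4  -> d equal docs means d x less score work for nested
```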
CPU fp32 sdpa-mask vs backup at small shapes (head_size=64, 4 docs of 1024 = 4K total tokens): backup ~567 ms wall, sdpa-mask ~302 ms — ~1.9× faster on the previously-backup CPU path. Not a use case we ship to, but confirms sdpa-mask doesn't regress compared to backup.
Test plan
- `pytest -v -n 4 tests/layers/test_attention.py` (CPU): 56 passed
- `pytest -v -n 8 tests/layers/test_attention.py` (CUDA): 56 passed; all SDPA equivalence checks run, including windowed cases

🤖 Generated with Claude Code