[PyTorch] Add pad_between_seqs support for non-CP and CP (A2A and P2P) with FA3 + THD (varlen) #2596
sudhakarsingh27 wants to merge 23 commits into NVIDIA:main from flash_attn_pad_bw_seqs

Conversation
Greptile Summary
This PR adds …

Confidence Score: 4/5
The core FA3 + THD + pad_between_seqs logic is correct for all intended code paths. The main residual risk is a crash when FA3 is chosen for bshd+padding+pad_between_seqs=True because cu_seqlens_q_padded is None in that branch (already flagged in earlier review comments). The previously identified issues with FA4 not being disabled and wrong cu_seqlens being passed to the FA3 memory-layout arg have both been fixed in this PR. The A2A and P2P CP paths correctly guard seqused derivation with qkv_format == "thd" checks, and the backward gradient init with zeros_like is correct. The one unresolved concern is in the non-CP FA3 path: when pad_between_seqs=True is combined with bshd+padding mask and FA3, cu_seqlens_q_padded is None, which would be passed as FA3's cu_seqlens arg — a confirmed crash. This scenario is already discussed in earlier review comments and no current caller triggers it, but the code lacks a defensive guard.

Important Files Changed:
- transformer_engine/pytorch/attention/dot_product_attention/backends.py (non-CP FA3 seqused block lacks qkv_format guard)
- tests/pytorch/attention/test_attention_with_cp.py (new batch dispatch architecture)
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["FlashAttention.forward\n(backends.py)"] --> B{context_parallel?}
    B -- Yes --> C["attn_forward_func_with_cp\npasses cu_seqlens_q_padded\n+ pad_between_seqs"]
    B -- No --> D{use_flash_attn_3?}
    D -- Yes --> E["fa_optional_forward_args_thd:\ncu_seqlens_q_padded (layout)\n+\nseqused_q = cu_seqlens_q diff\n(actual counts)"]
    D -- No/FA2 --> F["fa_optional_forward_args_thd:\ncu_seqlens_q (no seqused)"]
    C --> G{cp_comm_type}
    G -- p2p --> H["cp_p2p_fwd_flash_attn\nDerive seqused from cu_seqlens_per_step\nOverride cu_seqlens to padded\n(with section half-padding)"]
    G -- a2a --> I["AttnFuncWithCPAndQKVOA2A\nDerive seqused from cu_seqlens\nOverride cu_seqlens to padded\nGuard: qkv_format==thd"]
    H --> J["get_fa_args(seqused_q, seqused_k)\nflash_attn_varlen_func_v3"]
    I --> J
    E --> J
    J --> K["FA3 kernel\nUses padded layout for memory\nUses seqused to skip padding tokens"]
    K --> L["Backward:\nzeros_like for dq/dk/dv\nFA3 only writes to valid positions"]
```
Reviews (38). Last reviewed commit: "Fix parallel CP test port conflicts and ..."
```python
# if `pad_between_seqs` is True, provide flash_attn_3 with `seqused_q` and `seqused_k`
# in addition to `cu_seqlens_q_padded` and `cu_seqlens_kv_padded` to avoid affecting the
# padding positions.
if pad_between_seqs:
    fa_3_optional_forward_kwargs["seqused_q"] = cu_seqlens_q[1:] - cu_seqlens_q[:-1]
    fa_3_optional_forward_kwargs["seqused_k"] = cu_seqlens_kv[1:] - cu_seqlens_kv[:-1]
```
style: verify that flash_attn_3 with seqused_q/seqused_k truly avoids writing to padding positions - the related issue #2391 mentions "we need to manually set the output of the padded positions to zero" (similar to how FusedAttention zeroes output in C++ for THD format). if flash_attn_3 doesn't zero these internally, output may have garbage values in padded positions. have you verified that flash_attn_3 correctly handles padding internally with these parameters?
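For reference, a minimal standalone sketch (toy shapes, not TE or FA3 code) of the manual zeroing that issue #2391 describes for FusedAttention: build a validity mask from the padded offsets and the real per-sequence counts, then zero everything outside it.

```python
import torch

# Padded THD layout: 2 sequences, 4 slots each; 3 and 2 valid tokens respectively.
cu_seqlens_padded = torch.tensor([0, 4, 8])
seqused = torch.tensor([3, 2])
out = torch.randn(8, 2, 16)  # [total_tokens, heads, head_dim]; padding rows may hold garbage

# Mark tokens that fall inside the valid prefix of their sequence.
token_idx = torch.arange(out.shape[0])
valid = torch.zeros(out.shape[0], dtype=torch.bool)
for b in range(seqused.numel()):
    start = int(cu_seqlens_padded[b])
    valid |= (token_idx >= start) & (token_idx < start + int(seqused[b]))

out = out * valid.view(-1, 1, 1)  # explicitly zero the padded positions
```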
Force-pushed ea51821 to e338049
/te-ci pytorch L2
```python
pad_between_seqs = False
if qkv_format == "thd" and cu_seqlens_q_padded is not None:
    pad_between_seqs = not torch.equal(cu_seqlens_q_padded, cu_seqlens_q)
```
Can pad_between_seqs be decided ahead of time, passed by the user or something? This wouldn't be CUDA Graph-compatible right?
This pattern exists in dpa.py as well. But yes, it's definitely redundant here
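A small sketch of the reviewer's suggestion (hypothetical call site, not current TE code): since `torch.equal` returns a Python bool and forces a host sync, the flag could be computed once up front and passed in, rather than re-derived inside every forward (which also bakes the branch into any CUDA graph capture).

```python
import torch

# Decided once, ahead of time (e.g. by the caller that builds cu_seqlens):
cu_seqlens_q        = torch.tensor([0, 3, 8],  dtype=torch.int32)
cu_seqlens_q_padded = torch.tensor([0, 6, 12], dtype=torch.int32)

pad_between_seqs = cu_seqlens_q_padded is not None and not torch.equal(
    cu_seqlens_q_padded, cu_seqlens_q
)
# ... then pass pad_between_seqs explicitly into the attention call instead of
# recomputing the comparison on every step.
```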
/te-ci pytorch L1

/te-ci pytorch L3
Force-pushed b0a3c64 to 057f406
/te-ci pytorch L3

1 similar comment

/te-ci pytorch L3
Force-pushed 00bdc92 to 0f48ebc
```python
if not FlashAttentionUtils.v3_is_installed:
    pytest.skip("pad_between_seqs with CP requires Flash Attention v3!")
if cp_comm_type == "a2a+p2p":
    pytest.skip("pad_between_seqs is not yet supported with A2A+P2P CP comm type!")
```
```python
if pad_between_seqs:
    dq, dk, dv = [torch.zeros_like(x) for x in [q_part, k_part, v_part]]
else:
    dq, dk, dv = [torch.empty_like(x) for x in [q_part, k_part, v_part]]
```
Just to confirm, we can't do this for fwd, right? Because fwd output is not allocated by us.
It's a limitation in Flash Attention code - forward never mutates out (so pre-zeroing is overwritten), backward treats dq/dk/dv as in-place mutable (so pre-zeroing sticks). Also this zeroing out works only for CP code where we can provide the args.
None of the zeroing works for non-CP path because we only have the forward call in TE.
FA3 / Hopper (hopper/flash_attn_interface.py):
- Forward: mutates_args=(), namespace flash_attn_3::_flash_attn_forward
- Backward: mutates_args=("dq", "dk", "dv"), namespace flash_attn_3::_flash_attn_backward
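A self-contained sketch (toy stand-in kernel, assumed shapes) of why `zeros_like` matters for the backward buffers: an op declared with mutates_args writes in place and only touches the valid tokens, so padding rows keep whatever the buffer was initialized with.

```python
import torch

seqused = torch.tensor([3, 2])               # valid tokens per sequence
cu_seqlens_padded = torch.tensor([0, 4, 8])  # padded layout: 4 slots per sequence

def fake_backward(dq):
    # Stand-in for a backward with mutates_args=("dq", ...): writes valid rows in place.
    for b in range(seqused.numel()):
        start = int(cu_seqlens_padded[b])
        dq[start : start + int(seqused[b])] = 1.0

dq = torch.zeros(8, 2, 4)      # zeros_like-style init: padding rows stay exactly zero
fake_backward(dq)

dq_bad = torch.empty(8, 2, 4)  # empty_like-style init: padding rows keep uninitialized garbage
fake_backward(dq_bad)
```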
/te-ci pytorch L3
Add support for padding between sequences (pad_between_seqs) in the FlashAttention 3 backend when used with context parallelism (CP). Key changes:
- backends.py: Pass fa_pad_between_seqs through to FA3 forward/backward
- context_parallel.py: Handle pad_between_seqs in A2A and P2P CP paths, zero FA3 padding garbage in CP forward, fix a2a backward alignment
- dot_product_attention.py: Auto-detect pad_between_seqs from cu_seqlens
- utils.py: Gate FA3 deterministic backward for hdim>=256, fix flash_attn_supported override for cross-attention and large head_dim, disable UnfusedDotProductAttention for pad_between_seqs, add SM100+ FA3 skip

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Add test parametrization for pad_between_seqs in flash attention tests. Update run_attention_with_cp.py to support the new parameter and fix batch boundary alignment in the non-CP FA3 path. Run tests in parallel when multiple GPUs are available. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Add deterministic CP test runs to L3 FA versions test. Support TE_PATH positional arg and fix GPU threshold for parallel test execution. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…raint
The previous check disabled FA3 for deterministic mode whenever head_dim_qk > 128, which was overly conservative — FA3 forward supports deterministic execution at any head dim. The actual constraint from flash_api.cpp is that the backward pass does not support deterministic mode when max(head_size, head_size_v) >= 256. Narrow the gate to only disable FA3 during training (backward) and raise the threshold to >= 256, checking both head_dim_qk and head_dim_v to handle MLA configs with asymmetric head dimensions. Ref: https://github.com/Dao-AILab/flash-attention/blob/ac6f2eb5/hopper/flash_api.cpp#L1370 Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
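A minimal sketch (assumed function and variable names) of the narrowed gate this commit describes:

```python
def fa3_deterministic_ok(is_training: bool, head_dim_qk: int, head_dim_v: int) -> bool:
    # FA3 forward is deterministic at any head dim; only the backward pass rejects
    # deterministic mode when max(head_size, head_size_v) >= 256.
    if is_training and max(head_dim_qk, head_dim_v) >= 256:
        return False
    return True

assert fa3_deterministic_ok(is_training=False, head_dim_qk=192, head_dim_v=256)     # inference: allowed
assert not fa3_deterministic_ok(is_training=True, head_dim_qk=192, head_dim_v=256)  # backward: gated
```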
Force-pushed 9c01601 to 4745f98
The pad_between_seqs gate in get_attention_backend only disabled FlashAttention 2, letting FA4 leak through to the test-time fused-vs-flash comparison. On B200 runners that install flash-attn-4, this caused test_dpa_qkv_layout_thd to compare FusedAttention against an FA4 output whose padded positions contain garbage, producing 48 numerics failures in L3_pytorch_FA_versions_test--B200_1GPU. The log message already claimed FA4 would be disabled — this change makes the code match the message: set use_flash_attention_4 = False alongside use_flash_attention_2 when pad_between_seqs is True. FA3 continues to support pad_between_seqs via seqused_q/seqused_k. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
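Sketched with the flag names quoted in the commit message (not a verbatim excerpt of get_attention_backend):

```python
pad_between_seqs = True
use_flash_attention_2 = True
use_flash_attention_4 = True

# With padding between sequences, only FA3 can skip the padded positions (via
# seqused_q/seqused_k), so both FA2 and FA4 have to be ruled out; previously only
# FA2 was, letting FA4 leak through on B200 runners.
if pad_between_seqs:
    use_flash_attention_2 = False
    use_flash_attention_4 = False
```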
pad_between_seqs support for non-CP and CP (A2A and P2P) with FA3 + THD (varlen)
…_attn_pad_bw_seqs
/te-ci pytorch L3
FA4 install brings in nvidia-cutlass-dsl, whose `import cutlass` adds cutlass/base_dsl/ to sys.path. That directory contains a utils/ package that shadows tests/pytorch/utils.py, breaking collection of test_attention_with_cp.py with: ImportError: cannot import name 'ModelConfig' from 'utils' Prepend $TE_PATH/tests/pytorch to PYTHONPATH so the local utils.py is always resolved first, regardless of what FA4 dependencies install. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…_attn_pad_bw_seqs
…s it's a known cudnn issue Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…_attn_pad_bw_seqs
for more information, see https://pre-commit.ci
…ransformerEngine into flash_attn_pad_bw_seqs
/te-ci pytorch L3
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
/te-ci pytorch L3

/te-ci pytorch L3
…_attn_pad_bw_seqs
…ransformerEngine into flash_attn_pad_bw_seqs
/te-ci pytorch L3
PR 2596 added deterministic CP runs to the L3 FA-versions matrix, multiplying
CP wall time across every FA version and causing CI timeouts (pipeline
50243000). Run CP tests once per arch instead, picking the FA version each
arch's CP code path actually supports:
- sm90 (H100): FA3 3.0.0b1 - context_parallel.py is FA3-only on Hopper
(use_flash_attn_3 threaded throughout, FA4
not wired in; pad_between_seqs gated on
use_flash_attn_3 at lines 1038, 1366)
- sm>90 (B200): latest FA4 - FA3 is not built/installed for sm>90
Non-CP test_attention.py still runs for every FA version in the array.
Also drop FA 2.7.3 from the sm90 list (no longer maintained as a target)
and bump the FA4 pin from 4.0.0b8 to 4.0.0b11. b8 has an SM90 backward
kernel bug fixed by upstream PR NVIDIA#2513 in b11
(get_smem_store_C() got multiple values for argument 'transpose').
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Three follow-ups on top of 13ba004 (L3 per-arch CP gating):

1. Skip the inline FA3 source build when flash_attn_interface is already importable. This makes the script a no-op on FA3 install when the base image has FA3 baked in (companion to TE !573 on te_ci, which auto-sets INSTALL_FA3=${RUN_L3_TESTS} so FA3 is preinstalled for L3 pipelines). Saves ~20 min of L3 H100 wall time once both land. Falls back to the existing inline build when FA3 is not pre-installed.
2. Suffix junit XMLs with the FA version (pytest_test_attention_fa2_8_3.xml etc.) so per-iteration results are preserved instead of overwritten. Pipeline 50348672 had no per-FA timing visibility because pytest.xml was clobbered by each loop iteration.
3. Include FA version in test_fail messages so CI dashboards show which FA iteration caused a failure (was "test_attention.py", now "test_attention.py (FA 2.8.3)").

Also fold the CP_FA_VERSION assignment into the same if-block as FA_versions (was a separate if-block immediately after) since the two are arch-keyed in lockstep.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…_attn_pad_bw_seqs
/te-ci pytorch L3
…_attn_pad_bw_seqs
Port CP test batching from sudhakars/cp_test_batching_pr (PR NVIDIA#2965). Groups parametrized configs into batches of CP_TEST_BATCH_SIZE (default 16) and runs each batch in a single torchrun invocation, amortizing the ~9s NCCL init overhead across configs instead of paying it per test. This is a temporary commit to validate batching under CI on the flash_attn_pad_bw_seqs branch — intended to be reverted after the run. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
/te-ci pytorch L3
Two fixes on top of the batching port (ca8b383):

1. test.sh: assign distinct MASTER_PORT (29500 / 29501) to the two parallel pytest sessions so their torchrun batches don't collide. Without this, both sessions inherit the same MASTER_PORT and the second one fails with EADDRINUSE on every batch.
2. Restore the deterministic THD OOM skip that the batching PR dropped when it flattened the `if deterministic:` block. Without it, 5 fused-attention THD configs OOM on sm90 under NVTE_ALLOW_NONDETERMINISTIC_ALGO=0.

Validated: 8×H100, parallel non-det (38 passed) + det (26 passed, 5 THD OOM correctly skipped), zero EADDRINUSE.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
/te-ci pytorch L3
Description
TLDR
Enable `pad_between_seqs=True` for FlashAttention 3 with THD format — both for context parallelism (A2A and P2P comm types) and non-CP paths. Previously `pad_between_seqs` was only supported with FusedAttention.

Problem

When using THD format with variable-length sequences, sequences are padded for divisibility across CP ranks. With `pad_between_seqs=True`, the attention kernel needs to know actual (unpadded) token counts so it doesn't compute attention over padding tokens. FusedAttention already handled this via `cu_seqlens_q_padded`, but FlashAttention (both FA2 and FA3) had `pad_between_seqs` hardcoded to `False` in the CP path, and FA2 was entirely disabled for `pad_between_seqs + thd`. FA3 can natively handle this via its `seqused_q`/`seqused_k` mechanism.

Solution

Use FA3's `seqused_q`/`seqused_k` tensors to communicate actual token counts per batch element. Pass `cu_seqlens_q_padded` for tensor memory layout while deriving `seqused_q = cu_seqlens_q[1:] - cu_seqlens_q[:-1]` from the real `cu_seqlens`. This applies to both the CP path (A2A and P2P) and the non-CP path.

Fixes #2399
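A standalone sketch (example numbers, not the TE call path) of how the two sets of offsets are used:

```python
import torch

# Two sequences with 3 and 5 real tokens, each padded to 6 slots in the THD buffer.
cu_seqlens_q        = torch.tensor([0, 3, 8],  dtype=torch.int32)   # actual token offsets
cu_seqlens_q_padded = torch.tensor([0, 6, 12], dtype=torch.int32)   # padded memory layout

# FA3 addresses memory with the padded offsets but only attends over the first
# seqused_q[i] tokens of each sequence, skipping the padding in between.
seqused_q = cu_seqlens_q[1:] - cu_seqlens_q[:-1]   # tensor([3, 5])
```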
Type of change
Changes
Please list the changes introduced in this PR:
context_parallel.py

- `get_fa_args()`: Add `seqused_q`/`seqused_k` parameters, pass through to FA3 forward and backward positional arg lists (replacing hardcoded `None`s).
- `cp_p2p_fwd_flash_attn()`/`cp_p2p_bwd_flash_attn()`: Accept `pad_between_seqs`, `cu_seqlens_q_padded`, `cu_seqlens_kv_padded`. When enabled, derive `seqused` tensors and override `cu_seqlens` to padded versions (with half-padding for lower-triangle/upper-triangle sections).
- `AttnFuncWithCPAndKVP2P`: Thread `pad_between_seqs` and padded cu_seqlens through all forward/backward `cp_p2p_fwd/bwd_flash_attn` call sites. Save `ctx.pad_between_seqs` for backward.
- `AttnFuncWithCPAndQKVOA2A.forward()`: Add `pad_between_seqs` parameter. When enabled with FA3+THD, derive `seqused` and swap `cu_seqlens` for padded versions before calling `get_fa_args()`.
- `AttnFuncWithCPAndQKVOA2A.backward()`: Same seqused/cu_seqlens override. Use `zeros_like` (not `empty_like`) for gradient init when `pad_between_seqs` since FA3 skips padding positions. Add extra `None` in return tuple for the new `pad_between_seqs` gradient slot.
- `attn_forward_func_with_cp()`: Pass `pad_between_seqs` in A2A args list.

backends.py

- `FlashAttention.forward()`: Accept `cu_seqlens_q_padded`/`cu_seqlens_kv_padded`. Detect `pad_between_seqs` by comparing padded vs actual cu_seqlens. Pass padded cu_seqlens to CP path. For non-CP FA3 path, derive and pass `seqused_q`/`seqused_k`.

dot_product_attention.py

- Pass `cu_seqlens_q_padded`/`cu_seqlens_kv_padded` through to `FlashAttention`.

utils.py

- No longer disable FlashAttention for `pad_between_seqs + thd`; FA3 handles this natively via `seqused`.

test_attention_with_cp.py

- Add `@pytest.mark.parametrize("pad_between_seqs", [False, True])` to flash attention CP tests.
- Skip `pad_between_seqs=True` for non-THD formats, when FA3 is not installed, and for `a2a+p2p` comm type (not yet supported).

run_attention_with_cp.py

- Pass `pad_between_seqs` through `generate_input_shapes()` and `run_dpa_with_cp()`.
- When `pad_between_seqs`, set `cu_seqlens_q` to actual lengths (not just for FusedAttention).
- `nan_to_num(nan=0.0)`.

test_attention.py

- Use padded inputs in `_run_dot_product_attention()` (previously FlashAttention used original unpadded inputs).
- Pass `cu_seqlens_q_padded`/`cu_seqlens_kv_padded` and `pad_between_seqs` to the DPA call for the FlashAttention backend.
- Add `pad_between_seqs=True` to parametrize with skip for non-THD formats.

New Tests
CP tests (`test_attention_with_cp.py`)

Added `@pytest.mark.parametrize("pad_between_seqs", [False, True])` to `test_cp_with_flash_attention`. Skip conditions: non-THD formats, FA3 not installed, `a2a+p2p` comm type.

5 new tests that run (all `pad_between_seqs=True`, thd, bf16):
- `True-p2p-thd-cp_1_0-bf16`
- `True-p2p-thd-cp_2_1-bf16`
- `True-a2a-thd-cp_1_0-bf16`
- `True-a2a-thd-cp_1_2-bf16`
- `True-a2a-thd-cp_2_1-bf16`

Non-CP tests (`test_attention.py`)

Added `True` to `@pytest.mark.parametrize("pad_between_seqs", [False, True])` on `test_dot_product_attention`, with skip for non-THD. Also changed `_run_dot_product_attention` so FlashAttention uses padded inputs/cu_seqlens and receives `pad_between_seqs=True`.

48 new test IDs collected, but all are skipped because the main parametrize uses `qkv_layout=None` (defaults to sbhd, not thd). The non-CP `pad_between_seqs` + FA3 code path is exercised indirectly when other test functions call `test_dot_product_attention` with `qkv_layout="thd_thd_thd"` (e.g., `test_dpa_softmax_thd`).

Checklist: