Fused Adam Support for MXFP8 + FSDP2 integration #2780
vthumbe1503 wants to merge 8 commits into NVIDIA:main from fused_adam_for_mxfp8
Conversation
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

/te-ci L1 pytorch
Greptile Summary

This PR adds a fused Adam optimizer kernel for MXFP8 (MX block-scaling FP8) model weights, integrating it into the existing FusedAdam optimizer. Key changes:
Issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant PY as FusedAdam.step() (Python)
    participant EXT as adam.cpp (PyTorch ext)
    participant CU as adam.cu (CUDA)
    participant APPLY as multi_tensor_apply_mxfp8
    participant KERNEL as adam_mxfp8_fused_kernel
    PY->>PY: per-param loop — classify params<br/>(Float8/MXFP8/F16/F32)
    PY->>PY: accumulate into p_mxfp8_rowwise,<br/>p_mxfp8_colwise, moments, master_param
    PY->>EXT: multi_tensor_adam_mxfp8(chunk_size,<br/>noop_flag, 8 tensor lists, …, fp8_dtype)
    EXT->>EXT: makeTransformerEngineTensorList()<br/>validate num_lists == 8
    EXT->>CU: nvte_multi_tensor_adam_mxfp8_cuda(…)
    CU->>CU: compute_bias_correction()<br/>check_tensor_list_sizes()<br/>dtype validation
    CU->>APPLY: multi_tensor_apply_mxfp8<kernel>(…)
    loop For each tensor (batched ≤ MXFP8_MAX_TENSORS=24 tensors,<br/>≤ MXFP8_MAX_BLOCKS=320 blocks per launch)
        APPLY->>APPLY: build MXFP8TensorListMetadata<br/>(block_to_tensor, block_to_tile, rows, cols)
        APPLY->>KERNEL: Kernel<<<blocks, 256>>>(tl, β1, β2, ε, lr, …)
        KERNEL->>KERNEL: Stage 4: Adam update → p/m/v (FP32)
        KERNEL->>KERNEL: Stage 5: atomicMaxFloat → row/col amax (shared mem)
        KERNEL->>KERNEL: Stage 6: write rowwise & colwise scale-inv (e8m0)
        KERNEL->>KERNEL: Stage 7: quantise p → MXFP8 rowwise + colwise data
    end
    CU-->>PY: return (master params, moments, MXFP8 data, scales updated in-place)
```
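For reference, Stage 4 in the diagram is the standard FP32 Adam update with precomputed bias corrections. A scalar sketch of the textbook formulas (this mirrors the math, not TE's exact kernel code; the function name and signature are illustrative):

```cpp
#include <cmath>

// Hedged, scalar sketch of Stage 4: the standard Adam update in FP32.
// bias_corr1 = 1 - beta1^t and bias_corr2 = 1 - beta2^t are assumed to be
// computed on the host (compute_bias_correction() in the diagram).
void adam_update(float &p, float &m, float &v, float g,
                 float beta1, float beta2, float eps, float lr,
                 float bias_corr1, float bias_corr2) {
  m = beta1 * m + (1.0f - beta1) * g;      // first-moment EMA
  v = beta2 * v + (1.0f - beta2) * g * g;  // second-moment EMA
  const float m_hat = m / bias_corr1;      // bias-corrected moments
  const float v_hat = v / bias_corr2;
  p -= lr * m_hat / (std::sqrt(v_hat) + eps);
}
```

In the fused kernel, this update runs in FP32 on the master parameters before Stages 5–7 quantise the result to MXFP8.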
Last reviewed commit: "address review comme..."
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Needs more perf tuning for MXFP8; will mark the PR active once the desired perf is achieved.
```cpp
void multi_tensor_apply_mxfp8(int64_t chunk_size, const transformer_engine::Tensor &noop_flag,
                              std::vector<std::vector<transformer_engine::Tensor *>> tensor_lists,
                              uint8_t fp8_dtype, cudaStream_t stream, ArgTypes... args) {
  constexpr size_t kNumTensorLists = 8;
```
Let's move this constant above `struct MXFP8TensorListMetadata` so we can use it to size the array. That way we can replace `void* addresses[8][MXFP8_MAX_TENSORS];` with `void* addresses[kNumTensorLists][MXFP8_MAX_TENSORS];`.
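A minimal sketch of the suggested reordering (the struct's other members are omitted, and the constant values are taken from the PR context):

```cpp
#include <cstddef>

// Hoist the named constant above the metadata struct so the address table
// is sized by kNumTensorLists instead of a bare 8.
constexpr size_t kNumTensorLists = 8;
constexpr size_t MXFP8_MAX_TENSORS = 24;

struct MXFP8TensorListMetadata {
  // was: void *addresses[8][MXFP8_MAX_TENSORS];
  void *addresses[kNumTensorLists][MXFP8_MAX_TENSORS];
  // ... other fields (block_to_tensor, block_to_tile, rows, cols) omitted
};
```

This keeps the struct layout identical while making the "8 tensor lists" invariant a single named constant.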
```cpp
                              uint8_t fp8_dtype, cudaStream_t stream, ArgTypes... args) {
  constexpr size_t kNumTensorLists = 8;
  NVTE_CHECK(tensor_lists.size() == kNumTensorLists,
             "Expected 8 tensor lists for MXFP8, but found ", tensor_lists.size());
```
Here the tensor-list count is hard-coded as 8 in the error message. Let's use `kNumTensorLists` there instead.
```cpp
  const ::transformer_engine::e8m0_t row_biased =
      reinterpret_cast<const ::transformer_engine::e8m0_t &>(row_raw);
  const float row_scale_inv = transformer_engine::ptx::exp2f_rcp(row_biased);
  if (dtype == static_cast<uint8_t>(transformer_engine::DType::kFloat8E4M3)) {
```
To improve performance, it may be worth adding function template parameters (e.g., IType, OType), as we do in other kernels, to avoid runtime branching.
Oh, I see. OType may vary across tensors within a multi-tensor call, so it’s a runtime attribute, right?
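One way to reconcile both points: even if OType is only known at runtime, the branch can be taken once per launch (or per tensor chunk) rather than per element, by mapping the runtime dtype tag to a template parameter. A host-side sketch of that dispatch pattern (the enum and type names are simplified stand-ins, not TE's actual definitions):

```cpp
#include <cstdint>

// Stand-ins for the FP8 output types; real kernels would use
// __nv_fp8_e4m3 / __nv_fp8_e5m2. The pattern, not the types, is the point.
struct fp8e4m3 {};
struct fp8e5m2 {};

enum class DType : uint8_t { kFloat8E4M3, kFloat8E5M2 };

// Example templated helper: FP8 max-norm reciprocals. 448 and 57344 are
// the standard max finite values of E4M3 and E5M2 respectively.
template <typename OType> float max_norm_rcp();
template <> float max_norm_rcp<fp8e4m3>() { return 1.0f / 448.0f; }
template <> float max_norm_rcp<fp8e5m2>() { return 1.0f / 57344.0f; }

// Map the runtime dtype to a compile-time type once, then invoke a
// generic callable with that type; inner-loop branching disappears.
template <typename F>
auto dispatch_fp8(DType dtype, F &&f) {
  switch (dtype) {
    case DType::kFloat8E4M3:
      return f(fp8e4m3{});
    default:
      return f(fp8e5m2{});
  }
}
```

With this shape, the Adam kernel body is instantiated per OType, and the runtime `if (dtype == ...)` moves out of the hot loop.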
```cpp
  return static_cast<FP8_T>(x);
}

__device__ __forceinline__ float fp8_max_norm_rcp(uint8_t fp8_dtype) {
```
Additionally, if we add OType, we can drop this helper and use `transformer_engine::Quantized_Limits<OType>::max_norm_rcp` directly.
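A simplified stand-in for the `Quantized_Limits<OType>` traits idea referenced here (the struct name mirrors TE's, but the definition below is illustrative): with OType as a template parameter, the per-dtype reciprocal becomes a compile-time constant instead of a runtime branch on `fp8_dtype`.

```cpp
// Illustrative FP8 type tags; real code would use the CUDA FP8 types.
struct fp8e4m3 {};
struct fp8e5m2 {};

// Traits sketch: 448 and 57344 are the standard E4M3/E5M2 max finite
// values, so their reciprocals are known at compile time per OType.
template <typename OType> struct Quantized_Limits;
template <> struct Quantized_Limits<fp8e4m3> {
  static constexpr float max_norm = 448.0f;
  static constexpr float max_norm_rcp = 1.0f / 448.0f;
};
template <> struct Quantized_Limits<fp8e5m2> {
  static constexpr float max_norm = 57344.0f;
  static constexpr float max_norm_rcp = 1.0f / 57344.0f;
};
```

Inside a kernel templated on OType, `Quantized_Limits<OType>::max_norm_rcp` then replaces the `fp8_max_norm_rcp(uint8_t)` helper entirely.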
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: