[Common, pyTorch] Grouped MXFP8 dequantize support #2722
ptrendx wants to merge 14 commits into NVIDIA:main
Conversation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Greptile Summary

This PR adds grouped MXFP8 dequantization support, introducing a new Blackwell-only TMA-based CUDA kernel.

Key findings:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant PY as Python caller
    participant EXT as PyTorch extension<br/>(cast.cpp)
    participant CAPI as C API<br/>(cast.cu)
    participant DISP as Dispatcher<br/>(dequantize.cuh)
    participant TMA as update_tma_descriptors<br/>(CUDA kernel)
    participant KERN as group_dequantize_mxfp8_kernel<br/>(CUDA kernel)
    PY->>EXT: group_dequantize(input, otype)
    EXT->>EXT: build input GroupedTensorWrapper<br/>(rowwise/colwise data + scales +<br/>first_dims / last_dims / offsets)
    EXT->>EXT: NoneQuantizer::create_grouped_tensor()<br/>allocate output buffer
    EXT->>CAPI: nvte_group_dequantize(input, output, stream)
    CAPI->>DISP: group_dequantize_helper(input, output, stream)
    DISP->>DISP: check CC ≥ 10.0 (Blackwell)
    DISP->>DISP: mxfp8::group_dequantize(&input, output, stream)
    alt is_single_tensor (SAME_BOTH_DIMS or VARYING_FIRST_DIM)
        DISP->>KERN: launch with static TMA descriptors
    else multi-tensor (VARYING_LAST_DIM or VARYING_BOTH_DIMS)
        DISP->>TMA: update_tma_descriptors<<<num_tensors, 32>>><br/>per-tensor TMA descriptor → g_tensor_maps[]
        TMA-->>DISP: (async, same stream)
        DISP->>KERN: launch with per-tensor g_tensor_maps[]
    end
    KERN->>KERN: fence_acquire_tensormap (if multi-tensor)
    KERN->>KERN: get_current_tensor_id() — binary search on offsets_ptr
    KERN->>KERN: compute scales_base_offset per tensor
    loop ITERATIONS=8 (double-buffered, BUFFER_DIM_Y=16)
        KERN->>KERN: TMA load in_sh[buff] ← global FP8 data
        KERN->>KERN: read e8m0 scale → block_scale = exp2(biased_exp)
        KERN->>KERN: out = block_scale × float(in) → OType
        KERN->>KERN: TMA store out_sh[buff] → global output
    end
    KERN-->>EXT: (async kernel, CUDA stream)
    EXT-->>PY: return output GroupedTensor
```
Last reviewed commit: "Merge remote-trackin..."
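The `get_current_tensor_id()` step in the diagram (a binary search on `offsets_ptr`) can be sketched roughly as below. This is an illustrative host-side reimplementation, not the kernel's actual code; it assumes a CSR-style sentinel array with `num_tensors + 1` entries where the last entry holds the total element count.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch: map a flat element index to the tensor that owns it,
// given CSR-style offsets [0, n0, n0+n1, ..., total]. The search spans the
// full sentinel array offsets[0 .. num_tensors].
int64_t get_current_tensor_id(const int64_t *offsets, int64_t num_tensors,
                              int64_t element_idx) {
  const int64_t *end = offsets + num_tensors + 1;
  // First offset strictly greater than element_idx marks the next tensor;
  // the owning tensor is one position before it.
  return static_cast<int64_t>(std::upper_bound(offsets, end, element_idx) -
                              offsets) - 1;
}
```

For example, with offsets `{0, 4, 6, 12}` (three tensors of 4, 2, and 6 elements), element 5 falls in tensor 1 and element 11 in tensor 2.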
```cpp
nvte_set_grouped_tensor_param(in_group_tensor,
                              NVTEGroupedTensorParam::kNVTEGroupedRowwiseData, &in_data_tensor,
                              sizeof(in_data_tensor));
} else {
```
Incorrect offsets shape — off-by-one
offsets_shape.data[0] is set to num_tensors, but the offsets array is a standard CSR-style sentinel array with num_tensors + 1 entries (the last entry stores the total element count). The allocation uses (num_tensors + 1) * sizeof(size_t) on line 132 and offsets_h is declared with num_tensors + 1 on line 408. get_current_tensor_id (borrowed from the quantize path) searches over offsets_ptr[0 .. num_tensors], so it will access one element past the declared shape.
```cpp
} else {
  offsets_shape.data[0] = num_tensors + 1;
```
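To make the sentinel layout behind this fix concrete, here is a small hedged sketch (the helper name is made up): an offsets array built from per-tensor element counts always has one more entry than there are tensors, with the final entry equal to the total element count, so declaring its shape as `num_tensors` cuts off exactly that last entry.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative helper: build a CSR-style offsets array from per-tensor
// element counts. The result has sizes.size() + 1 entries; the last entry
// is the total element count across all tensors.
std::vector<int64_t> build_offsets(const std::vector<int64_t> &sizes) {
  std::vector<int64_t> offsets(sizes.size() + 1, 0);
  std::partial_sum(sizes.begin(), sizes.end(), offsets.begin() + 1);
  return offsets;
}
```

For counts `{4, 2, 6}` this yields `{0, 4, 6, 12}`: four entries for three tensors, which is why the declared shape needs `num_tensors + 1`.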
```cpp
size_t *first_dims_d;
size_t *last_dims_d;
size_t *offsets_d;

cudaMalloc((void **)&in_data_d, in_data_size);
cudaMalloc((void **)&out_grouped_d, out_data_size);
cudaMalloc((void **)&in_scales_d, scales_size);
cudaMalloc((void **)&first_dims_d, first_dims_size);
cudaMalloc((void **)&last_dims_d, last_dims_size);
cudaMalloc((void **)&offsets_d, offsets_size);

cudaMemcpy(in_data_d, in_data_h.data(), in_data_size, cudaMemcpyHostToDevice);
cudaMemcpy(in_scales_d, in_scales_h.data(), scales_size, cudaMemcpyHostToDevice);
cudaMemcpy(first_dims_d, first_dims_h.data(), first_dims_size, cudaMemcpyHostToDevice);
cudaMemcpy(last_dims_d, last_dims_h.data(), last_dims_size, cudaMemcpyHostToDevice);
cudaMemcpy(offsets_d, offsets_h.data(), offsets_size, cudaMemcpyHostToDevice);
```
size_t * device pointers declared as kNVTEInt64
first_dims_d, last_dims_d, and offsets_d are allocated as size_t * (unsigned 64-bit), but passed to NVTEBasicTensor with type kNVTEInt64 (signed 64-bit). The kernel then reinterpret-casts them to int64_t *. While this works on 64-bit Linux platforms where the layouts are identical and values are non-negative, it is technically undefined behavior and will silently miscompute on any platform where sizeof(size_t) != sizeof(int64_t).
The pointers should be declared as int64_t * to match the declared tensor type, keeping the host-side std::vector<size_t> but performing an explicit cast when copying:
```cpp
int64_t *first_dims_d;
int64_t *last_dims_d;
int64_t *offsets_d;
```

And correspondingly cast the host-side vectors on copy, or change the host vectors to `std::vector<int64_t>` as well.
```cpp
TensorWrapper output_w;
output_w.set_rowwise_data(single_out_d, otype, single_shape);

nvte_dequantize(input_w.data(), output_w.data(), 0);
cudaDeviceSynchronize();
err = cudaGetLastError();
ASSERT_EQ(err, cudaSuccess) << "Single-tensor dequantize failed for tensor " << t << ": "
                            << cudaGetErrorString(err);

// Copy reference output to host
cudaMemcpy(out_ref_h.data() + data_offset, single_out_d, single_out_size,
           cudaMemcpyDeviceToHost);

cudaFree(single_in_d);
cudaFree(single_out_d);
cudaFree(single_scales_d);
}
```
Device memory leak on early assertion failures
When ASSERT_EQ(err, cudaSuccess) (line 292) fires and terminates the test early, the per-iteration device allocations (single_in_d, single_out_d, single_scales_d) are leaked. Furthermore, the outer-scope allocations (in_data_d, out_grouped_d, in_scales_d, etc.) are also never freed because ASSERT_* macros in GTest cause an early return, bypassing the cudaFree calls at the end of performTest.
Consider wrapping device allocations in RAII handles (e.g., a small CudaPtr<T> wrapper) or using EXPECT_EQ followed by explicit cleanup so memory is always released, even in failure paths.
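One way to realize the RAII suggestion is a small owning handle whose destructor always frees, so GTest's early return on `ASSERT_*` failure cannot leak the allocation. The sketch below uses `std::malloc`/`std::free` so it stands alone; in the actual test the same pattern would wrap `cudaMalloc`/`cudaFree` instead. The class name is hypothetical.

```cpp
#include <cstdlib>

// Illustrative RAII handle: the allocation is released in the destructor,
// so an early return (e.g. from a failing ASSERT_EQ) cannot leak it.
// Swap the malloc/free pair for cudaMalloc/cudaFree to manage device
// memory the same way.
template <typename T>
class ScopedPtr {
 public:
  explicit ScopedPtr(size_t bytes)
      : ptr_(static_cast<T *>(std::malloc(bytes))) {}
  ~ScopedPtr() { std::free(ptr_); }
  // Non-copyable: exactly one owner per allocation.
  ScopedPtr(const ScopedPtr &) = delete;
  ScopedPtr &operator=(const ScopedPtr &) = delete;
  T *get() const { return ptr_; }

 private:
  T *ptr_;
};
```

With such a wrapper, `single_in_d` and friends would be declared once per iteration and freed automatically on every exit path, failing or not.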
```cpp
// Create output GroupedTensor using NoneQuantizer.
NoneQuantizer q{py::none()};
auto [out_cpp, out_py] = q.create_grouped_tensor(num_tensors, logical_shape, otype, py::none(),
                                                 first_dims, logical_first_dim, logical_last_dim);

NVTE_SCOPED_GIL_RELEASE({
  nvte_group_dequantize(input_cpp.data(), out_cpp.data(), at::cuda::getCurrentCUDAStream());
});
```
last_dims and tensor_offsets not propagated to the output GroupedTensor
NoneQuantizer::create_grouped_tensor always sets kwargs["last_dims"] = py::none() and computes tensor_offsets only from first_dims (via build_grouped_tensor_offsets). This means:
- For `VARYING_LAST_DIM` inputs (where `first_dims` is `None`): the output's `tensor_offsets` will be `None`, making it impossible for downstream code to locate individual tensors in the output buffer.
- For `VARYING_BOTH_DIMS` inputs (where `first_dims` is present but `last_dims` also varies): `build_grouped_tensor_offsets` computes offsets as `first_dims[i] * logical_last_dim`, where `logical_last_dim` is the total element count, producing completely wrong per-tensor byte boundaries on the output object.
The input's last_dims and tensor_offsets tensors are extracted at lines 279 and 278 but are only used to populate input_cpp; they are never forwarded to the output. Downstream consumers that need to split or index into the output GroupedTensor will silently get wrong results.
Consider either passing last_dims and tensor_offsets through to the output, or documenting clearly that callers are responsible for preserving this metadata from the input.
```cpp
ptx::cp_async_bulk_wait_group_read<0>();
__syncthreads();
```
cp_async_bulk_wait_group_read<0>() called by all threads without is_master_thread guard
All 128 threads execute ptx::cp_async_bulk_wait_group_read<0>(), but only the master thread ever issued TMA bulk-store operations. Non-master threads have no outstanding bulk-copy groups, so wait_group<0> is effectively a no-op for them. While harmless today (the PTX instruction is well-defined with zero outstanding groups), this is inconsistent with the guarded pattern used everywhere else in the kernel (e.g., lines 315–324) and may confuse future readers.
```diff
-ptx::cp_async_bulk_wait_group_read<0>();
-__syncthreads();
+if (is_master_thread) {
+  ptx::cp_async_bulk_wait_group_read<0>();
+}
+__syncthreads();
```
Description
Support dequantization for MXFP8 grouped tensors.
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: