Feature/unswizzle by int-smart · Pull Request #2732 · NVIDIA/TransformerEngine

int-smart · 2026-03-04T05:09:04Z

Description

This PR adds unswizzle support for scaling factors and extends the swizzle module so scaling tensors can be converted from GEMM-swizzled layout back to compact layout, including multi-tensor paths. It also adds round-trip and standalone tests to validate unswizzle correctness.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Added unswizzle APIs and implementation in transformer_engine/common/swizzle/swizzle.cu and declarations in transformer_engine/common/include/transformer_engine/swizzle.h
Added multi-tensor unswizzle support with swizzle-like validation assumptions (homogeneous scaling mode/layout, swizzled input and compact output expectations)
Refactored multi-tensor unswizzle launch/kernels to mirror swizzle structure (split row-wise and column-wise kernels) for easier readability
Added/extended tests in tests/cpp/operator/test_swizzle.cu, including standalone unswizzle and swizzle→unswizzle round-trip coverage

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

- Introduced `nvte_unswizzle_scaling_factors` to convert swizzled scaling factors back to row-major format.
- Implemented `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` for unshuffling operations in CUDA kernels.
- Added `unswizzle_row_scaling_kernel_impl` and `unswizzle_col_scaling_kernel_impl` for handling unswizzling in row and column scaling respectively.

These changes enhance the functionality of the swizzle module, enabling better handling of scaling factors in tensor operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

These enhancements test the changes introduced for unswizzling.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

- Introduced `compute_ref_unswizzle` to handle the conversion of swizzled scaling factors back to their original format.
- Added `performTestUnswizzle1D` to validate the unswizzling process with various scaling modes.
- Created `UnswizzleTestSuite` for comprehensive testing of unswizzling operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

- Moved the definition of `swizzle_row_scaling_kernel` to a new location for better organization.
- Ensured the kernel implementation is now properly defined and accessible for scaling operations in the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

- Introduced `multi_tensor_unswizzle_scaling_factors` to convert swizzled scaling factors back to their original row-major format.
- Implemented CUDA kernels for unswizzling in both row and column scaling, enhancing the swizzle module's functionality.
- Updated the launch function to handle multiple tensor unswizzling operations efficiently.

These changes improve the handling of scaling factors in tensor operations, ensuring better performance and organization within the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-03-04T05:17:43Z

Greptile Summary

This PR adds nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors to convert GEMM-swizzled scaling factors back to the compact row-major layout, mirroring the existing swizzle APIs. New GPU kernels (unswizzle_row_scaling_kernel_impl, unswizzle_col_scaling_kernel_impl) invert the tile-shuffle logic used by their swizzle counterparts, and the multi-tensor variants re-use the MultiSwizzleArgs batching infrastructure. A round-trip test and standalone unswizzle tests are added to test_swizzle.cu, with coverage of padded (non-aligned) shapes.

Issues found:

Missing has_data() guard for input in multi-tensor columnwise path (multi_tensor_unswizzle_scaling_factors): The columnwise block accesses input[i]->columnwise_scale_inv.shape and .dptr without first calling NVTE_CHECK(input[i]->columnwise_scale_inv.has_data(), ...), unlike the rowwise path which does guard. A caller with a mis-configured tensor will hit a null-pointer dereference or silent garbage kernel launch.
num_tensors derived from output.size() instead of input.size(): If the internal function is called directly with input.size() < output.size(), iterating to output.size() will OOB-index input[i]. The symmetric multi_tensor_swizzle_scaling_factors uses input.size().
Dual-scale tensors rejected in single-tensor unswizzle_scaling_factors: The NVTE_CHECK(!has_rowwise || !has_columnwise) guard prevents unswizzling tensors with both scale types, even though the implementation already has independent rowwise and columnwise kernel paths. The same restriction exists in swizzle_scaling_factors, meaning the public round-trip (swizzle → unswizzle) will throw for any dual-path tensor.
Leftover copy-paste comment at line 1225: // Example for NVFP4 rowwise path: was transcribed verbatim from a reviewer's code suggestion and should be removed.

Confidence Score: 2/5

Not safe to merge — several correctness issues remain in the multi-tensor unswizzle path and the single-tensor API rejects valid dual-scale tensors.
The kernel logic itself (register shuffle/unshuffle, SLM tiling) appears to be the correct inverse of the swizzle kernels and the test suite validates it for both aligned and padded shapes. However, the multi-tensor host-side validation has a missing has_data() check that will produce null-pointer UB for real-world inputs, num_tensors is taken from the wrong vector (potential OOB), and the single-tensor API still has an artificial restriction that breaks the swizzle→unswizzle round-trip for dual-scale tensors — a common production configuration.
transformer_engine/common/swizzle/swizzle.cu — specifically the multi_tensor_unswizzle_scaling_factors host-side validation loop and the NVTE_CHECK(!has_rowwise || !has_columnwise) guard in unswizzle_scaling_factors.

Important Files Changed

Filename	Overview
transformer_engine/common/swizzle/swizzle.cu	Core implementation file: adds `unswizzle_scaling_factors`, `multi_tensor_unswizzle_scaling_factors`, and supporting GPU kernels (`unswizzle_row_scaling_kernel_impl`, `unswizzle_col_scaling_kernel_impl`, `multi_tensor_unswizzle_row/col_scaling_kernel`, `launch_multi_tensor_unswizzle_scaling_factors`). Several issues remain: `num_tensors` derived from `output.size()` instead of `input.size()` (potential OOB), missing `has_data()` guard for input columnwise factors in the multi-tensor columnwise path, a leftover copy-paste comment from a reviewer suggestion, and the single-tensor path still rejects dual-scale tensors despite having independent kernel paths for each scale type.
transformer_engine/common/include/transformer_engine/swizzle.h	Public API header: adds `nvte_unswizzle_scaling_factors` and `nvte_multi_tensor_unswizzle_scaling_factors` declarations with doc-comments. Signatures, parameter documentation, and requirements are complete and consistent with the swizzle counterparts.
tests/cpp/operator/test_swizzle.cu	Test file: adds `compute_ref_unswizzle`, `performTestUnswizzle1D`, `performTestSwizzleUnswizzleRoundtrip`, and their GTest instantiations. The new test functions correctly guard against uninitialized `SF_MODE_X`/`SF_MODE_Y` (the UB in the pre-existing `performTestSwizzle1D` is not introduced here). Padded test shapes are included and the roundtrip test zero-fills the padded region before comparing. No new critical issues found.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["nvte_unswizzle_scaling_factors(input, output, stream)"] --> B["unswizzle_scaling_factors()"]
    B --> C{scaling_mode?}
    C -->|MXFP8| D{has_rowwise?}
    C -->|NVFP4| E{has_rowwise?}
    D -->|yes| F["m,k ← output.scale_inv.shape\nInput size check\nrowwise=true"]
    D -->|no| G["m,k ← output.col_scale_inv.shape\nInput size check\nrowwise=false"]
    E -->|yes| H["m,k ← output.scale_inv.shape\nrowwise=true"]
    E -->|no| I["m,k ← output.col_scale_inv.shape\nrowwise=true"]
    F --> J["NVTE_CHECK m%128==0 & k%4==0"]
    G --> J
    H --> J
    I --> J
    J --> K["launch_unswizzle(vec_size, grid, slm)"]
    K --> L{rowwise?}
    L -->|yes| M["unswizzle_row_scaling_kernel_impl\n(load swizzled tiles → SLM\n→ regs_unshuffle → compact output)"]
    L -->|no| N["unswizzle_col_scaling_kernel_impl\n(load swizzled tiles → SLM\n→ regs_unshuffle_with_bit_shifts\n→ col-major compact output)"]

    O["nvte_multi_tensor_unswizzle_scaling_factors(inputs, outputs, n, stream)"] --> P["multi_tensor_unswizzle_scaling_factors()"]
    P --> Q{rowwise_unswizzle?}
    Q -->|yes| R["batch tensors into MultiSwizzleArgs\nmulti_tensor_unswizzle_row_scaling_kernel"]
    P --> S{columnwise_unswizzle?}
    S -->|yes| T["batch tensors into MultiSwizzleArgs\nmulti_tensor_unswizzle_col_scaling_kernel"]

Last reviewed commit: "Typo"

ptrendx · 2026-03-11T00:14:10Z

@int-smart Please address the comments from Greptile and ideally also add the test case with the input not already padded to 128,128.

int-smart · 2026-03-12T02:43:38Z

@ptrendx Will look into these

int-smart · 2026-03-12T23:15:36Z

@ptrendx From what I understand, padding is not relevant to the unswizzle kernel. Since the padding is already applied during the swizzle operation, I can just mirror the swizzled tensor back to the compact layout, with the zero pads landing in the right places in the compact layout. Is that assumption correct? Initially I was thinking of removing the padding from scale_inv itself, since this would be used for checkpointing.

- Updated unswizzling kernel implementations to remove original_M and original_K parameters, simplifying the function signatures.
- Enhanced test suite to utilize new unswizzling data shapes, ensuring comprehensive coverage of aligned and padded cases.

These changes improve the clarity and efficiency of the unswizzling process in the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

ptrendx · 2026-03-16T17:46:34Z

@int-smart I'm not sure I follow. I think what you are saying is probably correct, but let me try to clarify just in case:

The scaling factors, irrespective of the compact or GEMM-ready layout, are zero-padded to a multiple of [128,4] (or the transpose in the case of compact and columnwise).
So for the unswizzle, the output unswizzled tensor should just have the same size as the original swizzled one. You don't even need to zero it before unswizzling: the swizzled tensor already has 0s in the right places, so unswizzling it will put 0s in the pad positions.
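
A minimal host-side sketch of the point above, in C++. The index mapping (each 128x4 tile stored as 32 rows of 16 bytes) is an assumption made for illustration, and `swizzle_index`, `make_padded`, `swizzle_ref`, and `unswizzle_ref` are hypothetical helpers, not functions from this PR. It suggests that unswizzling into a same-sized buffer restores the compact layout, zero padding included, even when the output is not zeroed first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTileM = 128;  // scale rows per tile
constexpr std::size_t kTileK = 4;    // scale columns per tile

// Assumed layout: each 128x4 tile is stored as 32 rows of 16 bytes.
// K is the padded scale width; m indexes rows of a buffer padded to 128.
std::size_t swizzle_index(std::size_t m, std::size_t k, std::size_t K) {
  assert(K % kTileK == 0 && k < K);
  const std::size_t tile_base =
      (m / kTileM) * kTileM * K + (k / kTileK) * kTileM * kTileK;
  const std::size_t m_in = m % kTileM;
  const std::size_t k_in = k % kTileK;
  return tile_base + (m_in % 32) * 16 + (m_in / 32) * kTileK + k_in;
}

// Compact row-major buffer: valid_m x valid_k nonzero scales, zeros elsewhere.
std::vector<std::uint8_t> make_padded(std::size_t valid_m, std::size_t valid_k,
                                      std::size_t M, std::size_t K) {
  std::vector<std::uint8_t> buf(M * K, 0);
  for (std::size_t m = 0; m < valid_m; ++m)
    for (std::size_t k = 0; k < valid_k; ++k)
      buf[m * K + k] = static_cast<std::uint8_t>(1 + (m * 31 + k * 7) % 255);
  return buf;
}

std::vector<std::uint8_t> swizzle_ref(const std::vector<std::uint8_t>& compact,
                                      std::size_t M, std::size_t K) {
  std::vector<std::uint8_t> out(M * K, 0);
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t k = 0; k < K; ++k)
      out[swizzle_index(m, k, K)] = compact[m * K + k];
  return out;
}

std::vector<std::uint8_t> unswizzle_ref(const std::vector<std::uint8_t>& swizzled,
                                        std::size_t M, std::size_t K) {
  // Deliberately filled with 0xFF: every element is overwritten, so the
  // zero padding comes from the swizzled input, not from pre-zeroing.
  std::vector<std::uint8_t> out(M * K, 0xFF);
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t k = 0; k < K; ++k)
      out[m * K + k] = swizzled[swizzle_index(m, k, K)];
  return out;
}
```

Because the mapping is a bijection over the padded buffer, the round trip holds for any padded shape under this assumed layout; the real kernels would need to be checked against their own layout.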

int-smart · 2026-03-16T17:48:00Z

@ptrendx Makes sense. I added that in the last commit.

ptrendx · 2026-03-16T22:41:59Z

transformer_engine/common/swizzle/swizzle.cu

+  switch (scaling_mode) {
+    case NVTE_MXFP8_1D_SCALING:
+      NVTE_CHECK(is_fp8_dtype(input->dtype()), "Input tensor has invalid dtype (expected FP8, got ",
+                 to_string(input->dtype()), ").");
+      break;
+    case NVTE_NVFP4_1D_SCALING:
+      NVTE_CHECK(is_fp4_dtype(input->dtype()), "Input tensor has invalid dtype (expected FP4, got ",
+                 to_string(input->dtype()), ").");
+      break;
+    default:
+      NVTE_ERROR("Invalid scaling mode");
+  }
+
+  const bool has_rowwise_scale_inv = input->scale_inv.has_data();
+  const bool has_columnwise_scale_inv = input->columnwise_scale_inv.has_data();
+  NVTE_CHECK(!has_rowwise_scale_inv || !has_columnwise_scale_inv,
+             "Input tensor has both row-wise and column-wise scaling factors");
+  if (!has_rowwise_scale_inv && !has_columnwise_scale_inv) {
+    return;
+  }
+
+  int m{0}, k{0};
+  switch (scaling_mode) {
+    case NVTE_MXFP8_1D_SCALING: {
+      if (has_rowwise_scale_inv) {
+        NVTE_CHECK(input->scale_inv.shape.size() == 2,
+                   "Expected 2D scaling factors, got shape=", input->scale_inv.shape, ".");
+        m = input->scale_inv.shape[0];
+        k = input->scale_inv.shape[1];
+      } else if (has_columnwise_scale_inv) {
+        NVTE_CHECK(input->columnwise_scale_inv.shape.size() == 2,
+                   "Expected 2D scaling factors, got shape=", input->columnwise_scale_inv.shape,
+                   ".");
+        m = input->columnwise_scale_inv.shape[1];
+        k = input->columnwise_scale_inv.shape[0];
+      }
+      break;
+    }
+    case NVTE_NVFP4_1D_SCALING: {
+      if (has_rowwise_scale_inv) {
+        NVTE_CHECK(input->scale_inv.shape.size() == 2,
+                   "Expected 2D scaling factors, got shape=", input->scale_inv.shape, ".");
+        m = input->scale_inv.shape[0];
+        k = input->scale_inv.shape[1];
+      } else if (has_columnwise_scale_inv) {
+        NVTE_CHECK(input->columnwise_scale_inv.shape.size() == 2,
+                   "Expected 2D scaling factors, got shape=", input->columnwise_scale_inv.shape,
+                   ".");
+        m = input->columnwise_scale_inv.shape[0];
+        k = input->columnwise_scale_inv.shape[1];
+      }
+      break;
+    }
+    default:
+      NVTE_ERROR("Invalid scaling mode");
+  }
+
+  constexpr int SF_TILE_DIM_M = 128;
+  constexpr int SF_TILE_DIM_K = 4;
+  NVTE_CHECK(m % SF_TILE_DIM_M == 0, "Input should be padded in M/N dimension!");
+  NVTE_CHECK(k % SF_TILE_DIM_K == 0, "Input should be padded in K dimension!");
+
+  if (has_rowwise_scale_inv) {
+    NVTE_CHECK(output->scale_inv.has_data(),
+               "Output tensor does not have row-wise scaling factors.");
+  }
+  if (has_columnwise_scale_inv) {
+    NVTE_CHECK(output->columnwise_scale_inv.has_data(),
+               "Output tensor does not have column-wise scaling factors.");
+  }
+
+  bool rowwise_unswizzle{false}, columnwise_unswizzle{false};
+  switch (scaling_mode) {
+    case NVTE_MXFP8_1D_SCALING: {
+      rowwise_unswizzle = has_rowwise_scale_inv;
+      columnwise_unswizzle = has_columnwise_scale_inv;
+      break;
+    }
+    case NVTE_NVFP4_1D_SCALING: {
+      rowwise_unswizzle = true;
+      columnwise_unswizzle = false;
+      break;
+    }
+    default:
+      NVTE_ERROR("Invalid scaling mode");
+  }
+
+  const dim3 block_size(TB_DIM, TB_DIM);
+  const int num_tiles_m = m / SF_TILE_DIM_M;
+  const int num_tiles_k = k / SF_TILE_DIM_K;
+


The code is pretty convoluted here and it doesn't have to be. Some pieces could be done at the beginning without looking at the scaling factor (like checking whether the input has scale_inv/columnwise_scale_inv and whether the output has them too). For the rest, I would say that avoiding code duplication is not worth breaking the flow of the NVFP4/MXFP8-specific logic, so I would probably just have one larger switch with 2 completely separate code paths rather than multiple switch statements.

ptrendx · 2026-03-16T22:50:03Z

transformer_engine/common/swizzle/swizzle.cu

+  if (has_rowwise_scale_inv) {
+    NVTE_CHECK(output->scale_inv.has_data(),
+               "Output tensor does not have row-wise scaling factors.");
+  }
+  if (has_columnwise_scale_inv) {
+    NVTE_CHECK(output->columnwise_scale_inv.has_data(),
+               "Output tensor does not have column-wise scaling factors.");
+  }


I would say that the logic here is a little backwards, even though I understand how here it is not obvious. Ultimately it is the output that tells you what to do in the function - think about the quantize function where the input does not know anything about the format to which it is quantized and it is the output that controls scaling mode and whether we need rowwise or columnwise quantization. Therefore here I would also treat the output as a "source of truth" on what we need to do and then check that the input tensor provides the right data (as opposed to this code which looks to input to know what to do and then checks the output).

Changed this for the single-tensor version. Let me know if that makes sense. Can you tell me how this would be called, so that I can check how the input and output are allocated? Currently I am assuming, from your comment above, that the output has all the information needed to decide between rowwise and columnwise, the scaling_mode, and the data pointers, along with dimensions such as m and k. If this is fine, I can make these changes to the multi-tensor version as well.

Yes, the changes look good. Please update the multitensor version accordingly.
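
The "output as source of truth" idea from this thread could be sketched roughly as follows. `ScaleBuffer`, `TensorView`, `UnswizzlePlan`, `make_view`, and `plan_unswizzle` are hypothetical stand-ins invented for this example, not the TransformerEngine types; the sketch only shows the order of the checks (read the request from the output, then validate the input against it).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; a buffer with numel == 0 counts as "no data".
struct ScaleBuffer {
  std::size_t numel;
  bool has_data() const { return numel > 0; }
};

struct TensorView {
  ScaleBuffer scale_inv;
  ScaleBuffer columnwise_scale_inv;
};

struct UnswizzlePlan {
  bool ok;
  bool rowwise;
  bool columnwise;
  const char* error;
};

TensorView make_view(std::size_t rowwise_numel, std::size_t columnwise_numel) {
  return TensorView{{rowwise_numel}, {columnwise_numel}};
}

UnswizzlePlan plan_unswizzle(const TensorView& input, const TensorView& output) {
  UnswizzlePlan plan{true, false, false, ""};
  // Step 1: the output decides which layouts are requested.
  plan.rowwise = output.scale_inv.has_data();
  plan.columnwise = output.columnwise_scale_inv.has_data();
  // Step 2: the input must provide matching data for each request.
  if (plan.rowwise && input.scale_inv.numel != output.scale_inv.numel) {
    plan.ok = false;
    plan.error = "input row-wise scaling factors missing or wrong size";
  }
  if (plan.columnwise &&
      input.columnwise_scale_inv.numel != output.columnwise_scale_inv.numel) {
    plan.ok = false;
    plan.error = "input column-wise scaling factors missing or wrong size";
  }
  return plan;
}
```

An output with no scaling buffers yields a plan with nothing to do, which corresponds to the early return in the reviewed code.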

greptile-apps · 2026-03-17T03:27:51Z

tests/cpp/operator/test_swizzle.cu

+std::vector<std::pair<size_t, size_t>> unswizzle_data_shapes = {
+  // Aligned: scale dims are already multiples of 128 and 4
+  {128, 128},
+  {128, 16896},   // K = 132 * 128, large K
+  {16896, 128},   // M = 132 * 128, large M
+  // M-padding only: M not a multiple of 128 (scale-M needs padding to 256)
+  {160, 128},
+  // scale-K padding only: K/32 = 3, padded to 4
+  {128, 96},
+  // Both M and scale-K need padding
+  {160, 96},
+};
+
 std::vector<std::pair<bool, bool>> scaling_mode = {
  {true, false},


Roundtrip test only covers aligned matrix dimensions

performTestSwizzleUnswizzleRoundtrip is instantiated exclusively with the existing num_tiles vector, which always produces M = num_tiles_M * MAT_TILE_DIM_M — values that are exact multiples of 128 (the scale-M alignment). The standalone performTestUnswizzle1D intentionally adds padded shapes (e.g., M=160, K=96) via unswizzle_data_shapes, but no equivalent padded cases exist for the roundtrip.

If the output-size validation or padding-mask logic ever diverges between the swizzle and unswizzle paths for non-aligned M/K, the roundtrip test would pass while standalone tests fail (or vice-versa). Consider adding a few padded shapes (e.g., {4, 3} tile-count pairs or raw {160, 96} shapes) to num_tiles or creating a separate data-shape vector for the roundtrip suite.

@int-smart Is there a reason for that difference between the tests?

Just wanted to keep one test similar to the swizzle tests, which have aligned test cases. Moved the roundtrip to use unswizzle_data_shapes as well, with the aligned cases included in unswizzle_data_shapes.
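
For reference, the padded scale shapes implied by the comments in `unswizzle_data_shapes` can be worked out with a small sketch. `padded_rowwise_scale_shape` is a hypothetical helper, and the block size of 32 is taken from the "K/32" comment in the test data rather than from the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

constexpr std::size_t kBlockSize = 32;     // data elements per scale (from "K/32")
constexpr std::size_t kScaleAlignM = 128;  // scale-M alignment
constexpr std::size_t kScaleAlignK = 4;    // scale-K alignment

constexpr std::size_t div_up(std::size_t a, std::size_t b) { return (a + b - 1) / b; }
constexpr std::size_t round_up(std::size_t a, std::size_t b) { return div_up(a, b) * b; }

// Padded row-wise scale shape (scale-M, scale-K) for an M x K data tensor.
std::pair<std::size_t, std::size_t> padded_rowwise_scale_shape(std::size_t M,
                                                               std::size_t K) {
  return std::make_pair(round_up(M, kScaleAlignM),
                        round_up(div_up(K, kBlockSize), kScaleAlignK));
}
```

For the `{160, 96}` case this gives a scale-M of 256 and a scale-K of 4, matching the "both M and scale-K need padding" comment.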

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

tests/cpp/operator/test_swizzle.cu

…streamline processing. Need to check if rowwise and columnwise both can be true. If yes the if else needs to account for that Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps · 2026-03-19T04:18:54Z

transformer_engine/common/swizzle/swizzle.cu

+  const void* input = kernel_args.input_list[tensor_id];
+  void* output = kernel_args.output_list[tensor_id];
+  const int M = kernel_args.m_list[tensor_id];
+  const int K = kernel_args.k_list[tensor_id];
+
+  constexpr int N_TILE_PER_TD = sizeof(LType) / sizeof(int);
+  constexpr int N_TILES_IN_TB = TB_DIM * N_TILE_PER_TD;
+
+  const int num_tiles_k = K / SF_TILE_DIM_K;
+  const int num_tiles_m = M / SF_TILE_DIM_M;
+  const int flat_offset = bid - kernel_args.block_range[tensor_id];
+  const int grid_dim_x = DIVUP(num_tiles_k, N_TILES_IN_TB);
+  const int grid_dim_y = num_tiles_m;
+  const int bid_x = flat_offset / grid_dim_y;
+  const int bid_y = flat_offset % grid_dim_y;
+


unswizzle_col_scaling_kernel_impl: SLM load stride mismatch for multi-block K dimension

The SLM load for each M-tile reads SF_TILE_SIZE_I32 * k_tiles_in_tb contiguous int32s from input_i32[i]:

const int4* input_v4i = reinterpret_cast<const int4*>(input_i32[i]);
for (int j = linear_id; j < SF_TILE_SIZE_I32 * k_tiles_in_tb / 4; j += ...)
  slm_v4i[j] = input_v4i[j];

input_i32[i] is set to base + bid_x * TB_DIM * SF_TILE_SIZE_I32 + mt * SF_TILE_DIM_M_I32 * K_i32, where the stride between adjacent M-tiles in the swizzled layout is SF_TILE_DIM_M_I32 * K_i32.

For a full K-tile block (k_tiles_in_tb == TB_DIM == 32), the write size is 32 * SF_TILE_SIZE_I32 = 32 * 128 = 4096 int32s. The M-tile stride is 32 * K_i32. When K_i32 > 128 (e.g., K = 132 K-tiles), the M-tile stride 32 * 132 = 4224 > 4096, so there is a gap that the read does not cross—it is safe.

However, for the last K-tile block (when bid_x == grid_dim_x - 1) and k_tiles_in_tb < TB_DIM, the following K-tile block for the same M-tile starts at offset (bid_x+1) * TB_DIM * 128 + mt * 32 * K_i32, which is beyond the current read range. This appears correct in isolation, but the stride chosen by the swizzle for that region may leave uninitialised bytes between consecutive partial K-tile writes for the same M-tile.

Consider tracing through with K = 8 K-tiles (K_i32 = 8), M = 128:

num_tiles_k = 8 / SF_TILE_DIM_K_I32 = 8 / 4 = 2; grid_dim_x = DIVUP(2, TB_DIM) = 1

So bid_x=0 is the last block; k_tiles_in_tb = (2-1) % 32 + 1 = 2

Read: 2 * 128 = 256 int32s from 0 + 0 * 32 * 8 = 0

Swizzle stored: 2 * 128 = 256 int32s at offset 0

Layouts agree here. But consider K_i32 = 132, grid_dim_x = 2, bid_x = 1 (last), mt = 1:

input_i32[1] = 1 * 32 * 128 + 1 * 32 * 132 = 4096 + 4224 = 8320

k_tiles_in_tb = (33-1) % 32 + 1 = 1

Read: 1 * 128 = 128 int32s from 8320

Swizzle for (bid_x=1, mt=1) wrote 128 int32s starting at 1 * 32 * 128 + 1 * 32 * 132 = 8320. This matches.

After more careful analysis, the reads do appear correct for the cases tested by the test suite (all aligned shapes). However, the correctness relies on k_tiles_in_tb being computed identically in the swizzle and unswizzle kernels. Please add an assertion or comment clarifying the layout invariant assumed by this contiguous read, and add a test covering non-power-of-two K-tile counts (e.g., K = 132 * 4 = 528 with M = 256) to catch any latent mismatch.
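
The arithmetic in the trace above can be reproduced with a short sketch. The constants are the values quoted in the comment (TB_DIM = 32, SF_TILE_SIZE_I32 = 128, SF_TILE_DIM_M_I32 = 32); the helper names are hypothetical and exist only to re-check the numbers, not to model the kernel.

```cpp
#include <cassert>

// Constants as quoted in the review comment above.
constexpr int kTbDim = 32;           // TB_DIM
constexpr int kSfTileSizeI32 = 128;  // SF_TILE_SIZE_I32
constexpr int kSfTileDimMI32 = 32;   // SF_TILE_DIM_M_I32

// int32 offset at which block bid_x starts reading M-tile mt.
constexpr int input_offset_i32(int bid_x, int mt, int k_i32) {
  return bid_x * kTbDim * kSfTileSizeI32 + mt * kSfTileDimMI32 * k_i32;
}

// Number of K tiles handled by the last block along K.
constexpr int k_tiles_in_last_block(int num_tiles_k) {
  return (num_tiles_k - 1) % kTbDim + 1;
}

// Number of contiguous int32s read per M-tile by one block.
constexpr int read_size_i32(int k_tiles_in_tb) {
  return k_tiles_in_tb * kSfTileSizeI32;
}

// The two cases traced in the comment.
static_assert(k_tiles_in_last_block(2) == 2, "K_i32 = 8 case");
static_assert(read_size_i32(2) == 256, "K_i32 = 8 case");
static_assert(input_offset_i32(1, 1, 132) == 8320, "K_i32 = 132 case");
static_assert(k_tiles_in_last_block(33) == 1, "K_i32 = 132 case");
```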

transformer_engine/common/swizzle/swizzle.cu

…put tensors for scaling mode and data validation. Updated checks for input and output tensor shapes to ensure proper handling of row-wise and column-wise scaling factors. Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps · 2026-03-21T05:06:38Z

transformer_engine/common/swizzle/swizzle.cu

+#pragma unroll
+  for (int i = 0; i < kVectorSize; i++)
+    tmp[i % N_SF_PER_TD_PER_TILE * N_TILE_PER_TD + i / N_SF_PER_TD_PER_TILE] = ptr[i];


Missing parentheses around modulo expression

In C++, % and * share the same precedence and are left-to-right associative, so i % N_SF_PER_TD_PER_TILE * N_TILE_PER_TD is evaluated as (i % N_SF_PER_TD_PER_TILE) * N_TILE_PER_TD. This happens to be the correct inverse formula, but the lack of explicit parentheses makes the intent unclear compared to regs_shuffle's i / N_TILE_PER_TD + i % N_TILE_PER_TD * N_SF_PER_TD_PER_TILE (which has the same ambiguity). Adding explicit parentheses improves readability:

Suggested change

 #pragma unroll
 for (int i = 0; i < kVectorSize; i++)
-  tmp[i % N_SF_PER_TD_PER_TILE * N_TILE_PER_TD + i / N_SF_PER_TD_PER_TILE] = ptr[i];
+  tmp[(i % N_SF_PER_TD_PER_TILE) * N_TILE_PER_TD + i / N_SF_PER_TD_PER_TILE] = ptr[i];
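
The claim that the unshuffle index is the inverse of the shuffle index can be checked on the host with a small sketch. It assumes N_SF_PER_TD_PER_TILE is 4 and that both expressions are used as destination indices (`tmp[dst] = ptr[i]`); `shuffle_dst`, `unshuffle_dst`, `scatter`, and `roundtrip_is_identity` are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kSfPerThreadPerTile = 4;  // assumed value of N_SF_PER_TD_PER_TILE

// Destination index quoted for regs_shuffle.
int shuffle_dst(int i, int tiles_per_thread) {
  return i / tiles_per_thread + (i % tiles_per_thread) * kSfPerThreadPerTile;
}

// Destination index for the unshuffle, with the suggested parentheses.
int unshuffle_dst(int i, int tiles_per_thread) {
  return (i % kSfPerThreadPerTile) * tiles_per_thread + i / kSfPerThreadPerTile;
}

// Scatter through a destination map: out[dst(i)] = regs[i].
std::vector<int> scatter(const std::vector<int>& regs, int tiles_per_thread,
                         int (*dst)(int, int)) {
  std::vector<int> out(regs.size());
  for (int i = 0; i < static_cast<int>(regs.size()); ++i)
    out[static_cast<std::size_t>(dst(i, tiles_per_thread))] =
        regs[static_cast<std::size_t>(i)];
  return out;
}

// True if shuffle followed by unshuffle restores the original registers.
bool roundtrip_is_identity(int tiles_per_thread) {
  std::vector<int> regs(
      static_cast<std::size_t>(kSfPerThreadPerTile * tiles_per_thread));
  for (std::size_t i = 0; i < regs.size(); ++i) regs[i] = 100 + static_cast<int>(i);
  return scatter(scatter(regs, tiles_per_thread, shuffle_dst), tiles_per_thread,
                 unshuffle_dst) == regs;
}
```

Viewed as a 4 x N_TILE_PER_TD matrix of registers, the shuffle is a transpose and the unshuffle transposes back, which is why the round trip is the identity for each vector width.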

greptile-apps · 2026-03-21T05:06:39Z

transformer_engine/common/swizzle/swizzle.cu

+      NVTE_CHECK(m % SF_TILE_DIM_M == 0, "Input should be padded in M/N dimension!");
+      NVTE_CHECK(k % SF_TILE_DIM_K == 0, "Input should be padded in K dimension!");
+      NVTE_CHECK(k > 0, "Input scale inverse should be 2D!");
+      NVTE_CHECK(m * k == std::accumulate(output[i]->columnwise_scale_inv.shape.begin(),
+                                          output[i]->columnwise_scale_inv.shape.end(), 1,
+                                          std::multiplies<int>()),
+                 "Input.columnwise_scale_inv size is not equal to "
+                 "Output.columnwise_scale_inv size!");


Trivially-true check never validates the swizzled input buffer size

m and k are derived from output[i]->columnwise_scale_inv.shape (shape[1] and shape[0] respectively), so m * k equals output[i]->columnwise_scale_inv.numel() by construction. The std::accumulate on the output shape will always equal m * k, making this check a no-op that never catches a real mismatch. Furthermore, the error message says "Input.columnwise_scale_inv size is not equal to Output.columnwise_scale_inv size!" but the check never reads input[i]->columnwise_scale_inv.numel().

The missing check is against the input (swizzled) size — mirror the same validation added in the rowwise path (line 1380):

NVTE_CHECK(input[i]->columnwise_scale_inv.has_data(),
           "Input tensor ", i, " does not have column-wise scaling factors.");
NVTE_CHECK(static_cast<size_t>(m) * k == input[i]->columnwise_scale_inv.numel(),
           "Expected input tensor ", i, " to have ", static_cast<size_t>(m) * k,
           " column-wise scaling factors, but got shape=",
           input[i]->columnwise_scale_inv.shape, ".");
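
To make the difference concrete, here is a minimal sketch contrasting the tautological check with the input-based one. `ScaleInv`, `make_scale`, `output_only_check`, and `input_matches_output` are hypothetical stand-ins written for this example; they only mimic the `shape`, `has_data()`, and `numel()` members used in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scale-inverse buffer; an empty shape means no data.
struct ScaleInv {
  std::vector<std::size_t> shape;
  bool has_data() const { return !shape.empty(); }
  std::size_t numel() const {
    if (shape.empty()) return 0;
    std::size_t n = 1;
    for (std::size_t d : shape) n *= d;
    return n;
  }
};

ScaleInv make_scale(std::size_t d0, std::size_t d1) { return ScaleInv{{d0, d1}}; }

// Original form: m and k come from the output, and so does the product they
// are compared with, so this holds for every 2D output.
bool output_only_check(const ScaleInv& output) {
  const std::size_t m = output.shape[1];
  const std::size_t k = output.shape[0];
  return m * k == output.numel();
}

// Suggested form: the expected size still comes from the output, but it is
// compared with what the swizzled input actually holds.
bool input_matches_output(const ScaleInv& input, const ScaleInv& output) {
  const std::size_t m = output.shape[1];
  const std::size_t k = output.shape[0];
  return input.has_data() && m * k == input.numel();
}
```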

… use input numel Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps · 2026-03-21T05:23:12Z

transformer_engine/common/swizzle/swizzle.cu

+                   m * k, " row-wise scaling factors, but got shape=", output[i]->scale_inv.shape,
+                   ".");


Misleading error message prints output shape instead of input shape

The NVTE_CHECK on line 1364 tests whether input[i]->scale_inv.numel() equals m * k (where m and k come from output[i]->scale_inv.shape). When the check fails the error message should tell the caller what the input actually contained, but it prints output[i]->scale_inv.shape instead.

Because m * k was already derived from the output shape, the message effectively says "got shape [Y, Z]" where Y*Z == m*k — the very value that was expected — giving the developer no information about what the mismatching input looked like.

Suggested change

-                   m * k, " row-wise scaling factors, but got shape=", output[i]->scale_inv.shape,
-                   ".");
+        NVTE_CHECK(m * k == input[i]->scale_inv.numel(), "Expected input tensor ", i, " to have ",
+                   m * k, " row-wise scaling factors, but got shape=", input[i]->scale_inv.shape,
+                   ".");

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps · 2026-03-21T05:35:35Z

transformer_engine/common/swizzle/swizzle.cu

+      NVTE_CHECK(m * k == std::accumulate(input[i]->columnwise_scale_inv.shape.begin(),
+                                          input[i]->columnwise_scale_inv.shape.end(), 1,
+                                          std::multiplies<int>()),
+                 "Input.columnwise_scale_inv size is not equal to "
+                 "Output.columnwise_scale_inv size!");
+
+      int num_tiles_k = k / SF_TILE_DIM_K;
+      int vec_load_size_i = (num_tiles_k - 1) % 4 + 1;
+      vec_load_size = std::min(vec_load_size, vec_load_size_i);
+
+      const int pos = kernel_args.num_tensors;
+      kernel_args.input_list[pos] = const_cast<void*>(input[i]->columnwise_scale_inv.dptr);


Missing has_data() validation for input columnwise scaling factors

The columnwise unswizzle path accesses input[i]->columnwise_scale_inv.shape (line 1435) and .dptr (line 1446) without first checking that the data actually exists. The validation loop at lines 1330–1332 only checks output[i]->columnwise_scale_inv.has_data() — not the input.

If input[i]->columnwise_scale_inv.has_data() is false, this block will silently compute with a null/zero numel() on line 1435 (causing the size check to fail spuriously) and then pass a null pointer to the kernel on line 1446, producing undefined behaviour.

By contrast, the analogous rowwise path at lines 1358–1359 explicitly guards against this with:

NVTE_CHECK(input[i]->scale_inv.has_data(), "Input tensor ", i, " does not have row-wise scaling factors.");

The same guard should be added before line 1435:

NVTE_CHECK(input[i]->columnwise_scale_inv.has_data(), "Input tensor ", i, " does not have column-wise scaling factors.");

greptile-apps · 2026-03-21T05:35:36Z

transformer_engine/common/swizzle/swizzle.cu

+        // Example for NVFP4 rowwise path:  
+        NVTE_CHECK(static_cast<size_t>(m) * k == input->scale_inv.numel(),  
+                  "Expected input tensor to have ", static_cast<size_t>(m) * k, 
+                  " row-wise scaling factors, but got shape=", input->scale_inv.shape, ".");


Leftover code-review comment should be removed

The comment // Example for NVFP4 rowwise path: (with trailing whitespace on lines 1225–1226) appears to be an artefact of a reviewer's suggested code snippet that was copy-pasted directly into the source. It is not a meaningful code comment and will confuse future readers about why this block is labelled "example".

Suggested change

-        // Example for NVFP4 rowwise path:
-        NVTE_CHECK(static_cast<size_t>(m) * k == input->scale_inv.numel(),
-                  "Expected input tensor to have ", static_cast<size_t>(m) * k,
-                  " row-wise scaling factors, but got shape=", input->scale_inv.shape, ".");
+        NVTE_CHECK(static_cast<size_t>(m) * k == input->scale_inv.numel(),
+                  "Expected input tensor to have ", static_cast<size_t>(m) * k,
+                  " row-wise scaling factors, but got shape=", input->scale_inv.shape, ".");

greptile-apps · 2026-03-21T05:35:37Z

transformer_engine/common/swizzle/swizzle.cu

+
+void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
+                                            std::vector<Tensor*>& output, cudaStream_t stream) {
+  size_t num_tensors = output.size();


num_tensors derived from output rather than input

num_tensors is set to output.size(), but the loop body accesses both input[i] and output[i] up to that count. The symmetric multi_tensor_swizzle_scaling_factors (line 971) uses input.size() as the canonical source.

If a caller passes input.size() < output.size() (which can happen when calling the internal function directly, before the public nvte_multi_tensor_unswizzle_scaling_factors wrapper fills both vectors symmetrically), accessing input[i] for i >= input.size() is undefined behaviour.

Suggested change

-  size_t num_tensors = output.size();
+  size_t num_tensors = input.size();

greptile-apps · 2026-03-21T05:35:38Z

transformer_engine/common/swizzle/swizzle.cu

+  NVTE_CHECK(!has_rowwise_scale_inv || !has_columnwise_scale_inv,
+             "Output tensor has both row-wise and column-wise scaling factors");


Dual-scale tensors rejected in single-tensor unswizzle, breaking round-trip symmetry

unswizzle_scaling_factors hard-rejects any tensor that has both rowwise and columnwise scaling factors (line 1155–1156). The corresponding swizzle_scaling_factors (line 599) has the same check, so calling the public round-trip pair:

nvte_swizzle_scaling_factors(input, swizzled, stream);     // succeeds (handles both scales)
nvte_unswizzle_scaling_factors(swizzled, output, stream);  // FAILS: "Output tensor has both…"

will throw at runtime for any dual-path MXFP8 tensor (common in training). The implementation already has independent rowwise and columnwise kernel paths (if (has_rowwise_scale_inv) / else if (has_columnwise_scale_inv) at lines 1183 and 1197), so the restriction is artificial. Lifting it requires changing else if (has_columnwise_scale_inv) to if (has_columnwise_scale_inv) and executing both paths sequentially when both flags are set — exactly mirroring swizzle_scaling_factors's design.
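
The control-flow change described above (two independent `if` blocks instead of `if` / `else if`) can be illustrated with a toy sketch. The functions below are hypothetical and only record which paths would run; they do not launch kernels or reflect the actual implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Current shape of the dispatch: at most one path runs.
std::vector<std::string> exclusive_dispatch(bool has_rowwise, bool has_columnwise) {
  std::vector<std::string> launched;
  if (has_rowwise) {
    launched.push_back("rowwise");
  } else if (has_columnwise) {
    launched.push_back("columnwise");
  }
  return launched;
}

// Proposed shape: each path is checked on its own, so a dual-scale tensor
// runs the row-wise path and then the column-wise path.
std::vector<std::string> independent_dispatch(bool has_rowwise, bool has_columnwise) {
  std::vector<std::string> launched;
  if (has_rowwise) launched.push_back("rowwise");
  if (has_columnwise) launched.push_back("columnwise");
  return launched;
}
```

For single-scale tensors the two variants behave the same; they differ only when both flags are set.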

int-smart and others added 6 commits March 3, 2026 20:40

Add swizzle/unswizzle roundtrip test for scaling factors

6a064cf

These enhancements test the changes introduced for unswizzling. Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

621bc16

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Mar 4, 2026


vthumbe1503 added the community-contribution label Mar 4, 2026

Added greptile suggestions

17dbb33

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

int-smart force-pushed the feature/unswizzle branch from 85ea04b to 17dbb33 Compare March 5, 2026 02:13

[pre-commit.ci] auto fixes from pre-commit.com hooks

bd0e4e2

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Mar 5, 2026

transformer_engine/common/swizzle/swizzle.cu (outdated)

tests/cpp/operator/test_swizzle.cu

tests/cpp/operator/test_swizzle.cu

transformer_engine/common/swizzle/swizzle.cu (outdated)

int-smart and others added 2 commits March 4, 2026 18:49

Removed unused check from tests and reading input directly as const r…

57c8532

…ather than casting Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d7b6d2d

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Mar 5, 2026

transformer_engine/common/swizzle/swizzle.cu (outdated)

transformer_engine/common/swizzle/swizzle.cu (outdated)

transformer_engine/common/swizzle/swizzle.cu (outdated)

int-smart and others added 2 commits March 12, 2026 19:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

4410e9d

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Mar 13, 2026

transformer_engine/common/swizzle/swizzle.cu (outdated)

ptrendx reviewed Mar 16, 2026

transformer_engine/common/swizzle/swizzle.cu

Refactor unswizzling scaling factors to use a launch function

8e272a7

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 17, 2026


Change unswizzling to use output as gt.

bc1fb51

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 17, 2026

tests/cpp/operator/test_swizzle.cu

Refactor unswizzling scaling factors to improve input validation and …

cf262c0

…streamline processing. Need to check if rowwise and columnwise both can be true. If yes the if else needs to account for that Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 19, 2026


int-smart added 2 commits March 20, 2026 15:28

Enhance swizzle tests and unswizzling validation

38cec8c

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 21, 2026

transformer_engine/common/swizzle/swizzle.cu

transformer_engine/common/swizzle/swizzle.cu (outdated)

transformer_engine/common/include/transformer_engine/swizzle.h (outdated)

Fix typos and update validation checks in swizzle.cu

abb0b29

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 21, 2026


Update validation checks in multi_tensor_unswizzle_scaling_factors to…

dbf6c34

… use input numel Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 21, 2026


Typo

ed009f2

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot reviewed Mar 21, 2026


-                   m * k, " row-wise scaling factors, but got shape=", output[i]->scale_inv.shape,
-                   ".");
+        NVTE_CHECK(m * k == input[i]->scale_inv.numel(), "Expected input tensor ", i, " to have ",
+                   m * k, " row-wise scaling factors, but got shape=", input[i]->scale_inv.shape,
+                   ".");


Conversation

int-smart commented Mar 4, 2026 • edited


greptile-apps bot commented Mar 4, 2026 • edited


Greptile Summary

Confidence Score: 2/5

Important Files Changed

Flowchart


ptrendx commented Mar 11, 2026


int-smart commented Mar 12, 2026


int-smart commented Mar 12, 2026


ptrendx commented Mar 16, 2026


int-smart commented Mar 16, 2026


ptrendx Mar 16, 2026


ptrendx Mar 16, 2026


int-smart Mar 17, 2026


ptrendx Mar 19, 2026


greptile-apps bot Mar 17, 2026


ptrendx Mar 19, 2026


int-smart Mar 21, 2026


greptile-apps bot Mar 19, 2026


greptile-apps bot Mar 21, 2026


greptile-apps bot Mar 21, 2026


greptile-apps bot Mar 21, 2026
