[WIP] Grouped GEMM with ck_tile by matthiasdiener · Pull Request #434 · ROCm/TransformerEngine

matthiasdiener · 2026-01-28T15:49:27Z

Description

See https://github.com/ROCm/frameworks-internal/issues/13792 for context.

TODOs:

Enable tests in test_numerics.py
Make kernels selectable & tunable
Handle gelu/bias (or make sure these are not passed in)
Performance analysis and improvements: https://github.com/ROCm/frameworks-internal/issues/15185#issuecomment-3863052452
More tests

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes


wangye805 · 2026-02-02T04:17:46Z

tests/pytorch/test_numerics.py

    delay_wgrad_compute,
 ):
    os.environ["NVTE_USE_CUTLASS_GROUPED_GEMM"] = "1"
+    if IS_HIP_EXTENSION:


Is our CK grouped GEMM a drop-in replacement for the NV upstream CUTLASS grouped GEMM? If so, we can share the same env. It's like cublaslt vs hipblaslt...

It mostly is a drop-in replacement for upstream, so I changed the envs to the upstream versions in 259645c
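A minimal sketch of what sharing one environment variable across both backends could look like. `env_flag`, `GroupedGemmPath`, and `select_grouped_gemm_path` are hypothetical names for illustration, and the parsing rule (unset means default, empty or "0" means false) is an assumption, not the behavior of TE's own `getenv<bool>`:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: read an environment variable as a boolean flag.
// Assumed rule: unset -> default_value, "" or "0" -> false, anything else -> true.
bool env_flag(const char* name, bool default_value) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) return default_value;
  const std::string value(raw);
  return !(value.empty() || value == "0");
}

// Hypothetical set of code paths a grouped GEMM call could take.
enum class GroupedGemmPath { kDefault, kCutlass, kCkTile };

// One shared flag enables the fused grouped GEMM; the platform macro
// (not a second environment variable) decides which backend that means.
GroupedGemmPath select_grouped_gemm_path() {
  if (!env_flag("NVTE_USE_CUTLASS_GROUPED_GEMM", false)) {
    return GroupedGemmPath::kDefault;
  }
#ifdef __HIP_PLATFORM_AMD__
  return GroupedGemmPath::kCkTile;
#else
  return GroupedGemmPath::kCutlass;
#endif
}
```

With this shape, a test that sets `NVTE_USE_CUTLASS_GROUPED_GEMM=1` needs no platform-specific branch.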

wangye805 · 2026-02-02T04:22:57Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

+struct TileCfg_basic {
+  static constexpr ck_tile::index_t M_Tile = 256;
+  static constexpr ck_tile::index_t N_Tile = 128;
+  static constexpr ck_tile::index_t K_Tile = 64;
+
+  static constexpr ck_tile::index_t M_Warp = 2;
+  static constexpr ck_tile::index_t N_Warp = 2;
+  static constexpr ck_tile::index_t K_Warp = 1;
+
+  static constexpr ck_tile::index_t M_Warp_Tile = 32;
+  static constexpr ck_tile::index_t N_Warp_Tile = 32;
+  static constexpr ck_tile::index_t K_Warp_Tile = 16;
+
+  static constexpr bool kPadM = true;
+  static constexpr bool kPadN = true;
+  static constexpr bool kPadK = true;
+
+  static constexpr bool DoubleSmemBuffer = false;
+
+  static constexpr ck_tile::index_t TilePartitionerGroupNum = 8;
+  static constexpr ck_tile::index_t TilePartitionerM01      = 1;
+};
+
+template <typename AType, typename BType, typename CType,
+          typename ALayout, typename BLayout, typename CLayout,
+          typename TileCfg, ck_tile::memory_operation_enum MemOp,
+          typename AccType = float>
+class Runner{
+public:
+  using GemmShape = ck_tile::TileGemmShape<
+      ck_tile::sequence<TileCfg::M_Tile, TileCfg::N_Tile, TileCfg::K_Tile>,
+      ck_tile::sequence<TileCfg::M_Warp, TileCfg::N_Warp, TileCfg::K_Warp>,
+      ck_tile::sequence<TileCfg::M_Warp_Tile, TileCfg::N_Warp_Tile, TileCfg::K_Warp_Tile>>;
+
+  using Partitioner = ck_tile::GemmSpatiallyLocalTilePartitioner<
+      GemmShape, TileCfg::TilePartitionerGroupNum, TileCfg::TilePartitionerM01>;
+
+  using UniversalTraits = ck_tile::PersistentTileGemmUniversalTraits<
+      TileCfg::kPadM, TileCfg::kPadN, TileCfg::kPadK,
+      TileCfg::DoubleSmemBuffer, ALayout, BLayout, CLayout>;
+
+  static constexpr ck_tile::GemmPipelineScheduler Scheduler =
+      ck_tile::GemmPipelineScheduler::Intrawave;
+
+  using Problem = ck_tile::UniversalGemmPipelineProblem<
+      AType, BType, AccType, GemmShape, UniversalTraits, Scheduler>;
+
+  using Pipeline = ck_tile::GemmPipelineAgBgCrCompV3<Problem>;
+
+  using Epilogue = ck_tile::CShuffleEpilogue<
+      ck_tile::CShuffleEpilogueProblem<
+          AType, BType, ck_tile::tuple<>, AccType,
+          CType, ck_tile::tuple<>, CLayout,
+          ck_tile::element_wise::PassThrough,
+          Partitioner::MPerBlock, Partitioner::NPerBlock,
+          TileCfg::M_Warp, TileCfg::N_Warp,
+          TileCfg::M_Warp_Tile, TileCfg::N_Warp_Tile, TileCfg::K_Warp_Tile,
+          Problem::TransposeC, MemOp>>;
+
+  using Kernel = ck_tile::GroupedGemmKernel<Partitioner, Pipeline, Epilogue>;
+};


Is this code from the CK repo? If so, can you add a comment pointing to the reference?

I added a reference in fac7c11.

I can see the comment with the reference to CK repo, so I am resolving this.
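For readers unfamiliar with the tile parameters in the hunk above, the following sketch checks the arithmetic relationship between them. It assumes (this is an interpretation, not something stated in the PR) that each block tile dimension is covered by `warps × warp tile` repeated a whole number of times. `index_t`, `TileCfgBasicLike`, and `TileCheck` are stand-ins that do not depend on ck_tile:

```cpp
using index_t = int;  // stand-in for ck_tile::index_t

// Copy of the numeric values from TileCfg_basic in the diff above.
struct TileCfgBasicLike {
  static constexpr index_t M_Tile = 256, N_Tile = 128, K_Tile = 64;
  static constexpr index_t M_Warp = 2, N_Warp = 2, K_Warp = 1;
  static constexpr index_t M_Warp_Tile = 32, N_Warp_Tile = 32, K_Warp_Tile = 16;
};

// Hypothetical compile-time checks derived from a tile configuration.
template <typename Cfg>
struct TileCheck {
  // Number of warps cooperating on one block tile.
  static constexpr index_t kWarpsPerBlock = Cfg::M_Warp * Cfg::N_Warp * Cfg::K_Warp;
  // How many warp tiles each warp processes along each dimension (assumed model).
  static constexpr index_t kMRepeat = Cfg::M_Tile / (Cfg::M_Warp * Cfg::M_Warp_Tile);
  static constexpr index_t kNRepeat = Cfg::N_Tile / (Cfg::N_Warp * Cfg::N_Warp_Tile);
  static constexpr index_t kKRepeat = Cfg::K_Tile / (Cfg::K_Warp * Cfg::K_Warp_Tile);
  // True when every block tile dimension divides evenly.
  static constexpr bool kDivisible =
      Cfg::M_Tile % (Cfg::M_Warp * Cfg::M_Warp_Tile) == 0 &&
      Cfg::N_Tile % (Cfg::N_Warp * Cfg::N_Warp_Tile) == 0 &&
      Cfg::K_Tile % (Cfg::K_Warp * Cfg::K_Warp_Tile) == 0;
};

static_assert(TileCheck<TileCfgBasicLike>::kDivisible,
              "block tile must be a multiple of warps * warp tile");
```

A check like this could be useful once kernels become selectable and tunable, since it rejects inconsistent configurations at compile time.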

wangye805 · 2026-02-02T04:24:43Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

+  std::vector<ck_tile::GroupedGemmHostArgs<0>> descs;
+  descs.reserve(group_num);


Why not put group_num inside the desc vector definition?

I used reserve() here instead of std::vector<ck_tile::GroupedGemmHostArgs<0>> descs(group_num); to avoid default-constructing GroupedGemmHostArgs objects that are immediately overwritten, to reduce construction overhead.
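The difference between the two approaches can be sketched as follows. `HostArgsLike` is a hypothetical stand-in for `ck_tile::GroupedGemmHostArgs<0>` (its real members are not reproduced here), instrumented with a counter so the number of default constructions is observable:

```cpp
#include <cstddef>
#include <vector>

// Global counter of default constructions, for illustration only.
int g_default_ctor_calls = 0;

// Hypothetical stand-in for a per-group argument descriptor.
struct HostArgsLike {
  int M = 0, N = 0, K = 0;
  HostArgsLike() { ++g_default_ctor_calls; }
  HostArgsLike(int m, int n, int k) : M(m), N(n), K(k) {}
};

// reserve() allocates capacity but constructs no elements;
// emplace_back() then constructs each element once, in place.
std::vector<HostArgsLike> build_with_reserve(std::size_t group_num) {
  std::vector<HostArgsLike> descs;
  descs.reserve(group_num);
  for (std::size_t i = 0; i < group_num; ++i) {
    descs.emplace_back(static_cast<int>(i + 1), 128, 64);
  }
  return descs;
}

// The sized constructor default-constructs group_num elements,
// each of which is then overwritten by assignment.
std::vector<HostArgsLike> build_with_sized_ctor(std::size_t group_num) {
  std::vector<HostArgsLike> descs(group_num);
  for (std::size_t i = 0; i < group_num; ++i) {
    descs[i] = HostArgsLike(static_cast<int>(i + 1), 128, 64);
  }
  return descs;
}
```

Both functions return the same contents; only the second one pays for `group_num` default constructions first. Whether that cost is measurable depends on how expensive the real descriptor type is to construct.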

wangye805 · 2026-02-02T04:25:10Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

+  using R = Runner<T, T, T, ALayout, BLayout, CLayout, TileCfg_basic, MemOp>;
+  using Kernel = typename R::Kernel;


This R is not used anywhere else

I merged R into the next line in fac7c11.

wangye805 · 2026-02-02T04:27:42Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

+    if (a.shape.size() != 2 || b.shape.size() != 2 || d.shape.size() != 2) {
+      NVTE_ERROR("grouped_gemm_ck_tile: expected all groups to be 2D.");
+      return false;
+    }


Does grouped GEMM support generalized matrices from high-dimensional tensors? Regular GEMM supports that, and TE treats the last dim as the column dimension, with all other dimensions flattened into rows:

TransformerEngine/transformer_engine/common/common.h

Lines 238 to 262 in 9d6b0e5

    size_t flat_first_dim() const {
      const auto &full_shape = shape();
      size_t ret = 1;
      if (!full_shape.empty()) {
        for (size_t i = 0; i < full_shape.size() - 1; i++) {
          ret *= full_shape[i];
        }
      }
      return ret;
    }

    /*! Matrix width after tensor is flattened to 2D
     *
     * If a tensor has dimensions (D1, D2, ..., Dn), it is reinterpreted
     * as a (D1*D2*...*D(n-1), Dn) matrix.
     */
    size_t flat_last_dim() const {
      const auto &full_shape = shape();
      if (full_shape.empty()) {
        return 1;
      } else {
        return full_shape.back();
      }
    }
    };

I added (untested) support for higher-dim tensors in dd3ed2f
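The flattening rule quoted from `common.h` can be sketched as a free function. `flatten_to_2d` is a hypothetical helper written for illustration; it is not the code added in dd3ed2f, and it only mirrors the semantics of `flat_first_dim()` and `flat_last_dim()`:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: reinterpret a (D1, D2, ..., Dn) shape as a
// (D1*D2*...*D(n-1), Dn) matrix. An empty shape maps to (1, 1),
// matching what flat_first_dim()/flat_last_dim() return in that case.
std::pair<std::size_t, std::size_t> flatten_to_2d(
    const std::vector<std::size_t>& shape) {
  if (shape.empty()) {
    return {1, 1};
  }
  std::size_t rows = 1;
  // Multiply all dimensions except the last one.
  for (std::size_t i = 0; i + 1 < shape.size(); ++i) {
    rows *= shape[i];
  }
  return {rows, shape.back()};
}
```

For example, a tensor of shape (4, 8, 16) would be treated as a 32 x 16 matrix, and a 1D tensor of length 16 as a 1 x 16 matrix.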

wangye805 · 2026-02-02T04:37:09Z

transformer_engine/common/gemm/ck_grouped_gemm.cuh

+  }
+}
+
+bool grouped_gemm_ck_tile(const NVTETensor* A,


Why do we overload this function? In cublaslt_gemm.cu, it's only called with this signature. Perhaps we can rename the grouped_gemm_ck_tile in line 255

I simplified this in 259645c so that there is no more overload (only this signature remains).

wangye805 · 2026-02-02T04:38:32Z

transformer_engine/common/gemm/cublaslt_gemm.cu

+      transformer_engine::getenv<bool>("NVTE_CK_GROUPED_GEMM_WARN_FALLBACK", false);
+
+  auto is_supported_dtype = [&]() -> bool {
+    auto *inputA = transformer_engine::convertNVTETensorCheck(A[0]);


Is it possible that num_group is 0, so that the A[0] access is not valid?

a42f7ca removed the separate implementation of is_supported_dtype (among others), changing the code to use the upstream version of that function for the most part. I believe the upstream code has the same issue, so I added an early exit in 0b16287. What do you think?
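The kind of guard discussed here can be sketched as below. `TensorLike`, `DTypeLike`, and `groups_have_supported_dtype` are hypothetical, the set of supported dtypes is an assumption, and returning `false` for an empty group list is one possible choice rather than what 0b16287 necessarily does:

```cpp
// Hypothetical element types, for illustration only.
enum class DTypeLike { kFloat32, kFloat16, kBFloat16 };

// Hypothetical stand-in for a tensor handle.
struct TensorLike {
  DTypeLike dtype;
};

// Returns true only if there is at least one group and all groups share
// a supported dtype. The early return guarantees A[0] is never read
// when num_gemms <= 0.
bool groups_have_supported_dtype(const TensorLike* const* A, int num_gemms) {
  if (A == nullptr || num_gemms <= 0) {
    return false;  // nothing to inspect; do not touch A[0]
  }
  const DTypeLike first = A[0]->dtype;
  // Assumption: only 16-bit float types are handled by the fused path.
  if (first != DTypeLike::kFloat16 && first != DTypeLike::kBFloat16) {
    return false;
  }
  for (int i = 1; i < num_gemms; ++i) {
    if (A[i]->dtype != first) {
      return false;
    }
  }
  return true;
}
```

Placing the check before the first dereference keeps the lambda-style "peek at element 0" pattern safe without changing its behavior for non-empty inputs.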

wangye805 · 2026-02-02T04:41:16Z

transformer_engine/common/CMakeLists.txt

+set(CK_ROOT ${CMAKE_SOURCE_DIR}/../../3rdparty/aiter/3rdparty/composable_kernel)
+
+target_include_directories(transformer_engine
+  BEFORE PRIVATE


Why use the BEFORE keyword in this target_include_directories call? Is it because CMake will not be able to find the correct header files without prioritizing the CK include dirs?

I removed BEFORE in 259645c, compilation still seems to work fine.

wangye805 · 2026-02-02T04:42:26Z

transformer_engine/common/CMakeLists.txt

 target_include_directories(transformer_engine PUBLIC
                           "${CMAKE_CURRENT_SOURCE_DIR}/include")

+set(CK_ROOT ${CMAKE_SOURCE_DIR}/../../3rdparty/aiter/3rdparty/composable_kernel)


CMAKE_SOURCE_DIR --> CMAKE_CURRENT_SOURCE_DIR? Not sure whether other upstream libs will depend on us but let's make it future proof

Changed to CMAKE_CURRENT_SOURCE_DIR in 259645c.

wangye805 · 2026-02-02T04:44:55Z

transformer_engine/common/gemm/cublaslt_gemm.cu

 #include "common/util/cuda_runtime.h"
+#include "common/util/system.h"
 #ifndef __HIP_PLATFORM_AMD__
 #include "cutlass_grouped_gemm.cuh"


NV upstream made another .cu file for their cutlass_grouped_gemm and compiled it separately. Maybe we can follow their structure for better isolation (avoid CK defining some macros contaminating our cublaslt_gemm.cu)

I restructured this to a cpp file and a header file in 259645c.

matthiasdiener added 16 commits December 9, 2025 17:01

GEMM reference HIP implementation

ad748da

blockwise amax

11e090b

Merge branch 'dev' into compute-ref-offload

9006224

Change to use Tensor arguments, combine mxfp8/non-mxfp8 paths

3ecea7f

Merge remote-tracking branch 'origin/dev' into compute-ref-offload

cafee59

skip on SwizzleScale limitation on gfx950

86fbbac

Revert "skip on SwizzleScale limitation on gfx950"

54de3db

This reverts commit 86fbbac.

MXFP8 fix

311ddfe

Merge remote-tracking branch 'origin/dev' into compute-ref-offload

306e432

correct scale_inv packing and exp2(biased−127) conversion

445e64f

cleanups

462945f

Merge branch 'dev' into compute-ref-offload

e32fb3d

Merge remote-tracking branch 'origin/dev' into compute-ref-offload

7bf8adb

use Tensor class for more device objects

e11e400

Pass D Tensor into run_reference and move RefD allocation into Perfor…

325ece6

…mTest

[WIP] proof-of-concept: grouped GEMM with ck_tile

fc64b8c

matthiasdiener self-assigned this Jan 28, 2026

matthiasdiener added 3 commits January 28, 2026 09:51

Merge branch 'dev' into ck-grouped-gemm

134b350

restructure and enable tests

9091e6c

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

7435062

matthiasdiener changed the title ~~[WIP] proof-of-concept: grouped GEMM with ck_tile~~ [WIP] Grouped GEMM with ck_tile Jan 29, 2026

matthiasdiener added 2 commits January 30, 2026 14:09

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

a00a1c8

grid improvements

4e9ead9

wangye805 requested changes Feb 2, 2026


restructure

259645c

wenchenvincent requested a review from aris134 February 4, 2026 17:04

matthiasdiener added 4 commits February 4, 2026 15:41

reduce code duplication & simplify

9986bd4

make the code more similar to nv, check empty gelu/bias

355ec2f

Merge branch 'dev' into ck-grouped-gemm

df5e3ea

further simplify & make closer to nv

a42f7ca

matthiasdiener added 3 commits February 4, 2026 17:07

add ck_tile reference

fac7c11

rename in error messages

71b97e0

allow flattened higher-D tensors

dd3ed2f

aris134 approved these changes Feb 5, 2026


matthiasdiener added 2 commits February 5, 2026 12:49

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

7b0413e

relax tolerance on gfx942

ebc005f

matthiasdiener force-pushed the ck-grouped-gemm branch from 2095d3f to ebc005f Compare February 5, 2026 19:07

matthiasdiener added 2 commits February 5, 2026 14:53

enable more tests

c0bf502

return early when num_gemms<=0

0b16287

matthiasdiener force-pushed the ck-grouped-gemm branch from d1ab38e to 0b16287 Compare February 5, 2026 21:03

simplify normalization

58b34e7

matthiasdiener requested a review from wangye805 February 5, 2026 23:28

