Conversation

@arai713 arai713 commented Jan 5, 2026

This PR introduces generation of Stream-K unit tests using Tile Engine, which lets us scale up our unit tests for better coverage and maintainability. It adds a small test set targeting the fp16 and bf16 data types and covering both the atomic and parallel reduction strategies within the compv3 pipeline. It also lays the groundwork for expanding to the full set of Stream-K smoke and extended tests.

These are the supported instances:

test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
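
The 32 names above follow a regular pattern: 2 data types × 4 layouts × 2 reduction strategies × 2 values of the last boolean field. As a rough illustration only (this is not the Tile Engine generator, and `instance_names` is a hypothetical helper), the list can be rebuilt as a Cartesian product. The PR does not say what the four boolean fields mean, so the sketch treats them as opaque name components.

```python
from itertools import product

# Values read off the instance list above. The meaning of the four boolean
# fields is not stated in the PR, so they are treated as opaque name parts.
DTYPES = ("bf16", "fp16")
LAYOUTS = ("ccr", "crr", "rcr", "rrr")
STRATEGIES = ("atomic", "reduction")
LAST_FLAG = (False, True)  # only the fourth boolean varies in the list


def instance_names():
    """Rebuild the instance names as a Cartesian product (hypothetical helper)."""
    names = []
    for dtype, layout, strategy, flag in product(
        DTYPES, LAYOUTS, STRATEGIES, LAST_FLAG
    ):
        names.append(
            f"test_gemm_streamk_tile_engine_{dtype}_{layout}"
            "_small_datatype_config_compv3_default_intrawave"
            f"_{strategy}_False_False_False_{flag}"
            "_128x128x64_2x2x1_16x16x16"
        )
    return names


if __name__ == "__main__":
    for name in instance_names():
        print(name)
```

Because `product` varies its last argument fastest, the names come out in the same order as the list above.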

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Comment on lines +7 to +9
{"m": 256, "n": 256, "k": 128, "split_k": 1},
{"m": 512, "n": 256, "k": 256, "split_k": 1},
{"m": 256, "n": 512, "k": 256, "split_k": 1}

Should these sizes cover the existing cases in test/ck_tile/gemm_streamk?
If so, did we want to remove the old test files and replace them with these to avoid adding redundant tests?
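
For context on what these three sizes exercise, here is a small arithmetic sketch. It assumes the `128x128x64` field in the instance names is the block tile in M x N x K order, which the PR does not state explicitly; `tile_stats` is a hypothetical helper, not part of Tile Engine.

```python
# Block tile taken from the "128x128x64" field of the instance names.
# Reading it as M x N x K is an assumption.
TILE_M, TILE_N, TILE_K = 128, 128, 64

# Problem sizes quoted in the diff hunk above.
PROBLEMS = [
    {"m": 256, "n": 256, "k": 128, "split_k": 1},
    {"m": 512, "n": 256, "k": 256, "split_k": 1},
    {"m": 256, "n": 512, "k": 256, "split_k": 1},
]


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def tile_stats(problem):
    """Count output tiles and K-loop iterations for one problem size."""
    tiles = ceil_div(problem["m"], TILE_M) * ceil_div(problem["n"], TILE_N)
    k_iters = ceil_div(problem["k"], TILE_K)
    exact = (
        problem["m"] % TILE_M == 0
        and problem["n"] % TILE_N == 0
        and problem["k"] % TILE_K == 0
    )
    return {
        "tiles": tiles,
        "k_iters": k_iters,
        "total_iters": tiles * k_iters,
        "exact_multiple": exact,
    }


if __name__ == "__main__":
    for problem in PROBLEMS:
        print(problem, tile_stats(problem))
```

Under that assumption, all three sizes are exact multiples of the tile, so none of them exercises a partial edge tile.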

@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch 2 times, most recently from 7db8519 to 0e6e6a7 Compare January 9, 2026 19:52
@arai713 arai713 marked this pull request as ready for review January 9, 2026 19:52
- **SKIPPED**: Kernel validation returned "Arguments not supported" (expected for certain problem sizes/configurations) ⚠️
- **FAILED**: Actual error or incorrect computation results ❌

When a kernel's `IsSupportedArgument()` check fails (e.g., due to vector alignment requirements, dimension constraints, or padding limitations), the test is automatically skipped rather than failed. This allows comprehensive testing across various problem sizes while gracefully handling configurations that don't meet specific kernel requirements.
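
The skip-versus-fail behavior described above can be sketched as follows. This is only an illustration of the decision logic, not the test harness from this PR: `classify`, `run_kernel`, and the tolerance are hypothetical, and `PASSED` is assumed to be the third outcome alongside the two listed.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "PASSED"
    SKIPPED = "SKIPPED"
    FAILED = "FAILED"


def classify(is_supported, run_kernel, reference, rtol=1e-2):
    """Map one test case to an outcome (hypothetical helper).

    is_supported: result of the kernel's support check for these arguments.
    run_kernel:   zero-argument callable returning the kernel output as a list.
    reference:    expected output as a list of floats.
    """
    if not is_supported:
        # Unsupported arguments are reported as skipped, not failed.
        return Outcome.SKIPPED
    try:
        result = run_kernel()
    except Exception:
        # Any runtime error counts as a failure.
        return Outcome.FAILED
    if len(result) != len(reference):
        return Outcome.FAILED
    close = all(
        abs(r - e) <= rtol * max(1.0, abs(e)) for r, e in zip(result, reference)
    )
    return Outcome.PASSED if close else Outcome.FAILED
```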

My opinion is there should never be a set of arguments passed to the test suite that are not supported.
We should have a set of black and white tests that we know will either absolutely pass or absolutely fail.

Consider a scenario where the implementation of IsSupportedArgument() is changed, for instance someone unintentionally reduces the vector alignment requirements: valid kernels will simply be skipped rather than fail, so we might miss the regression.

IsSupportedArgument() is pretty fragile in a sense because it calls a lot of getter functions from all over the place.
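
The concern above can be made concrete with a toy model. Everything here is hypothetical: `original_check` and `regressed_check` stand in for a support check before and after an unintended change, and the kernel run itself is elided. The point is only that a lenient suite stays green when the check starts rejecting a valid case, while a strict suite, which expects every listed case to be supported, reports a failure.

```python
def original_check(k):
    """Hypothetical support check: K must be a multiple of 8."""
    return k % 8 == 0


def regressed_check(k):
    """Same check after an unintended change that makes it stricter."""
    return k % 256 == 0


def run_suite(cases, is_supported, strict):
    """Return (case, outcome) pairs for a list of K values.

    strict=False: unsupported cases are SKIPPED (behavior under discussion).
    strict=True:  unsupported cases are FAILED, because every case in the
                  suite is expected to be supported.
    """
    results = []
    for k in cases:
        if is_supported(k):
            results.append((k, "PASSED"))  # kernel run elided in this sketch
        else:
            results.append((k, "FAILED" if strict else "SKIPPED"))
    return results


def has_failures(results):
    return any(outcome == "FAILED" for _, outcome in results)


CASES = [128, 256]  # both valid under the original check

if __name__ == "__main__":
    print(run_suite(CASES, regressed_check, strict=False))
    print(run_suite(CASES, regressed_check, strict=True))
```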

This change adds an implementation for generating Stream-K tests using Tile Engine.
This will generate various test executables for different combinations based on the
config files. This addition has simple tests running for bf16 and fp16, with both
atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation
of Stream-K in Tile Engine.
@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch from 0e6e6a7 to 89e4aba Compare January 12, 2026 19:57
Comment on lines +250 to +270
# Enable parallel compilation optimizations
# Set up job pools for better parallel compilation control
set_property(GLOBAL PROPERTY JOB_POOLS
    compile_heavy=4   # Limit heavy compilations to prevent OOM
    compile_normal=16 # Allow more parallel normal compilations
)

# Enable compiler cache if available and explicitly requested
# Disabled by default due to permission issues in CI environments
option(ENABLE_CCACHE_TESTS "Enable ccache for test compilation" OFF)
if(ENABLE_CCACHE_TESTS)
    find_program(CCACHE_PROGRAM ccache)
    if(CCACHE_PROGRAM)
        set(CMAKE_CXX_COMPILER_LAUNCHER ${CCACHE_PROGRAM})
        message(STATUS "Using ccache for faster test compilation")
    else()
        message(WARNING "ccache requested but not found")
    endif()
else()
    message(STATUS "ccache disabled for tests (use -DENABLE_CCACHE_TESTS=ON to enable)")
endif()

Tabbed in.

Also there are job pools set up here. Where are they used?

Comment on lines +20 to +21
- **`--gen_individual`**: Generate all kernel headers in parallel during CMake configuration
- **`--gen_single`**: Generate individual kernel header for each configuration

The difference between these two is not very clear to me.


## Data Type Support
- ✅ **fp16, bf16**: Fully supported - all layouts (rcr, rrr, ccr, crr)
- ❌ **fp64**: Not supported (hardware MFMA limitation)

fp64 mfma is supported on gfx9 (minus gfx908). Is this more of a CK limitation?
