
Conversation

@William-An (Contributor) commented Jul 13, 2025

William-An marked this pull request as draft July 13, 2025 16:41
@tgrogers (Contributor) left a comment:

This all looks good, but should we put it in the uBench folder?

William-An marked this pull request as ready for review January 8, 2026 18:34
@William-An (Contributor, Author) commented Jan 8, 2026

The CI job may take a long time, since it builds all of the GEMM kernels for GMMA. Perhaps we should instead build a single kernel from the test source files (i.e., lat_gmma_test.cu and MaxFlops_gmma_test.cu)?

Or should we let the cluster build this?
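
If we go the single-kernel route, the driver could be as small as the sketch below. The function name and body are hypothetical stand-ins for one of the configurations declared in lat_gmma.h; this only illustrates the idea, it is not the actual test source.

```cuda
#include <cstdio>

// Stand-in for one of the test entry points declared in lat_gmma.h;
// the name and body are hypothetical, present only so the sketch links.
int lat_gmma_f32_single_config() { return 0; }

int main() {
    // Build and run a single configuration rather than every kernel,
    // and propagate the result through the exit code for CI.
    if (lat_gmma_f32_single_config() != 0) {
        printf("GMMA latency test failed\n");
        return 1;
    }
    printf("GMMA latency test passed\n");
    return 0;  // 0 on success
}
```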

William-An requested review from Copilot and tgrogers January 8, 2026 18:46
Copilot AI left a comment:

Pull request overview

This PR adds new GPU microbenchmarking capabilities for TMA and GMMA operations while fixing exit codes and enabling dynamic CUDA runtime linking. The changes modernize the build system to support parallel compilation and add comprehensive test coverage for Hopper architecture (SM90) matrix operations.

Key Changes:

  • Corrected return values from 1 to 0 for successful execution across all microbenchmark programs
  • Added GMMA (General Matrix Multiply-Accumulate) latency and throughput microbenchmarks with comprehensive test coverage for multiple data types (a sketch of the latency-measurement pattern follows this list)
  • Enabled parallel NVCC compilation while preserving embedded PTX code generation
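
For context on how such latency microbenchmarks are typically structured, here is a minimal sketch of the clock64()-timed dependent-chain pattern. A plain FMA chain stands in for the GMMA operation (which on SM90 would be issued through wgmma instructions), so the kernel, names, and iteration count are illustrative rather than the PR's actual code; the sketch also shows the corrected convention of returning 0 on success.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dependent-chain latency kernel: every iteration consumes the previous
// result, so elapsed cycles / ITERS approximates the per-operation latency.
__global__ void lat_kernel(float *out, long long *cycles) {
    const int ITERS = 1024;
    float acc = threadIdx.x * 1.0f;
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        acc = fmaf(acc, 1.000001f, 0.5f);  // serially dependent FMA chain
    long long stop = clock64();
    out[threadIdx.x] = acc;                // keep the result live
    if (threadIdx.x == 0) *cycles = stop - start;
}

int main() {
    float *d_out;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));
    lat_kernel<<<1, 32>>>(d_out, d_cycles);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("latency: %.2f cycles per operation\n", h_cycles / 1024.0);
    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;  // 0 on success, matching the corrected exit-code convention
}
```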

Reviewed changes

Copilot reviewed 299 out of 331 changed files in this pull request and generated 1 comment.

Summary per file:

File | Description
Multiple *.cu files | Fixed incorrect return value from 1 to 0 for successful program completion
lat_gmma/ directory | Added comprehensive GMMA latency microbenchmark infrastructure with support for F32, F16, and INT32 accumulators
MaxFlops_gmma/ directory | Added GMMA throughput microbenchmark kernels for various data type combinations
lat_gmma_common.h | Introduced shared kernel templates and helper macros for GMMA latency testing (see the sketch after this table)
lat_gmma.h | Declared functions for 385 different GMMA test configurations
lat_gmma/Makefile | Configured the build system for the SM90a architecture with C++17 and parallel compilation
.gitignore | Added ignore patterns for build artifacts (.a and .ptx files)
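
To make the shared-infrastructure idea concrete, here is a minimal sketch of what a kernel template plus a test-generating macro might look like. All names, shapes, and the kernel body are illustrative assumptions, not the actual contents of lat_gmma_common.h; the real kernels would issue the GMMA (wgmma) operation where the placeholder add appears.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel template parameterized on the accumulator type.
template <typename AccumT>
__global__ void gmma_lat_kernel(AccumT *out) {
    AccumT acc = static_cast<AccumT>(threadIdx.x);
    // Placeholder dependent chain; the real kernels would issue wgmma here.
    for (int i = 0; i < 256; ++i)
        acc = acc + static_cast<AccumT>(1);
    out[threadIdx.x] = acc;
}

// Hypothetical macro that stamps out one host-callable test per configuration.
#define DEFINE_GMMA_LAT_TEST(NAME, ACCUM_T)              \
    int NAME() {                                         \
        ACCUM_T *d_out;                                   \
        cudaMalloc(&d_out, 128 * sizeof(ACCUM_T));        \
        gmma_lat_kernel<ACCUM_T><<<1, 128>>>(d_out);      \
        cudaError_t err = cudaDeviceSynchronize();        \
        cudaFree(d_out);                                  \
        return err == cudaSuccess ? 0 : 1;                \
    }

// One macro invocation per configuration; lat_gmma.h would then declare
// the generated functions for the test drivers to call.
DEFINE_GMMA_LAT_TEST(lat_gmma_f32_example, float)
DEFINE_GMMA_LAT_TEST(lat_gmma_int32_example, int)

int main() {
    return lat_gmma_f32_example();  // 0 on success
}
```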
Comments suppressed due to low confidence (1)

src/cuda/GPU_Microbenchmark/ubench/core/lat_gmma/lat_gmma_common.h:1

  • Corrected grammar from 'Simple a test kernel' to 'Simple test kernel'.


@William-An (Contributor, Author) commented:

CI is failing due to insufficient space on the runner.


Development

Successfully merging this pull request may close these issues.

Error while generating traces for GPU_Microbenchmark
