Add TMA unittest app #58
base: dev
tgrogers
left a comment
This all looks good, but should we put it in the uBench folder?
This CI build may take some time to build all the GEMM kernels for GMMA. Perhaps we should build a single kernel instead from the test source files (i.e., Or let the cluster build this?
Pull request overview
This PR adds new GPU microbenchmarking capabilities for TMA and GMMA operations while fixing exit codes and enabling dynamic CUDA runtime linking. The changes modernize the build system to support parallel compilation and add comprehensive test coverage for Hopper architecture (SM90) matrix operations.
Key Changes:
- Corrected return values from `1` to `0` for successful execution across all microbenchmark programs
- Added GMMA (General Matrix Multiply-Accumulate) latency and throughput microbenchmarks with comprehensive test coverage for multiple data types
- Enabled parallel NVCC compilation while preserving embedded PTX code generation
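The exit-code fix in the first bullet matters because shells and CI runners treat a non-zero exit status as failure. A minimal host-side sketch of the convention follows; `run_measurement` and `benchmark_exit_code` are hypothetical names used only for illustration and are not taken from the PR.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a microbenchmark's measurement phase.
// Returns true when the measurement completed without error.
bool run_measurement(int iterations) {
    return iterations > 0;
}

// Maps the outcome to a process exit code:
// EXIT_SUCCESS (0) on success, EXIT_FAILURE (non-zero) on failure.
int benchmark_exit_code(int iterations) {
    if (!run_measurement(iterations)) {
        std::fprintf(stderr, "benchmark failed\n");
        return EXIT_FAILURE;
    }
    std::printf("benchmark completed\n");
    return EXIT_SUCCESS;
}
```

A program's `main` would then end with `return benchmark_exit_code(n);`, so a successful run reports `0` to the calling script instead of `1`.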
Reviewed changes
Copilot reviewed 299 out of 331 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Multiple `*.cu` files | Fixed incorrect return value from 1 to 0 for successful program completion |
| `lat_gmma/` directory | Added comprehensive GMMA latency microbenchmark infrastructure with support for F32, F16, and INT32 accumulators |
| `MaxFlops_gmma/` directory | Added GMMA throughput microbenchmark kernels for various data type combinations |
| `lat_gmma_common.h` | Introduced shared kernel templates and helper macros for GMMA latency testing |
| `lat_gmma.h` | Defined function declarations for 385 different GMMA test configurations |
| `lat_gmma/Makefile` | Configured build system for SM90a architecture with C++17 and parallel compilation |
| `.gitignore` | Added ignore patterns for build artifacts (`.a` and `.ptx` files) |
Comments suppressed due to low confidence (1)
src/cuda/GPU_Microbenchmark/ubench/core/lat_gmma/lat_gmma_common.h:1
- Corrected grammar from 'Simple a test kernel' to 'Simple test kernel'.
/***************************************************************************************************
CI is failing due to insufficient space on the runner.