Adding dispatcher architecture #3300
base: develop
Conversation
spolifroni-amd
left a comment
The READMEs etc. look fine. I can't tell whether this is meant to be used internally by those contributing to the project, or by anyone using the library. If it's for anyone, then a changelog entry is needed to describe arch_specs.json. Otherwise this is fine.
afagaj
left a comment
It's worth adding a CHANGELOG.md entry to announce this change.
Pull request overview
Introduce CK Tile Dispatcher architecture with new C++ and Python APIs, codegen tooling, kernel registry, and comprehensive GEMM/Conv examples, plus basic validation and benchmarking.
- Add GEMM and Convolution examples in Python and C++ demonstrating registry/dispatcher usage, validation, and benchmarking.
- Provide shared Python utilities for convolution (conv_utils.py) and codegen assets (requirements, scripts).
- Expand documentation (READMEs) for quick start and example overviews.
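To make the registry/dispatcher split in the overview concrete, here is a minimal sketch of the general pattern. Every name in it (`KernelKey`, `Registry`, `Dispatcher`, the tile-divisibility heuristic) is hypothetical and chosen for illustration only; it is not the API added by this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# NOTE: all names here are hypothetical illustrations of the pattern,
# not the CK Tile dispatcher API.


@dataclass(frozen=True)
class KernelKey:
    dtype: str
    tile: Tuple[int, int, int]  # (tile_m, tile_n, tile_k)


class Registry:
    """Maps kernel keys to callables."""

    def __init__(self) -> None:
        self._kernels: Dict[KernelKey, Callable[[int, int, int], str]] = {}

    def register(self, key: KernelKey, fn: Callable[[int, int, int], str]) -> None:
        self._kernels[key] = fn

    def candidates(self, dtype: str) -> List[KernelKey]:
        return [key for key in self._kernels if key.dtype == dtype]

    def get(self, key: KernelKey) -> Callable[[int, int, int], str]:
        return self._kernels[key]


class Dispatcher:
    """Picks a registered kernel for a problem and runs it."""

    def __init__(self, registry: Registry) -> None:
        self.registry = registry

    def select(self, dtype: str, m: int, n: int, k: int) -> Optional[KernelKey]:
        # Toy heuristic (an assumption): keep tiles that evenly divide the
        # problem, then prefer the largest tile volume.
        fits = [
            c for c in self.registry.candidates(dtype)
            if m % c.tile[0] == 0 and n % c.tile[1] == 0 and k % c.tile[2] == 0
        ]
        return max(fits, key=lambda c: c.tile[0] * c.tile[1] * c.tile[2], default=None)

    def run(self, dtype: str, m: int, n: int, k: int) -> str:
        key = self.select(dtype, m, n, k)
        if key is None:
            raise LookupError(f"no kernel for {dtype} {m}x{n}x{k}")
        return self.registry.get(key)(m, n, k)


if __name__ == "__main__":
    registry = Registry()
    registry.register(KernelKey("fp16", (128, 128, 32)), lambda m, n, k: "small-tile kernel")
    registry.register(KernelKey("fp16", (256, 256, 64)), lambda m, n, k: "large-tile kernel")
    print(Dispatcher(registry).run("fp16", 1024, 1024, 1024))
```

Keeping selection (`select`) separate from storage (`Registry`) is what lets one registry serve several selection strategies, which is the idea behind the multi-registry and heuristics examples listed in the file summary.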
Reviewed changes
Copilot reviewed 74 out of 160 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| dispatcher/examples/gemm/python/README.md | Adds quick start and Python GEMM examples overview and usage. |
| dispatcher/examples/gemm/python/01_basic_gemm.py | Basic GEMM example showing manual workflow (config, codegen, registry, dispatch). |
| dispatcher/examples/gemm/python/02_batch_gemm.py | Demonstrates running batches of GEMM problems with padding support. |
| dispatcher/examples/gemm/python/03_benchmark.py | Adds benchmarking script with warmup and TFLOPS reporting. |
| dispatcher/examples/gemm/python/04_validation.py | Validates GPU GEMM outputs against NumPy references with tolerances. |
| dispatcher/examples/gemm/python/05_numpy_integration.py | NumPy integration wrapper (GPUMatmul) and small demos. |
| dispatcher/examples/gemm/python/06_json_export.py | Exports registry/kernels metadata to JSON (and consumes C++ JSON if present). |
| dispatcher/examples/gemm/python/07_preshuffle.py | Preshuffle pipeline example with larger tiles and intrawave scheduler. |
| dispatcher/examples/gemm/python/08_multi_d.py | Multi-D fused GEMM example with CPU simulation and GPU base op timing. |
| dispatcher/examples/gemm/python/09_multi_registry.py | Multiple registries for compute/memory/latency optimized workloads. |
| dispatcher/examples/gemm/cpp/README.md | Adds C++ GEMM examples quick start and overview of example set. |
| dispatcher/examples/gemm/cpp/01_basic_gemm.cpp | Declarative kernel set example and simple GEMM run/verify. |
| dispatcher/examples/gemm/cpp/02_multi_size.cpp | Runs multiple problem sizes and reports TFLOPS. |
| dispatcher/examples/gemm/cpp/03_benchmark.cpp | Benchmark runner with stats (min/median/mean). |
| dispatcher/examples/gemm/cpp/04_validation.cpp | CPU reference and validation against GPU results. |
| dispatcher/examples/gemm/cpp/05_heuristics.cpp | Custom heuristic-based kernel selection demonstration. |
| dispatcher/examples/gemm/cpp/06_json_export.cpp | Registry JSON export and kernel set declarations. |
| dispatcher/examples/gemm/cpp/07_preshuffle.cpp | Preshuffle GEMM example with verification. |
| dispatcher/examples/gemm/cpp/08_multi_d.cpp | Multi-D fused concept demo using standard GEMM run. |
| dispatcher/examples/gemm/cpp/09_multi_registry.cpp | Multiple registries and dispatchers plus summary pattern. |
| dispatcher/examples/conv/python/conv_utils.py | Core Python utilities for conv signature/algorithm/arch, codegen, runners, and validation. |
| dispatcher/examples/conv/python/README.md | Adds Python Conv examples quick start and detailed guide. |
| dispatcher/examples/conv/python/02_conv2d_fwd.py | 2D forward conv example with optional CPU verification and GPU run. |
| dispatcher/examples/conv/python/03_conv3d_fwd.py | 3D forward conv example with CPU reference and GPU run. |
| dispatcher/examples/conv/python/04_conv2d_bwd_data.py | 2D backward data conv with CPU reference and optional GPU execution. |
| dispatcher/examples/conv/python/05_conv2d_bwd_weight.py | 2D backward weight conv with CPU reference and optional GPU execution. |
| dispatcher/examples/conv/python/06_benchmark.py | Conv benchmarking across small problems; optional CPU reference. |
| dispatcher/examples/conv/python/07_validation.py | Conv validation suite vs CPU with detailed analysis. |
| dispatcher/examples/conv/python/08_json_export.py | Export conv registry to JSON with basic stats and examples. |
| dispatcher/examples/conv/python/09_multi_registry.py | Multiple registries for conv workloads with selection heuristics. |
| dispatcher/examples/conv/python/10_conv3d_forward.py | 3D forward conv GPU runner with timing and TFLOPS. |
| dispatcher/examples/conv/python/11_bwd_data.py | Backward data API demo with runners (note: codegen in progress). |
| dispatcher/examples/conv/python/12_bwd_weight.py | Backward weight API demo with dedicated runner (separate lib). |
| dispatcher/codegen/requirements.txt | Adds Python deps for codegen and optional tooling. |
| dispatcher/codegen/minimal_test_config.json | Minimal tile/trait config for test kernel generation. |
| dispatcher/codegen/generate_test_kernels.sh | Shell script to generate a minimal set of CK Tile kernels. |
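Several of the benchmark examples above report TFLOPS after warmup runs, with min/median/mean statistics. The sketch below shows how such numbers are typically derived; the helper names (`gemm_flops`, `tflops`, `benchmark`) are hypothetical and are not taken from the PR's scripts. It assumes the usual GEMM operation count of 2·M·N·K.

```python
import statistics
import time
from typing import Callable, Dict


def gemm_flops(m: int, n: int, k: int) -> int:
    # One multiply and one add per (m, n, k) triple.
    return 2 * m * n * k


def tflops(m: int, n: int, k: int, seconds: float) -> float:
    return gemm_flops(m, n, k) / seconds / 1e12


def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> Dict[str, float]:
    # Warmup runs are executed but not timed.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        # For a real GPU kernel, fn must block until the device finishes,
        # otherwise only the asynchronous launch is timed.
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
    }


if __name__ == "__main__":
    stats = benchmark(lambda: sum(range(100_000)))
    print({name: f"{value * 1e3:.3f} ms" for name, value in stats.items()})
    # A 1024^3 GEMM that takes 1 ms corresponds to about 2.15 TFLOPS.
    print(f"{tflops(1024, 1024, 1024, 1e-3):.2f} TFLOPS")
```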
Thanks! A few notes while I'm just starting to look -
| CMake | 3.16+ | `cmake --version` |
| Python | 3.8+ | `python3 --version` |
| NumPy | Any | `pip show numpy` |
| hipcc | (from ROCm) | `/opt/rocm/bin/hipcc --version` |
I think that some time around ROCm 6.4 we were encouraged to switch from hipcc to the clang bundled with the ROCm distribution.
dispatcher/README.md
Outdated
- **gfx942** - MI300X, MI300A (Instinct MI300 series) ← Recommended
- **gfx950** - MI350 series
- **gfx90a** - MI200 series (MI250, MI250X)
- **gfx1201** - RDNA4 series
Make sure this is consistent with the examples. I think the printed messages currently refer to gfx11.
added gfx11
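As an illustration of why the README's target list and the examples need to agree, here is a minimal sketch of architecture-based kernel filtering. The target names and descriptions are the four from the README excerpt in this thread; the kernel entries, the `archs` field, and `filter_kernels` are hypothetical and do not reflect the schema of arch_specs.json or real kernel capabilities.

```python
from typing import Dict, List

# GPU targets as listed in the README excerpt under review.
SUPPORTED_ARCHS: Dict[str, str] = {
    "gfx942": "MI300X, MI300A (Instinct MI300 series)",
    "gfx950": "MI350 series",
    "gfx90a": "MI200 series (MI250, MI250X)",
    "gfx1201": "RDNA4 series",
}

# Made-up kernel entries for illustration only; the names and the
# architecture lists are not real data.
KERNELS: List[Dict[str, object]] = [
    {"name": "kernel_a", "archs": ["gfx942", "gfx950"]},
    {"name": "kernel_b", "archs": ["gfx942", "gfx90a", "gfx1201"]},
    {"name": "kernel_c", "archs": ["gfx90a"]},
]


def filter_kernels(kernels: List[Dict[str, object]], arch: str) -> List[str]:
    """Return the names of kernels whose arch list contains `arch`."""
    if arch not in SUPPORTED_ARCHS:
        # Fail loudly so docs, printed messages, and the spec stay in sync.
        raise ValueError(f"unknown architecture: {arch}")
    return [str(k["name"]) for k in kernels if arch in k["archs"]]


if __name__ == "__main__":
    for arch, description in SUPPORTED_ARCHS.items():
        print(f"{arch} ({description}): {filter_kernels(KERNELS, arch)}")
```

Driving both the documentation list and the runtime check from one table is one way to avoid the mismatch raised in this thread.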
```bash
# Install NumPy (required for Python examples)
pip install numpy
```
One more good practice to encourage: use `uv venv` to create virtual environments and `uv pip` to install packages.
added note
Fixing typos
1ec1d0d to 53d61e0
Proposed changes
This PR introduces new dispatching infrastructure for CK Tile, building on prior Tile Engine work. It lets users isolate and run specific kernels through C++ and Python APIs for GEMMs (universal, preshuffle, and multi-D variants). It also adds unified code-generation tools, GPU-architecture-based kernel filtering, a kernel registry mechanism, and a set of examples showing how to integrate with other frameworks (C++/Python), together with basic unit and integration tests.
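For readers unfamiliar with the multi-D variant, the sketch below gives a plain-Python CPU reference for a GEMM whose result is combined elementwise with extra D tensors, plus a tolerance check of the kind a validation test would apply. The additive epilogue, the function names, and the tolerances are assumptions made for illustration; the PR's kernels may fuse a different elementwise operation.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def multi_d_gemm_reference(a: Matrix, b: Matrix, ds: Sequence[Matrix]) -> Matrix:
    # Assumed epilogue: E = A @ B + D0 + D1 + ...
    c = matmul(a, b)
    return [
        [c[i][j] + sum(d[i][j] for d in ds) for j in range(len(c[0]))]
        for i in range(len(c))
    ]


def allclose(x: Matrix, y: Matrix, rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    # math.isclose passes when |p - q| <= max(rtol * max(|p|, |q|), atol),
    # which differs slightly from NumPy's atol + rtol * |q| rule.
    return all(
        math.isclose(p, q, rel_tol=rtol, abs_tol=atol)
        for row_x, row_y in zip(x, y)
        for p, q in zip(row_x, row_y)
    )


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    d0 = [[1.0, 1.0], [1.0, 1.0]]
    d1 = [[0.5, 0.5], [0.5, 0.5]]
    print(multi_d_gemm_reference(a, b, [d0, d1]))
```

In a validation test, the output of the GPU kernel would take the place of one argument to `allclose`, with the reference result as the other.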
Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- `clang-format` on all changed files
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered.