[ROCm] Add AMD Instinct MI300X/MI325X/MI350X/MI355X GPU support#3

Open
andyluo7 wants to merge 1 commit into RightNow-AI:main from andyluo7:amd-gpu-support

Conversation

@andyluo7

Summary

Add GPU detection and performance specs for AMD Instinct GPUs (MI300X, MI325X, MI350X, MI355X) to enable AutoKernel on ROCm.

Problem

On ROCm, torch.cuda.get_device_properties() returns an empty device name and clock_rate=0, causing:

  1. GPU not matched in _KNOWN_GPUS table → falls back to estimation
  2. clock_rate / 1e6 in fallback path → incorrect zero/near-zero TFLOPS estimates
  3. Roofline analysis and performance metrics are meaningless
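To make the failure mode concrete, here is a minimal sketch (with hypothetical numbers and an illustrative formula, not the actual bench.py code) of how a zeroed `clock_rate` collapses a clock-based fallback estimate:

```python
# Hypothetical fallback estimator: peak FP16 TFLOPS from SM count and clock.
# On ROCm, get_device_properties() can report clock_rate=0, so the
# clock_ghz term becomes 0 and the whole estimate collapses to ~0 TFLOPS.
def estimate_peak_tflops(sm_count: int, clock_rate_khz: int,
                         flops_per_sm_per_clock: int = 512) -> float:
    """Rough peak estimate: SMs * FLOPs/clock * clock (GHz) / 1000."""
    clock_ghz = clock_rate_khz / 1e6  # clock_rate is reported in kHz
    return sm_count * flops_per_sm_per_clock * clock_ghz / 1000.0

# NVIDIA-style properties produce a sane estimate (H100-like inputs):
print(estimate_peak_tflops(132, 1_980_000))  # ≈ 133.8 TFLOPS
# ROCm properties with clock_rate=0 produce exactly 0.0:
print(estimate_peak_tflops(304, 0))          # → 0.0
```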

Solution

GPU Database (bench.py + profile.py)

  • Add MI300X (1307.4 TFLOPS, 5300 GB/s, 256 MB L2)
  • Add MI325X (1307.4 TFLOPS, 6000 GB/s, 256 MB L2)
  • Add MI350X (2300.0 TFLOPS, 8000 GB/s, 256 MB L2)
  • Add MI355X (2300.0 TFLOPS, 8000 GB/s, 256 MB L2)

ROCm-aware GPU Detection (bench.py)

  • New _KNOWN_AMD_GPUS dict keyed by gcnArchName prefix (e.g. gfx942 → MI300X)
  • When device name is empty, fall back to props.gcnArchName for identification
  • Guard clock_rate access behind hasattr + > 0 check

Same fixes applied to profile.py fallback detector

Testing

Tested end-to-end on AMD Instinct MI300X (gfx942, ROCm 6.3, PyTorch 2.9):

=== GPU INFO ===
gpu_name: AMD Instinct MI300X
gpu_sm_count: 304
gpu_memory_gb: 192.0
gpu_peak_tflops_fp16: 1307.4
gpu_peak_bandwidth_gb_s: 5300.0
gpu_l2_cache_mb: 256.0

=== CORRECTNESS ===
smoke_test: PASS
FP16 shape_sweep: all PASS
BF16 shape_sweep: all PASS
numerical_stability: PASS
determinism: PASS
edge_cases: PASS

=== PERFORMANCE (xlarge) ===
PyTorch baseline: 607.9 TFLOPS (46.5% peak)
Benchmark harness: runs end-to-end ✅

Zero NVIDIA impact

  • Existing _KNOWN_GPUS entries unchanged
  • clock_rate path only skipped when clock_rate is 0 or missing (never happens on NVIDIA)
  • No new dependencies

- Add MI300X, MI325X, MI350X, MI355X to _KNOWN_GPUS table with correct
  peak FP16 TFLOPS, memory bandwidth, and L2 cache specs
- Add gcnArchName-based GPU detection for ROCm (device name is often
  empty on ROCm; gcnArchName like 'gfx942' is always available)
- Guard clock_rate access behind hasattr check (ROCm devices report
  clock_rate=0, causing division issues in fallback estimation)
- Apply same fixes to profile.py fallback detector

Tested on AMD Instinct MI300X (gfx942) with ROCm 6.3 / PyTorch 2.9:
- GPU correctly detected as 'AMD Instinct MI300X'
- All FP16/BF16 correctness tests PASS
- Benchmark harness runs end-to-end
- PyTorch baseline: 607.9 TFLOPS on xlarge matmul (46.5% peak)
@andyluo7 (Author)

✅ Verified on AMD Instinct MI350X (gfx950 / CDNA4)

Tested end-to-end on 8x MI350X (ROCm 7.2, PyTorch 2.10.0+rocm7.0, Triton 3.6.0):

GPU Detection

gpu_name: AMD Instinct MI350X VF
gpu_sm_count: 256
gpu_memory_gb: 287.6
gpu_peak_tflops_fp16: 2300.0
gpu_peak_tflops_bf16: 2300.0
gpu_peak_bandwidth_gb_s: 8000.0
gpu_l2_cache_mb: 256.0
gpu_compute_capability: 9.5

GPU correctly identified via gcnArchName: gfx950 → matched "MI350X" in _KNOWN_GPUS.

Correctness

Test Result
Smoke test ✅ PASS
FP16 shape sweep (10/10) ✅ PASS
BF16 shape sweep (10/10) ✅ PASS
FP32 shape sweep (7/10) ⚠️ 3 fail (xlarge/deep_k/llm_mlp — tight atol, expected)
Numerical stability ✅ PASS
Determinism ✅ PASS
Edge cases ✅ PASS

Performance (starter kernel, not optimized)

size            kernel_us   pytorch_us    speedup     tflops    %peak
------------------------------------------------------------------
tiny                 6.72         9.91     1.473x      0.624     0.0%
small               13.22         9.76     0.738x     20.298     0.9%
large               48.13        32.15     0.668x    356.930    15.5%
xlarge             323.68       129.86     0.401x    424.614    18.5%
llm_mlp            908.00       452.45     0.498x    406.791    17.7%

PyTorch baseline xlarge BF16: 956.5 TFLOPS (41.6% of 2300 TFLOPS peak).

Environment

  • ROCm 7.2.0, PyTorch 2.10.0+rocm7.0, Triton 3.6.0
  • Native gfx950 support (no HSA_OVERRIDE_GFX_VERSION needed)
  • torch.cuda.get_device_properties() returns gcnArchName: gfx950:sramecc+:xnack-

Previously also verified on MI300X (gfx942, ROCm 6.3, PyTorch 2.9) — see PR description.
