[ROCm] Add AMD Instinct MI300X/MI325X/MI350X/MI355X GPU support#3

Open
andyluo7 wants to merge 1 commit into RightNow-AI:main from andyluo7:amd-gpu-support

Conversation

@andyluo7

Summary

Add GPU detection and performance specs for AMD Instinct GPUs (MI300X, MI325X, MI350X, MI355X) to enable AutoKernel on ROCm.

Problem

On ROCm, torch.cuda.get_device_properties() returns an empty device name and clock_rate=0, causing:

  1. GPU not matched in _KNOWN_GPUS table → falls back to estimation
  2. clock_rate / 1e6 in fallback path → incorrect zero/near-zero TFLOPS estimates
  3. Roofline analysis and performance metrics are meaningless
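To make the failure mode concrete, here is a minimal sketch (with hypothetical numbers and an illustrative formula, not the actual bench.py code) of how a zeroed `clock_rate` collapses a clock-based fallback estimate:

```python
# Hypothetical fallback estimator: peak FP16 TFLOPS from SM count and clock.
# On ROCm, get_device_properties() can report clock_rate=0, so the
# clock_ghz term becomes 0 and the whole estimate collapses to ~0 TFLOPS.
def estimate_peak_tflops(sm_count: int, clock_rate_khz: int,
                         flops_per_sm_per_clock: int = 512) -> float:
    """Rough peak estimate: SMs * FLOPs/clock * clock (GHz) / 1000."""
    clock_ghz = clock_rate_khz / 1e6  # clock_rate is reported in kHz
    return sm_count * flops_per_sm_per_clock * clock_ghz / 1000.0

# NVIDIA-style properties produce a sane estimate (H100-like inputs):
print(estimate_peak_tflops(132, 1_980_000))  # ≈ 133.8 TFLOPS
# ROCm properties with clock_rate=0 produce exactly 0.0:
print(estimate_peak_tflops(304, 0))          # → 0.0
```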

Solution

GPU Database (bench.py + profile.py)

  • Add MI300X (1307.4 TFLOPS, 5300 GB/s, 256 MB L2)
  • Add MI325X (1307.4 TFLOPS, 6000 GB/s, 256 MB L2)
  • Add MI350X (2300.0 TFLOPS, 8000 GB/s, 256 MB L2)
  • Add MI355X (2300.0 TFLOPS, 8000 GB/s, 256 MB L2)

ROCm-aware GPU Detection (bench.py)

  • New _KNOWN_AMD_GPUS dict keyed by gcnArchName prefix (e.g. gfx942 → MI300X)
  • When device name is empty, fall back to props.gcnArchName for identification
  • Guard clock_rate access behind hasattr + > 0 check

Same fixes applied to profile.py fallback detector

Testing

Tested end-to-end on AMD Instinct MI300X (gfx942, ROCm 6.3, PyTorch 2.9):

=== GPU INFO ===
gpu_name: AMD Instinct MI300X
gpu_sm_count: 304
gpu_memory_gb: 192.0
gpu_peak_tflops_fp16: 1307.4
gpu_peak_bandwidth_gb_s: 5300.0
gpu_l2_cache_mb: 256.0

=== CORRECTNESS ===
smoke_test: PASS
FP16 shape_sweep: all PASS
BF16 shape_sweep: all PASS
numerical_stability: PASS
determinism: PASS
edge_cases: PASS

=== PERFORMANCE (xlarge) ===
PyTorch baseline: 607.9 TFLOPS (46.5% peak)
Benchmark harness: runs end-to-end ✅

Zero NVIDIA impact

  • Existing _KNOWN_GPUS entries unchanged
  • clock_rate path only skipped when clock_rate is 0 or missing (never happens on NVIDIA)
  • No new dependencies

- Add MI300X, MI325X, MI350X, MI355X to _KNOWN_GPUS table with correct
  peak FP16 TFLOPS, memory bandwidth, and L2 cache specs
- Add gcnArchName-based GPU detection for ROCm (device name is often
  empty on ROCm; gcnArchName like 'gfx942' is always available)
- Guard clock_rate access behind hasattr check (ROCm devices report
  clock_rate=0, causing division issues in fallback estimation)
- Apply same fixes to profile.py fallback detector

Tested on AMD Instinct MI300X (gfx942) with ROCm 6.3 / PyTorch 2.9:
- GPU correctly detected as 'AMD Instinct MI300X'
- All FP16/BF16 correctness tests PASS
- Benchmark harness runs end-to-end
- PyTorch baseline: 607.9 TFLOPS on xlarge matmul (46.5% peak)
@andyluo7 (Author)

✅ Verified on AMD Instinct MI350X (gfx950 / CDNA4)

Tested end-to-end on 8x MI350X (ROCm 7.2, PyTorch 2.10.0+rocm7.0, Triton 3.6.0):

GPU Detection

gpu_name: AMD Instinct MI350X VF
gpu_sm_count: 256
gpu_memory_gb: 287.6
gpu_peak_tflops_fp16: 2300.0
gpu_peak_tflops_bf16: 2300.0
gpu_peak_bandwidth_gb_s: 8000.0
gpu_l2_cache_mb: 256.0
gpu_compute_capability: 9.5

GPU correctly identified via gcnArchName: gfx950 → matched "MI350X" in _KNOWN_GPUS.

Correctness

Test Result
Smoke test ✅ PASS
FP16 shape sweep (10/10) ✅ PASS
BF16 shape sweep (10/10) ✅ PASS
FP32 shape sweep (7/10) ⚠️ 3 fail (xlarge/deep_k/llm_mlp — tight atol, expected)
Numerical stability ✅ PASS
Determinism ✅ PASS
Edge cases ✅ PASS

Performance (starter kernel, not optimized)

size            kernel_us   pytorch_us    speedup     tflops    %peak
------------------------------------------------------------------
tiny                 6.72         9.91     1.473x      0.624     0.0%
small               13.22         9.76     0.738x     20.298     0.9%
large               48.13        32.15     0.668x    356.930    15.5%
xlarge             323.68       129.86     0.401x    424.614    18.5%
llm_mlp            908.00       452.45     0.498x    406.791    17.7%

PyTorch baseline xlarge BF16: 956.5 TFLOPS (41.6% of 2300 TFLOPS peak).

Environment

  • ROCm 7.2.0, PyTorch 2.10.0+rocm7.0, Triton 3.6.0
  • Native gfx950 support (no HSA_OVERRIDE_GFX_VERSION needed)
  • torch.cuda.get_device_properties() returns gcnArchName: gfx950:sramecc+:xnack-

Previously also verified on MI300X (gfx942, ROCm 6.3, PyTorch 2.9) — see PR description.
