AITER



AITER (AI Tensor Engine for ROCm) is AMD's high-performance AI operator library, providing optimized GPU kernels for inference and training workloads on ROCm. It serves as a unified collection of production-ready operators that framework developers can integrate directly into their stacks.

Key Features

  • C++ and Python APIs — use operators from either level
  • Multiple kernel backends — Triton, Composable Kernel (CK), and hand-tuned ASM
  • Inference and training — not just serving kernels, but also training and GEMM+communication fused kernels
  • Framework-agnostic — integrate into vLLM, SGLang, or any custom framework

Ecosystem

AITER is the default kernel backend for LLM inference on AMD GPUs, integrated into the major serving frameworks and powering production workloads at scale.

Framework Integration

  • vLLM: default attention backend on ROCm (Production). Operators: MHA, MLA, Paged Attention, Fused MoE, GEMM, RMSNorm, RoPE+KVCache
  • SGLang: default on ROCm Docker (Production). Operators: Attention, Fused MoE, Block-scale GEMM, All-reduce, RMSNorm
  • ATOM: built natively on AITER (Active development). Operators: all AITER operators (attention, MoE, sampling, communication)
  • JAX: XLA FFI bridge, no PyTorch dependency (Experimental). Operators: MHA/FMHA, RMSNorm, BF16 GEMM
  • Various customer proprietary inference engines: kernel-level integration (Production). Operators: Attention, MoE, GEMM, quantization

Performance Highlights

  • MLA decode kernel: up to 17x
  • MHA prefill kernel: up to 14x
  • Block-scaled Fused MoE: up to 3x
  • Block-scaled GEMM: up to 2x
  • DeepSeek-R1 e2e (SGLang): 6,484 → 13,704 tok/s (2.1x)
  • JAX-AITER attention (MI350): 4.39x median

For detailed benchmarks, see the ATOM Benchmark Dashboard.

Supported Hardware

  • AMD Instinct MI300X: gfx942 (CDNA3), fully supported
  • AMD Instinct MI325X: gfx942 (CDNA3), fully supported
  • AMD Instinct MI350: gfx950 (CDNA4), supported
  • AMD Instinct MI355X: gfx950 (CDNA4), supported

Operators

AITER provides optimized kernels for attention, MoE, GEMM, normalization, quantization, communication, and more. Each operator has unit tests under op_tests/ that you can run directly:

# Example: run a single operator test
python3 op_tests/test_mha.py
python3 op_tests/test_mla.py
python3 op_tests/test_moe.py
python3 op_tests/test_gemm_a8w8.py
python3 op_tests/test_rmsnorm2d.py

# See all available operator tests
ls op_tests/test_*.py

Installation

git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop

If you forgot the --recursive flag when cloning, run the following from inside the aiter directory:

git submodule sync && git submodule update --init --recursive

FlyDSL (Optional)

AITER's FusedMoE supports FlyDSL-based kernels for mixed-precision MoE (e.g., A4W4). FlyDSL is optional — when it is not installed, AITER automatically falls back to CK kernels.

pip install --pre flydsl

Or install all optional dependencies at once:

pip install -r requirements.txt
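The fallback behavior described above follows the standard optional-dependency pattern in Python. A minimal sketch of that pattern (the function name and backend strings here are illustrative, not AITER's actual internals):

```python
# Illustrative sketch of an optional-dependency fallback.
# This is NOT AITER's actual dispatch code; it only shows the pattern.

def select_moe_backend():
    """Prefer FlyDSL kernels when the package is importable, else fall back to CK."""
    try:
        import flydsl  # noqa: F401  (optional dependency)
        return "flydsl"
    except ImportError:
        return "ck"

print(select_moe_backend())
```

Probing the import once at dispatch time keeps FlyDSL strictly optional: environments without it pay no cost and silently use the CK path.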

Opus — Lightweight C++ Template for Kernel Development

Opus is a single-header C++ template library (opus.hpp) for writing HIP kernels on AMD GPUs. It provides vectorized load/store, layout abstractions, and MFMA wrappers, with a strong focus on build-time optimization (up to 61x faster than standard torch extension builds). See the Opus README and op_tests/opus/ for details.

Triton-based Communication (Iris)

AITER supports GPU-initiated communication using the Iris library. This enables high-performance Triton-based communication primitives like reduce-scatter and all-gather.

pip install -e .
pip install -r requirements-triton-comms.txt

For more details, see docs/triton_comms.md.
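To make the semantics of these two collectives concrete, here is a pure-Python illustration (this sketches only the math of the primitives, not the Iris API; the function names are hypothetical):

```python
# Pure-Python illustration of reduce-scatter and all-gather semantics.
# Not the Iris API: Iris runs these as GPU-initiated Triton kernels.

def reduce_scatter(shards_per_rank):
    """Elementwise-sum each chunk across ranks; rank r keeps chunk r.

    shards_per_rank[r][c] is rank r's local data for chunk c.
    Returns result[r], the reduced chunk owned by rank r."""
    n = len(shards_per_rank)
    return [
        [sum(shards_per_rank[r][c][i] for r in range(n))
         for i in range(len(shards_per_rank[0][c]))]
        for c in range(n)
    ]

def all_gather(chunk_per_rank):
    """Every rank receives the concatenation of all ranks' chunks."""
    full = [x for chunk in chunk_per_rank for x in chunk]
    return [full[:] for _ in chunk_per_rank]

# Reduce-scatter followed by all-gather composes into an all-reduce:
data = [[[1, 2], [3, 4]],   # rank 0's chunks
        [[5, 6], [7, 8]]]   # rank 1's chunks
reduced = reduce_scatter(data)   # [[6, 8], [10, 12]]
print(all_gather(reduced))       # each rank ends with [6, 8, 10, 12]
```

The composition at the end is why these two primitives matter together: ring all-reduce implementations are typically built as exactly this reduce-scatter plus all-gather sequence.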
