1 change: 1 addition & 0 deletions .gitignore
@@ -56,4 +56,5 @@ artifacts/
**/times.csv
transformer_engine/build_info.txt
transformer_engine/common/util/hip_nvml.*
.asv/
*.DS_Store
168 changes: 168 additions & 0 deletions benchmarks/asv/README.md
@@ -0,0 +1,168 @@
# Benchmarks for TransformerEngine

GPU microbenchmarks driven by `driver.py`. Results are written in
[ASV (Air Speed Velocity)](https://asv.readthedocs.io/) JSON format so they
can be browsed with `asv publish` / `asv preview`, but the `asv` CLI is **not**
used to run benchmarks — `driver.py` runs everything in-process.

## Prerequisites

- TransformerEngine must already be built and installed in the current Python environment.
- A ROCm or CUDA GPU must be available.
- `asv` is only required if you want the HTML dashboard (`pip install asv`).

## Running benchmarks

Each `bench_*.py` file is directly executable, or you can run it through
`driver.py`. Results are saved in ASV-compatible format to
`benchmarks/.asv/results/` by default.

```bash
cd benchmarks/asv
python driver.py --all # run every suite
python driver.py bench_gemm # run one suite via driver
python bench_gemm.py # run one suite directly
python bench_gemm.py time_forward # filter to a specific method
python bench_gemm.py -w 5 -n 20 # custom warmup/iteration counts
python bench_casting.py --no-save # skip saving results
python bench_casting.py --cold-cache # flush cache before each sample
python bench_gemm.py --inner 50 # fix inner-loop count to 50
python bench_gemm.py --target-window-ms 5 # tune inner so each window >=5 ms
```

### Timing model: inner loop and cache state

Each `time_*` method runs the kernel `_inner` times inside a single CUDA event
window and divides the elapsed time by `_inner`, amortizing kernel-launch
overhead and CUDA-event timing granularity (~0.5 µs on AMD). By default the
driver **auto-tunes** `_inner` per (combo, method) so each window lasts at
least `--target-window-ms` (default `1.0 ms`):

| Flag | Effect |
|---|---|
| `--inner auto` (default) | Probe a single invocation, then pick `_inner` so the next timed window lasts ≥ `--target-window-ms`. Capped at 10000. |
| `--inner N` | Force a fixed `_inner = N` (overrides auto-tune). |
| `--target-window-ms T` | Target window duration for `--inner auto` (default `1.0`). |
| `--cold-cache` | Write a `--cache-flush-mb` byte scratch buffer before each sample to evict L2 + Infinity Cache. Implies `--inner=1` (otherwise iterations 2..N would refill the cache and the measurement degenerates back to warm-cache). |
| `--cache-flush-mb M` | Scratch buffer size for `--cold-cache` (default `256`, sized for the MI300 Infinity Cache). |
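
For intuition, the auto-tune rule reduces to roughly the following sketch.
The names here (`probe_ms`, `pick_inner`) are illustrative, not `driver.py`'s
actual API:

```python
import math

def pick_inner(probe_ms: float, target_window_ms: float = 1.0, cap: int = 10000) -> int:
    """Pick how many kernel invocations to place in one CUDA event window."""
    if probe_ms <= 0:
        # Probe landed below timer resolution: fall back to the cap.
        return cap
    return min(cap, max(1, math.ceil(target_window_ms / probe_ms)))
```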

Choose the regime that matches the question you're asking:
- **Warm cache, large `_inner`** (default): steady-state kernel throughput,
matches what a hot inner loop in a model sees. Lowest variance.
- **Cold cache, `_inner=1`**: realistic cost of the kernel as an isolated
call into cold memory — closer to what `rocprofv3 --hip-trace` reports
on a freshly launched kernel. Higher variance; bandwidth-bound
benchmarks (cast, normalization) typically run 1.5–3× slower than warm.

Caveat: the inner loop runs in Python, so each iteration carries
~80–200 ns of interpreter overhead. For sub-microsecond kernels this is
not removable without CUDA graph capture; pick `--inner` deliberately
in that regime or use the cold-cache mode.
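
If you do need Python out of the loop, a CUDA graph can replay the captured
kernel with no per-iteration interpreter or launch cost. A minimal sketch
(not wired into `driver.py`; assumes the op is capture-safe with static
shapes):

```python
import torch

def graph_timed_seconds(module, x, inner=100):
    # Warm up on a side stream (per the PyTorch CUDA-graphs recipe),
    # then capture one invocation.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            module(x)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        module(x)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(inner):
        g.replay()  # re-launches the captured kernel, no Python dispatch
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000 / inner  # seconds per call
```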

### Helper script

`run_benchmarks.sh` wraps common tasks and can be run from anywhere.

```bash
bash benchmarks/asv/run_benchmarks.sh <command> [options]
```

| Command | Description |
|---|---|
| `run [suite] [method]` | Run benchmarks in-process (saves ASV-compatible results) |
| `view` | Build the ASV HTML dashboard from saved results and serve it on `localhost:8080` |
| `list` | List available benchmark suites |
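
For example, using the suite and method names from the section above:

```bash
bash benchmarks/asv/run_benchmarks.sh run bench_gemm time_forward
bash benchmarks/asv/run_benchmarks.sh view   # then open localhost:8080
```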

## How results are stored

ASV-format JSON files under `benchmarks/.asv/results/`:

```
benchmarks/.asv/results/
my-machine-name/
machine.json # Hardware/OS metadata (auto-generated by driver)
<commit-hash>.json # Timing results for that commit
<commit-hash>.json
...
```

Each commit JSON contains the per-call timings (CUDA-event elapsed seconds) for every
benchmark + parameter combination run on that machine. The `benchmarks/.asv/` directory
is in `.gitignore`.
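
For reference, `machine.json` in upstream ASV looks roughly like this; the
driver's auto-generated file may carry different keys, so treat it as
illustrative:

```
{
    "arch": "x86_64",
    "cpu": "AMD EPYC 9654",
    "machine": "my-machine-name",
    "os": "Linux 5.15",
    "ram": "1056174088",
    "version": 1
}
```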

## Viewing results

To browse historical results in a dashboard, point `asv` at the saved JSON:

```bash
bash benchmarks/asv/run_benchmarks.sh view
# or, manually:
asv publish --config benchmarks/asv/asv.conf.json
asv preview --config benchmarks/asv/asv.conf.json
```

`asv.conf.json` exists only to support `publish` / `preview`; benchmarks
themselves are not invoked through `asv`.

## Writing new benchmarks

Create a new file in `benchmarks/asv/` following the naming convention `bench_<name>.py`.

```python
#!/usr/bin/env python3
import torch
import transformer_engine.pytorch as te

class BenchSomething:
params = [[1024, 4096], ["config_a", "config_b"]]
param_names = ["M", "config"]
timeout = 300 # seconds, per parameter combination

# Driver overrides per (combo, method): _inner controls how many kernel
# invocations land in one CUDA event window; _scratch (when not None) is
# written to before each sample to evict the GPU cache.
_inner = 1
_scratch = None

def setup(self, M, config):
# Allocate tensors, create modules.
# This runs once per (combo, method); the same instance is reused for
# warmup and timed iterations.
self._evt = [torch.cuda.Event(enable_timing=True) for _ in range(2)]
...

def time_forward(self, M, config):
# Use CUDA events for accurate GPU timing.
# Return elapsed seconds per single invocation — the driver uses this
# instead of wall time. Looping inside the event window amortizes
# CUDA event resolution and kernel-launch overhead.
if self._scratch is not None:
self._scratch.fill_(1.0) # cold-cache mode
self._evt[0].record()
for _ in range(self._inner):
self.module(self.x)
self._evt[1].record()
torch.cuda.synchronize()
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

# Optional: define work_<name> to get throughput columns (TFLOPS / GB/s).
def work_forward(self, M, config):
return {"flops": 2 * M * self.N * self.K} # compute-bound
# return {"bytes": M * self.hidden * 4} # memory-bound

if __name__ == "__main__":
from driver import run_as_main
run_as_main(__file__)
```

Key rules:
- Method names starting with `time_` are automatically timed.
- Use CUDA events and return elapsed seconds **per single invocation** —
divide the event delta by `self._inner` so the driver and the throughput
columns get per-call values regardless of inner-loop count.
- Honor `self._inner` (loop the kernel) and `self._scratch` (write before
recording the start event) so the driver's `--inner` and `--cold-cache`
flags work for your benchmark.
- Optionally define `work_<name>` companions to get TFLOPS or GB/s columns.
  These return the per-call work, not per-window work (see the sketch after
  this list).
- Clear `.grad` attributes in backward benchmarks so gradients do not
  accumulate across iterations.
- The `params` list defines a cross-product; keep the matrix size reasonable.
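
To see why per-call values matter, here is roughly how a per-call time and a
`work_*` dict combine into a throughput column (a sketch of the idea, not
`driver.py`'s actual code):

```python
def throughput_column(elapsed_s: float, work: dict) -> str:
    # Both inputs are per-call, so the inner-loop count cancels out.
    if "flops" in work:
        return f"{work['flops'] / elapsed_s / 1e12:.2f} TFLOPS"
    if "bytes" in work:
        return f"{work['bytes'] / elapsed_s / 1e9:.2f} GB/s"
    return ""
```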
Empty file added benchmarks/asv/__init__.py
16 changes: 16 additions & 0 deletions benchmarks/asv/asv.conf.json
@@ -0,0 +1,16 @@
{
"version": 1,
"project": "TransformerEngine",
"project_url": "https://github.com/ROCm/TransformerEngine",
"repo": "../..",
"branches": ["HEAD"],
"environment_type": "existing",
"install_command": [],
"build_command": [],
"benchmark_dir": ".",
"results_dir": "../.asv/results",
"html_dir": "../.asv/html",
"install_timeout": 600,
"benchmark_timeout": 1200,
"launch_method": "spawn"
}
102 changes: 102 additions & 0 deletions benchmarks/asv/bench_attention.py
@@ -0,0 +1,102 @@
#!/usr/bin/env python3
> **Reviewer comment (Contributor):** Instead of creating a new attention microbenchmark, should we use the attention microbenchmark(s) already part of TE (in https://github.com/ROCm/TransformerEngine/tree/dev/benchmarks/attention)?

###############################################################################
# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################
"""
Attention micro-benchmark using te.DotProductAttention.

Benchmarks fused multi-head attention (with flash attention backend) for
model configurations with grouped-query attention (GQA).

Models:
- Llama 3 8B (TP=1, TP=8), 70B (TP=8), 405B (TP=8)
- Qwen 2.5 7B (TP=1), 72B (TP=8)

Forward FLOPs = 4 * batch * num_q_heads * seq_len^2 * head_dim
(two matmuls: Q@K^T and attn@V, each contributing 2*b*h*s^2*d)
Backward FLOPs = 2 * Forward FLOPs (approximately)
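
For example, Llama3-8B_TP1 (32 query heads, head_dim 128) at seq_len=4096,
batch=2: 4 * 2 * 32 * 4096^2 * 128 = 2^39 ≈ 0.55 TFLOP per forward call.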

Sources for model configs:
https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-70B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-405B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/config.json
"""

import torch
import transformer_engine.pytorch as te

BATCH = 2

# (num_q_heads, num_kv_heads, head_dim, tp)
MODELS = {
"Llama3-8B_TP1": (32, 8, 128, 1),
"Llama3-8B_TP8": (32, 8, 128, 8),
"Llama3-70B_TP8": (64, 8, 128, 8),
"Llama3-405B_TP8": (128, 8, 128, 8),
"Qwen2.5-7B_TP1": (28, 4, 128, 1),
"Qwen2.5-72B_TP8": (64, 8, 128, 8),
}


class BenchAttention:
params = [[1024, 2048, 4096, 8192], list(MODELS)]
param_names = ["seq_len", "model"]
timeout = 300
_inner = 1
_scratch = None

def setup(self, seq_len, model):
n_q, n_kv, hd, tp = MODELS[model]
qh, kvh = n_q // tp, n_kv // tp
dtype = torch.bfloat16

self.attn = te.DotProductAttention(
num_attention_heads=qh, kv_channels=hd,
num_gqa_groups=kvh, attn_mask_type="causal",
).to(device="cuda", dtype=dtype)

self.q = torch.randn(seq_len, BATCH, qh, hd, dtype=dtype, device="cuda", requires_grad=True)
self.k = torch.randn(seq_len, BATCH, kvh, hd, dtype=dtype, device="cuda", requires_grad=True)
self.v = torch.randn(seq_len, BATCH, kvh, hd, dtype=dtype, device="cuda", requires_grad=True)
self.grad_out = torch.randn_like(self.attn(self.q, self.k, self.v))
self._evt = [torch.cuda.Event(enable_timing=True) for _ in range(2)]

def work_forward(self, seq_len, model):
n_q, n_kv, hd, tp = MODELS[model]
qh = n_q // tp
return {"flops": 4 * BATCH * qh * seq_len * seq_len * hd}

def work_forward_backward(self, seq_len, model):
n_q, n_kv, hd, tp = MODELS[model]
qh = n_q // tp
return {"flops": 3 * 4 * BATCH * qh * seq_len * seq_len * hd}

def time_forward(self, seq_len, model):
if self._scratch is not None:
self._scratch.fill_(1.0)
self._evt[0].record()
for _ in range(self._inner):
self.attn(self.q, self.k, self.v)
self._evt[1].record()
torch.cuda.synchronize()
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

def time_forward_backward(self, seq_len, model):
if self._scratch is not None:
self._scratch.fill_(1.0)
self._evt[0].record()
for _ in range(self._inner):
out = self.attn(self.q, self.k, self.v)
out.backward(self.grad_out)
self._evt[1].record()
torch.cuda.synchronize()
self.q.grad = self.k.grad = self.v.grad = None
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

if __name__ == "__main__":
from driver import run_as_main
run_as_main(__file__)
100 changes: 100 additions & 0 deletions benchmarks/asv/bench_casting.py
@@ -0,0 +1,100 @@
#!/usr/bin/env python3
###############################################################################
# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################
"""
Benchmarks quantization (BF16 -> FP8) and dequantization (FP8 -> BF16) for
both E4M3 (activations/weights) and E5M2 (gradients) formats.

Shapes are (M, hidden_size) matching the activation tensors from models:
- Llama 3.1 8B, 70B, 405B
- Qwen 2.5 7B, 72B

These casts are memory-bound; we report GB/s (input + output bytes).
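For example, M=8192 with hidden=8192 moves 8192 * 8192 * 3 ≈ 0.2 GB per cast
(2 bytes read + 1 byte written for quantize, or the reverse for dequantize).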

Sources for model configs:
https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-70B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-405B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/config.json
"""

import torch
from transformer_engine.pytorch import Float8CurrentScalingQuantizer
from transformer_engine_torch import DType as TE_DType

HIDDEN_SIZES = {
"Llama3-8B": 4096,
"Llama3-70B": 8192,
"Llama3-405B": 16384,
"Qwen2.5-7B": 3584,
"Qwen2.5-72B": 8192,
}

CAST_CONFIGS = {
"BF16_to_E4M3": ("quantize", TE_DType.kFloat8E4M3),
"E4M3_to_BF16": ("dequantize", TE_DType.kFloat8E4M3),
"BF16_to_E5M2": ("quantize", TE_DType.kFloat8E5M2),
"E5M2_to_BF16": ("dequantize", TE_DType.kFloat8E5M2),
}


class BenchCasting:
params = [[1024, 2048, 4096, 8192], list(HIDDEN_SIZES), list(CAST_CONFIGS)]
param_names = ["M", "model", "cast"]
timeout = 120
# Driver overrides these per (combo, method): _inner is the number of
# kernel invocations per CUDA event window (amortizes launch overhead);
# _scratch, when not None, is fill_()ed before each sample to evict the
# GPU cache.
_inner = 1
_scratch = None

def setup(self, M, model, cast):
hidden = HIDDEN_SIZES[model]
direction, fp8_dtype = CAST_CONFIGS[cast]
self.direction = direction
quantizer = Float8CurrentScalingQuantizer(
fp8_dtype=fp8_dtype,
device=torch.device("cuda"),
rowwise=True,
columnwise=False,
)
if direction == "dequantize":
bf16_tensor = torch.randn(M, hidden, dtype=torch.bfloat16, device="cuda")
self.x = quantizer.quantize(bf16_tensor)
else:
self.x = torch.randn(M, hidden, dtype=torch.bfloat16, device="cuda")
self.quantizer = quantizer
self._evt = [torch.cuda.Event(enable_timing=True) for _ in range(2)]

def work_cast(self, M, model, cast):
hidden = HIDDEN_SIZES[model]
direction = CAST_CONFIGS[cast][0]
if direction == "quantize":
# Read BF16 (2B) + write FP8 (1B) + write scale
return {"bytes": M * hidden * 3}
else:
# Read FP8 (1B) + read scale + write BF16 (2B)
return {"bytes": M * hidden * 3}

def time_cast(self, M, model, cast):
if self._scratch is not None:
self._scratch.fill_(1.0)
self._evt[0].record()
if self.direction == "quantize":
for _ in range(self._inner):
self.quantizer.quantize(self.x)
else:
for _ in range(self._inner):
self.x.dequantize(dtype=torch.bfloat16)
self._evt[1].record()
torch.cuda.synchronize()
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

if __name__ == "__main__":
from driver import run_as_main
run_as_main(__file__)