1 change: 1 addition & 0 deletions .gitignore
@@ -56,4 +56,5 @@ artifacts/
**/times.csv
transformer_engine/build_info.txt
transformer_engine/common/util/hip_nvml.*
.asv/
*.DS_Store
168 changes: 168 additions & 0 deletions benchmarks/asv/README.md
@@ -0,0 +1,168 @@
# Benchmarks for TransformerEngine

GPU microbenchmarks driven by `driver.py`. Results are written in
[ASV (Air Speed Velocity)](https://asv.readthedocs.io/) JSON format so they
can be browsed with `asv publish` / `asv preview`, but the `asv` CLI is **not**
used to run benchmarks — `driver.py` runs everything in-process.

## Prerequisites

- TransformerEngine must already be built and installed in the current Python environment.
- A ROCm or CUDA GPU must be available.
- `asv` is only required if you want the HTML dashboard (`pip install asv`).

## Running benchmarks

Each `bench_*.py` file is directly executable, or you can run it through
`driver.py`. Results are saved in ASV-compatible format to
`benchmarks/.asv/results/` by default.

```bash
cd benchmarks/asv
python driver.py --all # run every suite
python driver.py bench_gemm # run one suite via driver
python bench_gemm.py # run one suite directly
python bench_gemm.py time_forward # filter to a specific method
python bench_gemm.py -w 5 -n 20 # custom warmup/iteration counts
python bench_casting.py --no-save # skip saving results
python bench_casting.py --cold-cache # flush cache before each sample
python bench_gemm.py --inner 50 # fix inner-loop count to 50
python bench_gemm.py --target-window-ms 5 # tune inner so each window >=5 ms
```

### Timing model: inner loop and cache state

Each `time_*` method runs the kernel `_inner` times inside a single CUDA event
window and divides the elapsed time by `_inner`, amortizing kernel-launch
overhead and CUDA-event timing granularity (~0.5 µs on AMD). By default the
driver **auto-tunes** `_inner` per (combo, method) so each window lasts at
least `--target-window-ms` (default `1.0 ms`):

| Flag | Effect |
|---|---|
| `--inner auto` (default) | Probe a single invocation, then pick `_inner` so the next timed window lasts ≥ `--target-window-ms`. Capped at 10000. |
| `--inner N` | Force a fixed `_inner = N` (overrides auto-tune). |
| `--target-window-ms T` | Target window duration for `--inner auto` (default `1.0`). |
| `--cold-cache` | Write a `--cache-flush-mb` byte scratch buffer before each sample to evict L2 + Infinity Cache. Implies `--inner=1` (otherwise iterations 2..N would refill the cache and the measurement degenerates back to warm-cache). |
| `--cache-flush-mb M` | Scratch buffer size for `--cold-cache` (default `256`, sized for the MI300 Infinity Cache). |
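
For intuition, the auto-tune rule reduces to roughly the following sketch.
The names here (`probe_ms`, `pick_inner`) are illustrative, not `driver.py`'s
actual API:

```python
import math

def pick_inner(probe_ms: float, target_window_ms: float = 1.0, cap: int = 10000) -> int:
    """Pick how many kernel invocations to place in one CUDA event window."""
    if probe_ms <= 0:
        # Probe landed below timer resolution: fall back to the cap.
        return cap
    return min(cap, max(1, math.ceil(target_window_ms / probe_ms)))
```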

Choose the regime that matches the question you're asking:
- **Warm cache, large `_inner`** (default): steady-state kernel throughput,
matches what a hot inner loop in a model sees. Lowest variance.
- **Cold cache, `_inner=1`**: realistic cost of the kernel as an isolated
call into cold memory — closer to what `rocprofv3 --hip-trace` reports
on a freshly launched kernel. Higher variance; bandwidth-bound
benchmarks (cast, normalization) typically run 1.5–3× slower than warm.

Caveat: the inner loop runs in Python, so each iteration carries
~80–200 ns of interpreter overhead. For sub-microsecond kernels this is
not removable without CUDA graph capture; pick `--inner` deliberately
in that regime or use the cold-cache mode.
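
If you do need Python out of the loop, a CUDA graph can replay the captured
kernel with no per-iteration interpreter or launch cost. A minimal sketch
(not wired into `driver.py`; assumes the op is capture-safe with static
shapes):

```python
import torch

def graph_timed_seconds(module, x, inner=100):
    # Warm up on a side stream (per the PyTorch CUDA-graphs recipe),
    # then capture one invocation.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            module(x)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        module(x)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(inner):
        g.replay()  # re-launches the captured kernel, no Python dispatch
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000 / inner  # seconds per call
```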

### Helper script

`run_benchmarks.sh` wraps common tasks and can be run from anywhere.

```bash
bash benchmarks/asv/run_benchmarks.sh <command> [options]
```

| Command | Description |
|---|---|
| `run [suite] [method]` | Run benchmarks in-process (saves ASV-compatible results) |
| `view` | Build the ASV HTML dashboard from saved results and serve it on `localhost:8080` |
| `list` | List available benchmark suites |
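
For example, using the suite and method names from the section above:

```bash
bash benchmarks/asv/run_benchmarks.sh run bench_gemm time_forward
bash benchmarks/asv/run_benchmarks.sh view   # then open localhost:8080
```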

## How results are stored

ASV-format JSON files under `benchmarks/.asv/results/`:

```
benchmarks/.asv/results/
my-machine-name/
machine.json # Hardware/OS metadata (auto-generated by driver)
<commit-hash>.json # Timing results for that commit
<commit-hash>.json
...
```

Each commit JSON contains the per-call timings (CUDA-event elapsed seconds) for every
benchmark + parameter combination run on that machine. The `benchmarks/.asv/` directory
is in `.gitignore`.
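
For reference, `machine.json` in upstream ASV looks roughly like this; the
driver's auto-generated file may carry different keys, so treat it as
illustrative:

```
{
    "arch": "x86_64",
    "cpu": "AMD EPYC 9654",
    "machine": "my-machine-name",
    "os": "Linux 5.15",
    "ram": "1056174088",
    "version": 1
}
```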

## Viewing results

To browse historical results in a dashboard, point `asv` at the saved JSON:

```bash
bash benchmarks/asv/run_benchmarks.sh view
# or, manually:
asv publish --config benchmarks/asv/asv.conf.json
asv preview --config benchmarks/asv/asv.conf.json
```

`asv.conf.json` exists only to support `publish` / `preview`; benchmarks
themselves are not invoked through `asv`.

## Writing new benchmarks

Create a new file in `benchmarks/asv/` following the naming convention `bench_<name>.py`.

```python
#!/usr/bin/env python3
import torch
import transformer_engine.pytorch as te

class BenchSomething:
params = [[1024, 4096], ["config_a", "config_b"]]
param_names = ["M", "config"]
timeout = 300 # seconds, per parameter combination

# Driver overrides per (combo, method): _inner controls how many kernel
# invocations land in one CUDA event window; _scratch (when not None) is
# written to before each sample to evict the GPU cache.
_inner = 1
_scratch = None

def setup(self, M, config):
# Allocate tensors, create modules.
# This runs once per (combo, method); the same instance is reused for
# warmup and timed iterations.
self._evt = [torch.cuda.Event(enable_timing=True) for _ in range(2)]
...

def time_forward(self, M, config):
# Use CUDA events for accurate GPU timing.
# Return elapsed seconds per single invocation — the driver uses this
# instead of wall time. Looping inside the event window amortizes
# CUDA event resolution and kernel-launch overhead.
if self._scratch is not None:
self._scratch.fill_(1.0) # cold-cache mode
self._evt[0].record()
for _ in range(self._inner):
self.module(self.x)
self._evt[1].record()
torch.cuda.synchronize()
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

# Optional: define work_<name> to get throughput columns (TFLOPS / GB/s).
def work_forward(self, M, config):
return {"flops": 2 * M * self.N * self.K} # compute-bound
# return {"bytes": M * self.hidden * 4} # memory-bound

if __name__ == "__main__":
from driver import run_as_main
run_as_main(__file__)
```

Key rules:
- Method names starting with `time_` are automatically timed.
- Use CUDA events and return elapsed seconds **per single invocation** —
divide the event delta by `self._inner` so the driver and the throughput
columns get per-call values regardless of inner-loop count.
- Honor `self._inner` (loop the kernel) and `self._scratch` (write before
recording the start event) so the driver's `--inner` and `--cold-cache`
flags work for your benchmark.
- Optionally define `work_<name>` companions to get TFLOPS or GB/s columns.
  These return the per-call work, not per-window work (see the sketch after
  this list).
- Clear `.grad` attributes in backward benchmarks so gradients do not
  accumulate across iterations.
- The `params` list defines a cross-product; keep the matrix size reasonable.
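
To see why per-call values matter, here is roughly how a per-call time and a
`work_*` dict combine into a throughput column (a sketch of the idea, not
`driver.py`'s actual code):

```python
def throughput_column(elapsed_s: float, work: dict) -> str:
    # Both inputs are per-call, so the inner-loop count cancels out.
    if "flops" in work:
        return f"{work['flops'] / elapsed_s / 1e12:.2f} TFLOPS"
    if "bytes" in work:
        return f"{work['bytes'] / elapsed_s / 1e9:.2f} GB/s"
    return ""
```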
Empty file added benchmarks/asv/__init__.py
16 changes: 16 additions & 0 deletions benchmarks/asv/asv.conf.json
@@ -0,0 +1,16 @@
{
"version": 1,
"project": "TransformerEngine",
"project_url": "https://github.com/ROCm/TransformerEngine",
"repo": "../..",
"branches": ["HEAD"],
"environment_type": "existing",
"install_command": [],
"build_command": [],
"benchmark_dir": ".",
"results_dir": "../.asv/results",
"html_dir": "../.asv/html",
"install_timeout": 600,
"benchmark_timeout": 1200,
"launch_method": "spawn"
}
102 changes: 102 additions & 0 deletions benchmarks/asv/bench_attention.py
@@ -0,0 +1,102 @@
#!/usr/bin/env python3
> **Reviewer comment (Contributor):** Instead of creating a new attention microbenchmark, should we use the attention microbenchmark(s) already part of TE (in https://github.com/ROCm/TransformerEngine/tree/dev/benchmarks/attention)?

###############################################################################
# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################
"""
Attention micro-benchmark using te.DotProductAttention.

Benchmarks fused multi-head attention (with flash attention backend) for
model configurations with grouped-query attention (GQA).

Models:
- Llama 3 8B (TP=1, TP=8), 70B (TP=8), 405B (TP=8)
- Qwen 2.5 7B (TP=1), 72B (TP=8)

Forward FLOPs = 4 * batch * num_q_heads * seq_len^2 * head_dim
(two matmuls: Q@K^T and attn@V, each contributing 2*b*h*s^2*d)
Backward FLOPs = 2 * Forward FLOPs (approximately)
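
For example, Llama3-8B_TP1 (32 query heads, head_dim 128) at seq_len=4096,
batch=2: 4 * 2 * 32 * 4096^2 * 128 = 2^39 ≈ 0.55 TFLOP per forward call.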

Sources for model configs:
https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-70B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-405B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/config.json
"""

import torch
import transformer_engine.pytorch as te

BATCH = 2

# (num_q_heads, num_kv_heads, head_dim, tp)
MODELS = {
"Llama3-8B_TP1": (32, 8, 128, 1),
"Llama3-8B_TP8": (32, 8, 128, 8),
"Llama3-70B_TP8": (64, 8, 128, 8),
"Llama3-405B_TP8": (128, 8, 128, 8),
"Qwen2.5-7B_TP1": (28, 4, 128, 1),
"Qwen2.5-72B_TP8": (64, 8, 128, 8),
}


class BenchAttention:
params = [[1024, 2048, 4096, 8192], list(MODELS)]
param_names = ["seq_len", "model"]
timeout = 300
_inner = 1
_scratch = None

def setup(self, seq_len, model):
n_q, n_kv, hd, tp = MODELS[model]
qh, kvh = n_q // tp, n_kv // tp
dtype = torch.bfloat16

self.attn = te.DotProductAttention(
num_attention_heads=qh, kv_channels=hd,
num_gqa_groups=kvh, attn_mask_type="causal",
).to(device="cuda", dtype=dtype)

self.q = torch.randn(seq_len, BATCH, qh, hd, dtype=dtype, device="cuda", requires_grad=True)
self.k = torch.randn(seq_len, BATCH, kvh, hd, dtype=dtype, device="cuda", requires_grad=True)
self.v = torch.randn(seq_len, BATCH, kvh, hd, dtype=dtype, device="cuda", requires_grad=True)
self.grad_out = torch.randn_like(self.attn(self.q, self.k, self.v))
self._evt = [torch.cuda.Event(enable_timing=True) for _ in range(2)]

def work_forward(self, seq_len, model):
n_q, n_kv, hd, tp = MODELS[model]
qh = n_q // tp
return {"flops": 4 * BATCH * qh * seq_len * seq_len * hd}

def work_forward_backward(self, seq_len, model):
n_q, n_kv, hd, tp = MODELS[model]
qh = n_q // tp
return {"flops": 3 * 4 * BATCH * qh * seq_len * seq_len * hd}

def time_forward(self, seq_len, model):
if self._scratch is not None:
self._scratch.fill_(1.0)
self._evt[0].record()
for _ in range(self._inner):
self.attn(self.q, self.k, self.v)
self._evt[1].record()
torch.cuda.synchronize()
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

def time_forward_backward(self, seq_len, model):
if self._scratch is not None:
self._scratch.fill_(1.0)
self._evt[0].record()
for _ in range(self._inner):
out = self.attn(self.q, self.k, self.v)
out.backward(self.grad_out)
self._evt[1].record()
torch.cuda.synchronize()
self.q.grad = self.k.grad = self.v.grad = None
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

if __name__ == "__main__":
from driver import run_as_main
run_as_main(__file__)
100 changes: 100 additions & 0 deletions benchmarks/asv/bench_casting.py
@@ -0,0 +1,100 @@
#!/usr/bin/env python3
###############################################################################
# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################
"""
Benchmarks quantization (BF16 -> FP8) and dequantization (FP8 -> BF16) for
both E4M3 (activations/weights) and E5M2 (gradients) formats.

Shapes are (M, hidden_size) matching the activation tensors from models:
- Llama 3.1 8B, 70B, 405B
- Qwen 2.5 7B, 72B

These casts are memory-bound; we report GB/s (input + output bytes).
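For example, M=8192 with hidden=8192 moves 8192 * 8192 * 3 ≈ 0.2 GB per cast
(2 bytes read + 1 byte written for quantize, or the reverse for dequantize).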

Sources for model configs:
https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-70B/blob/main/config.json
https://huggingface.co/meta-llama/Llama-3.1-405B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/config.json
"""

import torch
from transformer_engine.pytorch import Float8CurrentScalingQuantizer
from transformer_engine_torch import DType as TE_DType

HIDDEN_SIZES = {
"Llama3-8B": 4096,
"Llama3-70B": 8192,
"Llama3-405B": 16384,
"Qwen2.5-7B": 3584,
"Qwen2.5-72B": 8192,
}

CAST_CONFIGS = {
"BF16_to_E4M3": ("quantize", TE_DType.kFloat8E4M3),
"E4M3_to_BF16": ("dequantize", TE_DType.kFloat8E4M3),
"BF16_to_E5M2": ("quantize", TE_DType.kFloat8E5M2),
"E5M2_to_BF16": ("dequantize", TE_DType.kFloat8E5M2),
}


class BenchCasting:
params = [[1024, 2048, 4096, 8192], list(HIDDEN_SIZES), list(CAST_CONFIGS)]
param_names = ["M", "model", "cast"]
timeout = 120
# Driver overrides these per (combo, method): _inner is the number of
# kernel invocations per CUDA event window (amortizes launch overhead);
# _scratch, when not None, is fill_()ed before each sample to evict the
# GPU cache.
_inner = 1
_scratch = None

def setup(self, M, model, cast):
hidden = HIDDEN_SIZES[model]
direction, fp8_dtype = CAST_CONFIGS[cast]
self.direction = direction
quantizer = Float8CurrentScalingQuantizer(
fp8_dtype=fp8_dtype,
device=torch.device("cuda"),
rowwise=True,
columnwise=False,
)
if direction == "dequantize":
bf16_tensor = torch.randn(M, hidden, dtype=torch.bfloat16, device="cuda")
self.x = quantizer.quantize(bf16_tensor)
else:
self.x = torch.randn(M, hidden, dtype=torch.bfloat16, device="cuda")
self.quantizer = quantizer
self._evt = [torch.cuda.Event(enable_timing=True) for _ in range(2)]

def work_cast(self, M, model, cast):
hidden = HIDDEN_SIZES[model]
direction = CAST_CONFIGS[cast][0]
if direction == "quantize":
# Read BF16 (2B) + write FP8 (1B) + write scale
return {"bytes": M * hidden * 3}
else:
# Read FP8 (1B) + read scale + write BF16 (2B)
return {"bytes": M * hidden * 3}

def time_cast(self, M, model, cast):
if self._scratch is not None:
self._scratch.fill_(1.0)
self._evt[0].record()
if self.direction == "quantize":
for _ in range(self._inner):
self.quantizer.quantize(self.x)
else:
for _ in range(self._inner):
self.x.dequantize(dtype=torch.bfloat16)
self._evt[1].record()
torch.cuda.synchronize()
return self._evt[0].elapsed_time(self._evt[1]) / 1000 / self._inner

if __name__ == "__main__":
from driver import run_as_main
run_as_main(__file__)