use ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM to enable non shuffle triton gemm#1031

Merged
zhuyuhua-v merged 3 commits into main from yuhua/fp4-triton-gemm
Jun 3, 2026

Conversation

@zhuyuhua-v (Collaborator) commented Jun 2, 2026

Motivation

As shown below, ATOM uses the preshuffled ASM GEMM as the default path for FP4 GEMM, while mori-sglang uses a non-shuffle GEMM for better performance. In addition, the Triton preshuffle FP4 GEMM shows worse performance:
image

image

Hence, this PR adds a flag, `ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM=1`, to control whether the Triton non-shuffle FP4 GEMM is used, in order to improve performance in decode (small batch size) cases.

Test Plan

benchmark + gsm8k

Test Result

Benchmark test: https://github.com/ROCm/ATOM/actions/runs/26651782340
image

Accuracy test:
image

Submission Checklist

@zhuyuhua-v zhuyuhua-v force-pushed the yuhua/fp4-triton-gemm branch from 5375206 to c3e9ae0 Compare June 2, 2026 07:14
@github-actions bot (Contributor) left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

ruff

⚠️ [ruff] reported by reviewdog 🐶
unindent does not match any outer indentation level

@mark_trace

@zhuyuhua-v zhuyuhua-v force-pushed the yuhua/fp4-triton-gemm branch from c3e9ae0 to 5603464 Compare June 2, 2026 07:16
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
@zhuyuhua-v zhuyuhua-v force-pushed the yuhua/fp4-triton-gemm branch from 5603464 to 6d85564 Compare June 2, 2026 07:26
@zhuyuhua-v changed the title from "add ATOM_USE_FP4_TRITON_GEMM=1 for 1/1024 cases" to "use ATOM_USE_FP4_TRITON_GEMM to enable non shuffle triton gemm" Jun 2, 2026
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
XiaobingSuper previously approved these changes Jun 2, 2026
@zhuyuhua-v zhuyuhua-v marked this pull request as ready for review June 2, 2026 08:46
Copilot AI review requested due to automatic review settings June 2, 2026 08:46
Copilot AI (Contributor) left a comment

Pull request overview

Adds an environment-variable-controlled switch to use AITER’s non-shuffled Triton FP4 GEMM path (intended to improve small-batch/decode performance) and wires it into the FP4 GEMM dispatch and weight post-processing logic.

Changes:

  • Document a new env var to enable non-shuffle Triton FP4 GEMM behavior.
  • Add env var plumbing in atom.utils.envs.
  • Update FP4 GEMM selection logic and weight/scale shuffle handling in atom/model_ops/linear.py.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/environment_variables.md | Documents the new env var and its precedence relative to ATOM_USE_TRITON_GEMM. |
| atom/utils/envs.py | Adds the new env var accessor (with suggested aliasing to match PR text). |
| atom/model_ops/linear.py | Adds non-shuffle Triton FP4 GEMM dispatch and adjusts shuffling behavior for FP4 weights/scales. |

Comment thread atom/utils/envs.py
Comment on lines +131 to +133
    "ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM": lambda: (
        os.getenv("ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM", "0") == "1"
    ),
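
The accessor above only treats the exact string `1` as enabled and reads the environment when the lambda is called, not when the dictionary is built. A minimal sketch of that behavior, assuming a registry dict of lambdas; the names `ENV_VARS` and `read_flag` are hypothetical and not taken from `atom/utils/envs.py`:

```python
import os

# Hypothetical registry mirroring the reviewed snippet: each entry is a
# zero-argument callable, so the environment is read at call time.
ENV_VARS = {
    "ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM": lambda: (
        os.getenv("ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM", "0") == "1"
    ),
}


def read_flag(name: str) -> bool:
    """Evaluate the registered accessor for `name` (hypothetical helper)."""
    return ENV_VARS[name]()
```

With this parsing, values such as `true` or `yes` leave the flag disabled; only `ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM=1` turns it on.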
Comment thread atom/model_ops/linear.py Outdated
Comment thread atom/model_ops/linear.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 2, 2026 09:36
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines 39 to +40
| **ATOM_USE_TRITON_GEMM** | bool | 0 (false) | If set to `1`, use AITER Triton FP4 weight preshuffled GEMM. Otherwise use AITER ASM FP4 weight preshuffled GEMM. |
| **ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM** | bool | 0 (false) | If set to `1`, use AITER Triton FP4 GEMM with non-shuffled weights. Takes precedence over the FP4 preshuffled GEMM path selected by `ATOM_USE_TRITON_GEMM`. |
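
The precedence described in these two rows can be sketched as a small selection function. This is an illustration only: the function name and the returned labels are hypothetical, and it assumes both flags are parsed as "enabled only when equal to `1`". It is not the dispatch code in `atom/model_ops/linear.py`.

```python
from typing import Mapping


def select_fp4_gemm_path(env: Mapping[str, str]) -> str:
    """Return a label for the FP4 GEMM path implied by the two flags."""

    def enabled(name: str) -> bool:
        # Assumption: a flag is on only when its value is exactly "1".
        return env.get(name, "0") == "1"

    # The non-shuffle flag is checked first, so it wins when both are set.
    if enabled("ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM"):
        return "triton_non_shuffle"
    if enabled("ATOM_USE_TRITON_GEMM"):
        return "triton_preshuffle"
    # Default documented in the table: ASM preshuffled GEMM.
    return "asm_preshuffle"
```

Passing a plain mapping rather than reading `os.environ` directly keeps the sketch easy to check against each combination of the two flags.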
Comment thread atom/utils/envs.py
Comment on lines +131 to +133
    "ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM": lambda: (
        os.getenv("ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM", "0") == "1"
    ),
Comment thread atom/model_ops/linear.py
Comment on lines +45 to +56
def use_fp4_non_shuffle_triton_gemm() -> bool:
    return envs.ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM


if use_fp4_non_shuffle_triton_gemm():
    try:
        from aiter.ops.triton.gemm_afp4wfp4 import gemm_afp4wfp4  # noqa: E402
    except ImportError as e:
        logger.warning(f"Triton FP4 GEMM not available: {e}")
        gemm_afp4wfp4 = None
else:
    gemm_afp4wfp4 = None
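
Because the snippet above leaves `gemm_afp4wfp4` set to `None` when the flag is off or the import fails, any call site has to check for `None` before using it. A minimal, library-agnostic sketch of that guarded-import pattern follows; `optional_import` and `pick_kernel` are hypothetical helpers for illustration and do not appear in the PR:

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_import(module_name: str, attr: str):
    """Return `module_name.attr`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError) as e:
        logger.warning("%s.%s not available: %s", module_name, attr, e)
        return None


def pick_kernel(flag_enabled: bool, optional_kernel, fallback_kernel):
    """Use the optional kernel only if the flag is set and it was imported."""
    if flag_enabled and optional_kernel is not None:
        return optional_kernel
    return fallback_kernel
```

Under this sketch, enabling the flag on a system where the import fails falls back to the default kernel with a warning instead of raising at call time. Whether the merged code falls back the same way is not shown in the visible diff.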
@zhuyuhua-v changed the title from "use ATOM_USE_FP4_TRITON_GEMM to enable non shuffle triton gemm" to "use ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM to enable non shuffle triton gemm" Jun 3, 2026
@zhuyuhua-v zhuyuhua-v merged commit c42ab44 into main Jun 3, 2026
72 of 88 checks passed
@zhuyuhua-v zhuyuhua-v deleted the yuhua/fp4-triton-gemm branch June 3, 2026 13:11
5 participants