Added fuse_rms_norm lowering #4017

cehongwang · 2026-01-16T21:27:53Z

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR so that relevant reviewers are notified

apbose · 2026-01-23T20:48:26Z

py/torch_tensorrt/dynamo/lowering/passes/replace_fused_rms_norm.py

+
+    gm.graph.erase_node(node)
+
+    return x_normalized


For the replacement with the normalized output at line 80, should we also handle the op's second return value, the rstd that shows up in the test cases? In practice it is mainly used for the gradient, but handling it would keep the replacement consistent with the op signature.


zewenli98

The PR looks good overall. One thing I'm wondering about: compared with the converter-style implementation, the ops are almost the same except for impl.slice.expand and a few tensor casts. Do you know 1) which op/layer causes the perf discrepancy (60%, as you mentioned in the last meeting), and 2) whether the two approaches build engines of the same size? If we can identify the issue, we can probably optimize other ops as well.

zewenli98 · 2026-01-26T17:44:49Z

py/torch_tensorrt/dynamo/lowering/passes/replace_fused_rms_norm.py

+                # If the getitem is extracting the first element (the output tensor)
+                if not x_normalized.meta:
+                    x_normalized.meta = copy.copy(node.meta)
+                user.replace_all_uses_with(x_normalized)


As Naren previously mentioned, can you add a log here when each node is changed?
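A minimal sketch of what such per-node logging could look like when the `getitem` users of the fused node are rewired. `FakeNode` is a hypothetical stand-in for `torch.fx.Node` so the snippet runs without PyTorch, and `rewire_getitem_users` and the logger name are illustrative names, not code from this PR.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("rms_norm_lowering_sketch")  # hypothetical logger name


@dataclass
class FakeNode:
    """Hypothetical stand-in for torch.fx.Node with only the fields used here."""

    name: str
    args: tuple = ()
    users: list = field(default_factory=list)


def rewire_getitem_users(fused, x_normalized, rstd):
    """Map each getitem(fused, i) user to its replacement and log every change.

    Returns a dict of {getitem node name: replacement node name}. A real pass
    would call user.replace_all_uses_with(target) where this sketch records it.
    """
    replacements = {0: x_normalized, 1: rstd}
    rewired = {}
    for user in list(fused.users):
        index = user.args[1]
        target = replacements.get(index)
        if target is None:
            # Unknown output index: leave the user untouched but make it visible.
            logger.warning("Skipping %s: unexpected output index %r", user.name, index)
            continue
        rewired[user.name] = target.name
        logger.debug(
            "Replaced %s (output %d of %s) with %s",
            user.name, index, fused.name, target.name,
        )
    return rewired
```

Logging at debug level keeps normal compilation quiet while still leaving a trace of each rewritten node when debug logging is enabled.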

zewenli98 · 2026-01-26T17:49:31Z

py/torch_tensorrt/dynamo/lowering/passes/replace_fused_rms_norm.py

+
+    gm.graph.erase_node(node)
+
+    return x_normalized


yeah I think it should have two Tensor outputs per https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml
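For reference, the two outputs discussed in this thread can be sketched numerically. This assumes the conventional RMS norm definition (rstd = 1 / sqrt(mean(x^2) + eps), output = x * rstd * weight) and uses plain Python lists over a single 1-D row; `rms_norm_reference` is a hypothetical name, not the lowering pass itself, and the real op works on tensors.

```python
import math
from typing import List, Optional, Tuple


def rms_norm_reference(
    x: List[float],
    weight: Optional[List[float]] = None,
    eps: float = 1e-6,
) -> Tuple[List[float], float]:
    """Return (normalized output, rstd) for one row of values.

    Assumes the conventional RMS norm formula; this is an illustration,
    not the aten kernel.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    # Reciprocal of the root mean square: the second output under discussion.
    rstd = 1.0 / math.sqrt(mean_sq + eps)
    scale = weight if weight is not None else [1.0] * len(x)
    y = [v * rstd * w for v, w in zip(x, scale)]
    return y, rstd


y, rstd = rms_norm_reference([3.0, 4.0], eps=0.0)
```

A replacement subgraph that exposes both values lets `getitem` users of index 0 and index 1 each be redirected, which is the consistency point raised above.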


py/torch_tensorrt/dynamo/lowering/passes/_aten_lowering_pass.py

meta-cla bot added the cla signed label Jan 16, 2026

github-actions bot added the component: lowering, component: core, component: api [Python], and component: dynamo labels Jan 16, 2026

github-actions bot requested a review from zewenli98 January 16, 2026 21:28

cehongwang force-pushed the fused_rms_norm_lowering branch from 2bd6654 to 55010b6 Compare January 17, 2026 03:09

github-actions bot added the component: tests label Jan 20, 2026

narendasan requested review from apbose and lanluo-nvidia January 21, 2026 01:40

apbose reviewed Jan 23, 2026


zewenli98 reviewed Jan 26, 2026


cehongwang linked an issue Jan 27, 2026 that may be closed by this pull request

✨[Feature] Fused_Rms_Norm Lowering #4048

Open

cehongwang force-pushed the fused_rms_norm_lowering branch from 595aa5e to 6d99cd6 Compare January 27, 2026 20:17

cehongwang and others added 5 commits January 28, 2026 00:01

Added fuse_rms_norm lowering

4385c1a

changed lowering pass to post lowering and implemented rms_norm

3769dfd

Added meta info; returned 2 nodes; added debug log

26a3d37

Fixed the tests

3e6d6f6

Recapture fake tensor meta data for inserted subgraphs (#4052)

a3ed926

Co-authored-by: cehongwang <wangcehong@gmail.com>

cehongwang force-pushed the fused_rms_norm_lowering branch from 121d636 to a3ed926 Compare January 28, 2026 00:01

narendasan reviewed Jan 28, 2026


py/torch_tensorrt/dynamo/lowering/passes/_aten_lowering_pass.py
