
Conversation

negvet (Collaborator) commented Dec 2, 2025

Description

Post-RHT amax can be estimated from pre-RHT amax.

This PR optimizes out the post-RHT amax (RHT+amax) kernel by estimating the post-RHT amax from the pre-RHT amax with a linear scale factor.
Amax fusion is required to see the perf benefits.
The feature is opt-in via NVTE_NVFP4_POST_RHT_AMAX_ESTIMATION=1.
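
For illustration, enabling the feature could look like the sketch below (hypothetical usage: it assumes the environment variable is read when the NVFP4 recipe is constructed, and uses the NVFP4BlockScaling recipe and fp8_autocast named elsewhere in this PR):

    import os

    # Opt in before the recipe is constructed; default behavior is unchanged
    # when the variable is unset (per the PR description).
    os.environ["NVTE_NVFP4_POST_RHT_AMAX_ESTIMATION"] = "1"

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import NVFP4BlockScaling

    recipe = NVFP4BlockScaling()
    layer = te.Linear(1024, 1024).cuda()
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = layer(torch.randn(16, 1024, device="cuda"))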

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet and others added 2 commits December 2, 2025 15:24
Signed-off-by: Evgeny <etsykunov@nvidia.com>
negvet marked this pull request as ready for review December 2, 2025 16:07
greptile-apps bot (Contributor) commented Dec 2, 2025

Greptile Overview

Greptile Summary

This PR introduces an experimental optimization for NVFP4 quantization that estimates post-RHT (Random Hadamard Transform) amax from pre-RHT amax using a configurable linear scale factor, eliminating the need for a separate RHT+amax kernel launch.

Key Changes:

  • Adds amax_estimation_scale configuration parameter throughout the quantization pipeline (C++ structs, Python dataclasses, and CUDA kernels)
  • Modifies the RHT cast fusion kernel to apply the estimation scale when computing global encode scale
  • Updates activation, bias, and normalization extensions to use fused paths when amax estimation is enabled
  • Feature is opt-in via NVTE_NVFP4_POST_RHT_AMAX_ESTIMATION=1 environment variable
  • Default scale factors: 2.0 for forward input activations, 1.0 for backward gradients

Performance Impact:

  • Reduces kernel launch overhead by skipping the RHT+amax kernel when amax fusion is available
  • Trade-off: estimated amax may affect numerical accuracy compared to true post-RHT amax
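
In effect, the fused kernel replaces a measured post-RHT amax with a single multiply in the epilogue. A schematic sketch of the relation (illustrative host-side Python, not the CUDA epilogue; the values are made up):

    import torch

    pre_rht_amax = torch.tensor(3.7)   # illustrative value, as produced by
                                       # the fused activation+amax kernel
    amax_estimation_scale = 2.0        # default for fwd activations (1.0 bwd)

    # What this PR computes instead of launching a separate RHT+amax kernel:
    est_post_rht_amax = amax_estimation_scale * pre_rht_amax
    # The global FP4 encode scale is then derived from this estimate exactly
    # as it would be from a true post-RHT amax.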

Confidence Score: 4/5

  • This PR is safe to merge - it's an opt-in experimental feature behind an environment variable that doesn't affect default behavior.
  • The implementation is well-structured with consistent changes across C++, CUDA, and Python layers. The feature is properly gated behind an environment variable. The code follows existing patterns in the codebase. The only concern is ensuring the fallback path (non-fused kernel) correctly handles the amax estimation when inputs don't meet fusion kernel requirements.
  • transformer_engine/pytorch/csrc/quantizer.cpp - verify the fallback path correctly applies amax estimation when the fused RHT kernel cannot be used.

Important Files Changed

File Analysis

  • transformer_engine/common/hadamard_transform/hadamard_transform_cast_fusion.cu (5/5): Core kernel changes: passes amax_scale through the RHT+cast fusion kernel to multiply global_amax by the estimation scale factor in the epilogue.
  • transformer_engine/common/recipe/init.py (4/5): Adds the amax estimation configuration to the NVFP4BlockScaling recipe with env var controls. Docstrings match the implementation defaults (2.0 for fwd, 1.0 for bwd).
  • transformer_engine/pytorch/csrc/quantizer.cpp (4/5): Core quantizer changes: handles amax estimation by computing the pre-RHT amax and passing the scale to the fusion kernel. Adds a fallback path that computes the pre-RHT amax when estimation is enabled.
  • transformer_engine/pytorch/tensor/nvfp4_tensor.py (5/5): Adds the amax_estimation_scale parameter to the NVFP4Quantizer class and propagates it through the copy() method.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Recipe as NVFP4BlockScaling Recipe
    participant Quantizer as NVFP4Quantizer
    participant Activation as Activation/Norm Extension
    participant Kernel as CUDA Kernel

    User->>Recipe: Create recipe with use_post_rht_amax_estimation=True
    Recipe->>Recipe: Set amax_estimation_scale (2.0 fwd, 1.0 bwd)
    Recipe->>Quantizer: Pass amax_estimation_scale via QParams
    
    alt Fused Path (amax estimation enabled)
        Activation->>Activation: Select FUSED_ACTIVATION_AMAX_NVFP4 impl
        Activation->>Kernel: Compute activation + pre-RHT amax
        Kernel-->>Quantizer: Return pre-RHT amax
        Quantizer->>Kernel: RHT cast fusion with amax_scale
        Kernel->>Kernel: global_amax_val = pre_rht_amax * amax_scale
        Kernel->>Kernel: Compute FP4 quantization with scaled amax
    else Unfused Path (true post-RHT amax)
        Activation->>Activation: Select UNFUSED impl
        Quantizer->>Kernel: nvte_hadamard_transform_amax
        Kernel-->>Quantizer: Return true post-RHT amax
        Quantizer->>Kernel: Quantize with true amax
    end

greptile-apps bot (Contributor) left a comment

12 files reviewed, 1 comment

negvet and others added 2 commits December 3, 2025 15:23
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Evgeny Tsykunov <e.tsykunov@gmail.com>
greptile-apps bot (Contributor) left a comment

12 files reviewed, no comments

qijiaxing commented

Interesting idea. But how should these two estimation scales be set?

Updated the default scale factor for forward input activations in post-RHT amax estimation to 2.0.

Signed-off-by: Evgeny Tsykunov <e.tsykunov@gmail.com>
greptile-apps bot (Contributor) left a comment

12 files reviewed, no comments

negvet (Collaborator, Author) commented Dec 5, 2025

how should these two estimation scales be set?

One way to go is to estimate the real distribution of the data (i.e., how amax is affected by the RHT). From my experiments, I observe that amax(RHT(X)) / amax(X) is up to 2.0 and amax(RHT(G)) / amax(G) is up to 1.0, so setting the scales to 2.0 and 1.0 is an option.

In practice, quite a wide range of scales works well, thanks to the wide dynamic range of E4M3 (the amax misestimation eventually cancels out, as long as it stays within the E4M3 range).
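
As a rough illustration of that measurement, the ratio could be probed offline with a reference Hadamard transform (a sketch only; it uses a plain normalized Hadamard in place of TE's RHT kernel, and random data in place of captured activations):

    import torch

    def reference_rht(x: torch.Tensor) -> torch.Tensor:
        # Plain normalized Hadamard transform over the last dim; a stand-in
        # for TE's RHT (the real transform also applies random signs).
        n = x.shape[-1]
        h = torch.tensor([[1.0]])
        while h.shape[0] < n:
            h = torch.cat([torch.cat([h, h], 1), torch.cat([h, -h], 1)], 0)
        return x @ (h / n**0.5)

    # Stand-ins for captured activations X (or gradients G).
    samples = [torch.randn(256, 64) for _ in range(8)]
    ratios = [(reference_rht(x).abs().max() / x.abs().max()).item()
              for x in samples]
    print(f"max amax(RHT(X)) / amax(X) over samples: {max(ratios):.3f}")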

negvet (Collaborator, Author) commented Dec 9, 2025

/te-ci

greptile-apps bot (Contributor) left a comment
Additional Comments (1)

  1. transformer_engine/pytorch/csrc/extensions/activation.cpp, lines 45-54

    style: Identical logic block duplicated in both forward and backward paths - consider extracting this decision logic into a helper function to avoid code duplication

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

12 files reviewed, 1 comment


Signed-off-by: Evgeny <etsykunov@nvidia.com>
greptile-apps bot (Contributor) left a comment

12 files reviewed, no comments
