
[TRTLLM-12288][feat] Support NVFP4 W4A16 inference on Hopper for Nemotron H models #14009

Draft · tijyojwad wants to merge 2 commits into NVIDIA:main from tijyojwad:jdaw/nvfp4-w4a16-hopper

Conversation

tijyojwad (Collaborator) commented May 11, 2026

On GPUs without FP4 tensor cores (sm < 100, e.g. Hopper), dequantize NVFP4 weights to BF16 and use standard matmul instead of nvfp4_gemm. Activations remain in BF16 throughout.

Changes:

  • Add NVFP4W4A16LinearMethod: inherits NVFP4 weight storage, overrides apply() to dequant weights to BF16 + F.linear
  • Route get_quant_method() to the W4A16 method when sm < 100 (see the routing sketch after this list)
  • Guard is_nvfp4 in NemotronHLayer with sm >= 100 to disable fused RMSNorm+NVFP4 and Fp4QuantizedTensor on Hopper
  • MoE on Hopper: override quant config to unquantized, wrap load_weights to dequant NVFP4 expert weights to BF16 at load time
  • Add unit tests for dequant, linear forward, routing, and MoE dequant
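
The routing bullet above reduces to a compute-capability check. A minimal sketch, assuming the standard PyTorch capability API; the helper name and the surrounding `get_quant_method()` shape are illustrative, not the exact TRT-LLM code:

```python
import torch

def has_fp4_tensor_cores() -> bool:
    # FP4 tensor cores arrive with SM 100 (Blackwell); Hopper is SM 90.
    major, minor = torch.cuda.get_device_capability()
    return major * 10 + minor >= 100

# Illustrative gate inside get_quant_method() (NVFP4W4A16LinearMethod is
# from this PR; the base-class name is an assumption):
#
#   if nvfp4_configured and not has_fp4_tensor_cores():
#       return NVFP4W4A16LinearMethod()  # dequant weights to BF16 + F.linear
#   return NVFP4LinearMethod()           # native W4A4 nvfp4_gemm path
```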

Summary by CodeRabbit

  • New Features

    • Added NVFP4 quantization fallback support for older GPUs, enabling inference on a broader range of hardware by automatically converting quantized weights during model loading.
  • Tests

    • Added comprehensive test coverage for NVFP4 dequantization and fallback inference behavior.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

tijyojwad and others added 2 commits May 11, 2026 13:40
On GPUs without FP4 tensor cores (sm < 100, e.g. Hopper), dequantize
NVFP4 weights to BF16 and use standard matmul instead of nvfp4_gemm.
Activations remain in BF16 throughout.

Changes:
- Add NVFP4W4A16LinearMethod: inherits NVFP4 weight storage, overrides
  apply() to dequant weights to BF16 + F.linear
- Route get_quant_method() to W4A16 method when sm < 100
- Guard is_nvfp4 in NemotronHLayer with sm >= 100 to disable fused
  RMSNorm+NVFP4 and Fp4QuantizedTensor on Hopper
- MoE on Hopper: override quant config to unquantized, wrap load_weights
  to dequant NVFP4 expert weights to BF16 at load time
- Add unit tests for dequant, linear forward, routing, and MoE dequant

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Move duplicated E2M1_VALUES lookup table and dequantization logic from
NVFP4W4A16LinearMethod (linear.py) and NemotronHMOE (modeling_nemotron_h.py)
into a shared dequantize_nvfp4() function in fp4_utils.py. This makes the
FP4 dequant utility reusable by any module that needs NVFP4 weight
dequantization without duplicating the E2M1 LUT and nibble-unpacking code.

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
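
For reviewers unfamiliar with the NVFP4 layout, here is a self-contained sketch of what such a shared dequant utility does. The E2M1 codebook and the 16-element block size are NVFP4 conventions; the nibble order, the handling of the per-tensor global scale, and the exact signature of `dequantize_nvfp4()` in `fp4_utils.py` are assumptions:

```python
import torch

# E2M1 codebook: the 16 values a 4-bit NVFP4 nibble can encode.
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_nvfp4_sketch(packed: torch.Tensor,
                            block_scales: torch.Tensor,
                            block_size: int = 16) -> torch.Tensor:
    """packed: uint8 [n, k // 2], two FP4 values per byte.
    block_scales: per-block FP8 scales, [n, k // block_size].
    Returns the dequantized BF16 [n, k] weight (a per-tensor global
    scale, if present, would multiply on top of this)."""
    lo = (packed & 0x0F).long()   # low nibble first (order assumed)
    hi = (packed >> 4).long()     # high nibble second
    idx = torch.stack((lo, hi), dim=-1).flatten(-2)       # [n, k]
    w = E2M1_VALUES.to(packed.device)[idx]                # codebook lookup
    w = w.view(w.shape[0], -1, block_size) \
        * block_scales.float().unsqueeze(-1)              # per-block scale
    return w.flatten(-2).to(torch.bfloat16)
```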
@tijyojwad tijyojwad requested review from a team as code owners May 11, 2026 21:26
coderabbitai (bot) commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 40ec62ea-585a-4a01-a811-dcd045e676e1

📥 Commits

Reviewing files that changed from the base of the PR and between e197b69 and 8ce17ef.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/quantization/utils/fp4_utils.py
  • tests/unittest/_torch/thop/parallel/test_nvfp4_w4a16.py
👮 Files not reviewed due to content moderation or server errors (4)
  • tensorrt_llm/quantization/utils/fp4_utils.py
  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tests/unittest/_torch/thop/parallel/test_nvfp4_w4a16.py

📝 Walkthrough

This PR extends NVFP4 quantization support to GPUs without FP4 tensor cores by introducing a fallback dequantization path. NVFP4 weights are dequantized to BF16 at load or inference time, enabling W4A16 computation on legacy hardware (SM < 100) via a new linear method, with SM-aware routing and MoE integration.

Changes

NVFP4 W4A16 Legacy GPU Support

| Layer / File(s) | Summary |
| --- | --- |
| E2M1 Codebook & NVFP4 Dequantization Utilities<br>`tensorrt_llm/quantization/utils/fp4_utils.py` | Introduces `E2M1_VALUES`, a nibble-to-float lookup tensor for the NVFP4 codebook, and `dequantize_nvfp4()`, which unpacks uint8-packed weights, applies per-block FP8 scaling, and outputs unpadded BF16 matrices. |
| W4A16 Linear Method & Routing<br>`tensorrt_llm/_torch/modules/linear.py` | Adds `NVFP4W4A16LinearMethod` with `_dequantize_weight()` to reconstruct BF16 from packed tensors, dequant-then-`F.linear` compute paths for normal and allreduce modes, and skipped padding in `post_load_weights()`. `get_quant_method()` routes to this method on SM < 100 with NVFP4 quantization. |
| MoE Load-time Dequantization Wrapper<br>`tensorrt_llm/_torch/models/modeling_nemotron_h.py` | Sets the `_nvfp4_dequant_moe` flag when NVFP4 is detected on SM < 100; wraps `self.experts.load_weights` via `_wrap_moe_load_weights_for_dequant()` to intercept uint8 expert weights, dequantize them to BF16 using their scales, delete the quantization metadata, and pass the dequantized tensors to the original loader (see the sketch below). |
| Layer & MoE NVFP4 Hardware Gating<br>`tensorrt_llm/_torch/models/modeling_nemotron_h.py` | `NemotronHLayer.is_nvfp4` now requires both NVFP4 quant mode and FP4-capable hardware (SM >= 100). Per-layer `quant_config_dict` overrides are gated the same way, so native W4A4 runs only on capable hardware while older GPUs take the dequant fallback routes. |
| W4A16 Test Suite & Validation<br>`tests/unittest/_torch/thop/parallel/test_nvfp4_w4a16.py` | Covers dequant roundtrip accuracy, reference matching against the AutoDeploy baseline, `NVFP4W4A16LinearMethod` forward-output correctness (with SM mocking), SM-version routing for Hopper/Blackwell/ARC configs, W4A4 vs. W4A16 output comparison, and multi-expert dequantization consistency. |
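
The MoE wrapper row is the least obvious piece of the table, so here is a minimal sketch of the interception pattern. The dict-style weights argument and the `"_scale"` key suffix are assumptions; the real `_wrap_moe_load_weights_for_dequant()` may differ in detail:

```python
import torch
from functools import wraps

def wrap_moe_load_weights_for_dequant(experts, dequantize_nvfp4):
    """Dequantize packed NVFP4 expert weights to BF16 before the
    original (unquantized) loader ever sees them."""
    original_load = experts.load_weights

    @wraps(original_load)
    def load_weights(weights: dict):
        for name in list(weights):
            if name not in weights:
                continue                      # already consumed as a scale
            tensor = weights[name]
            if tensor.dtype == torch.uint8:   # packed NVFP4 expert weight
                scale = weights.pop(name + "_scale")  # scale key assumed
                weights[name] = dequantize_nvfp4(tensor, scale)
        return original_load(weights)

    experts.load_weights = load_weights
```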

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main feature: adding NVFP4 W4A16 inference support on Hopper GPUs for Nemotron H models, matching the changeset's core purpose. |
| Description check | ✅ Passed | The PR description provides a clear explanation of the changes and includes test coverage details, but the PR Checklist section uses the template's boilerplate without explicit verification of all items. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |
| Linked Issues check | ✅ Passed | Skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@tijyojwad tijyojwad marked this pull request as draft May 11, 2026 21:33