
[TRTLLM-12288][feat] Support NVFP4 W4A16 inference on Hopper for Nemotron H models #14009

Draft · tijyojwad wants to merge 2 commits into NVIDIA:main from tijyojwad:jdaw/nvfp4-w4a16-hopper

Conversation

tijyojwad (Collaborator) commented May 11, 2026

On GPUs without FP4 tensor cores (sm < 100, e.g. Hopper), dequantize NVFP4 weights to BF16 and use standard matmul instead of nvfp4_gemm. Activations remain in BF16 throughout.

Changes:

  • Add NVFP4W4A16LinearMethod: inherits NVFP4 weight storage, overrides apply() to dequant weights to BF16 + F.linear
  • Route get_quant_method() to the W4A16 method when sm < 100 (see the routing sketch after this list)
  • Guard is_nvfp4 in NemotronHLayer with sm >= 100 to disable fused RMSNorm+NVFP4 and Fp4QuantizedTensor on Hopper
  • MoE on Hopper: override quant config to unquantized, wrap load_weights to dequant NVFP4 expert weights to BF16 at load time
  • Add unit tests for dequant, linear forward, routing, and MoE dequant
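
The routing bullet above reduces to a compute-capability check. A minimal sketch, assuming the standard PyTorch capability API; the helper name and the surrounding `get_quant_method()` shape are illustrative, not the exact TRT-LLM code:

```python
import torch

def has_fp4_tensor_cores() -> bool:
    # FP4 tensor cores arrive with SM 100 (Blackwell); Hopper is SM 90.
    major, minor = torch.cuda.get_device_capability()
    return major * 10 + minor >= 100

# Illustrative gate inside get_quant_method() (NVFP4W4A16LinearMethod is
# from this PR; the base-class name is an assumption):
#
#   if nvfp4_configured and not has_fp4_tensor_cores():
#       return NVFP4W4A16LinearMethod()  # dequant weights to BF16 + F.linear
#   return NVFP4LinearMethod()           # native W4A4 nvfp4_gemm path
```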

Summary by CodeRabbit

  • New Features

    • Added NVFP4 quantization fallback support for older GPUs, enabling inference on a broader range of hardware by automatically converting quantized weights during model loading.
  • Tests

    • Added comprehensive test coverage for NVFP4 dequantization and fallback inference behavior.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

tijyojwad and others added 2 commits May 11, 2026 13:40
On GPUs without FP4 tensor cores (sm < 100, e.g. Hopper), dequantize
NVFP4 weights to BF16 and use standard matmul instead of nvfp4_gemm.
Activations remain in BF16 throughout.

Changes:
- Add NVFP4W4A16LinearMethod: inherits NVFP4 weight storage, overrides
  apply() to dequant weights to BF16 + F.linear
- Route get_quant_method() to W4A16 method when sm < 100
- Guard is_nvfp4 in NemotronHLayer with sm >= 100 to disable fused
  RMSNorm+NVFP4 and Fp4QuantizedTensor on Hopper
- MoE on Hopper: override quant config to unquantized, wrap load_weights
  to dequant NVFP4 expert weights to BF16 at load time
- Add unit tests for dequant, linear forward, routing, and MoE dequant

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Move duplicated E2M1_VALUES lookup table and dequantization logic from
NVFP4W4A16LinearMethod (linear.py) and NemotronHMOE (modeling_nemotron_h.py)
into a shared dequantize_nvfp4() function in fp4_utils.py. This makes the
FP4 dequant utility reusable by any module that needs NVFP4 weight
dequantization without duplicating the E2M1 LUT and nibble-unpacking code.

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
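
For reviewers unfamiliar with the NVFP4 layout, here is a self-contained sketch of what such a shared dequant utility does. The E2M1 codebook and the 16-element block size are NVFP4 conventions; the nibble order, the handling of the per-tensor global scale, and the exact signature of `dequantize_nvfp4()` in `fp4_utils.py` are assumptions:

```python
import torch

# E2M1 codebook: the 16 values a 4-bit NVFP4 nibble can encode.
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_nvfp4_sketch(packed: torch.Tensor,
                            block_scales: torch.Tensor,
                            block_size: int = 16) -> torch.Tensor:
    """packed: uint8 [n, k // 2], two FP4 values per byte.
    block_scales: per-block FP8 scales, [n, k // block_size].
    Returns the dequantized BF16 [n, k] weight (a per-tensor global
    scale, if present, would multiply on top of this)."""
    lo = (packed & 0x0F).long()   # low nibble first (order assumed)
    hi = (packed >> 4).long()     # high nibble second
    idx = torch.stack((lo, hi), dim=-1).flatten(-2)       # [n, k]
    w = E2M1_VALUES.to(packed.device)[idx]                # codebook lookup
    w = w.view(w.shape[0], -1, block_size) \
        * block_scales.float().unsqueeze(-1)              # per-block scale
    return w.flatten(-2).to(torch.bfloat16)
```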
@tijyojwad tijyojwad requested review from a team as code owners May 11, 2026 21:26
coderabbitai (bot) commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 40ec62ea-585a-4a01-a811-dcd045e676e1

📥 Commits

Reviewing files that changed from the base of the PR and between e197b69 and 8ce17ef.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/quantization/utils/fp4_utils.py
  • tests/unittest/_torch/thop/parallel/test_nvfp4_w4a16.py
👮 Files not reviewed due to content moderation or server errors (4)
  • tensorrt_llm/quantization/utils/fp4_utils.py
  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tests/unittest/_torch/thop/parallel/test_nvfp4_w4a16.py

📝 Walkthrough

This PR extends NVFP4 quantization support to GPUs without FP4 tensor cores by introducing a fallback dequantization path. NVFP4 weights are dequantized to BF16 at load or inference time, enabling W4A16 computation on legacy hardware (SM < 100) via a new linear method, with SM-aware routing and MoE integration.

Changes

NVFP4 W4A16 Legacy GPU Support

| Layer / File(s) | Summary |
| --- | --- |
| E2M1 Codebook & NVFP4 Dequantization Utilities<br>`tensorrt_llm/quantization/utils/fp4_utils.py` | Introduces `E2M1_VALUES`, a nibble-to-float lookup tensor for the NVFP4 codebook, and `dequantize_nvfp4()`, which unpacks uint8-packed weights, applies per-block FP8 scaling, and outputs unpadded BF16 matrices. |
| W4A16 Linear Method & Routing<br>`tensorrt_llm/_torch/modules/linear.py` | Adds `NVFP4W4A16LinearMethod` with `_dequantize_weight()` to reconstruct BF16 from packed tensors, dequant-then-`F.linear` compute paths for normal and allreduce modes, and skipped padding in `post_load_weights()`. `get_quant_method()` routes to this method on SM < 100 with NVFP4 quantization. |
| MoE Load-time Dequantization Wrapper<br>`tensorrt_llm/_torch/models/modeling_nemotron_h.py` | Sets the `_nvfp4_dequant_moe` flag when NVFP4 is detected on SM < 100; wraps `self.experts.load_weights` via `_wrap_moe_load_weights_for_dequant()` to intercept uint8 expert weights, dequantize them to BF16 using their scales, delete the quantization metadata, and pass the dequantized tensors to the original loader (see the sketch below). |
| Layer & MoE NVFP4 Hardware Gating<br>`tensorrt_llm/_torch/models/modeling_nemotron_h.py` | `NemotronHLayer.is_nvfp4` now requires both NVFP4 quant mode and FP4-capable hardware (SM >= 100). Per-layer `quant_config_dict` overrides are gated the same way, so native W4A4 runs only on capable hardware while older GPUs take the dequant fallback routes. |
| W4A16 Test Suite & Validation<br>`tests/unittest/_torch/thop/parallel/test_nvfp4_w4a16.py` | Covers dequant roundtrip accuracy, reference matching against the AutoDeploy baseline, `NVFP4W4A16LinearMethod` forward-output correctness (with SM mocking), SM-version routing for Hopper/Blackwell/ARC configs, W4A4 vs. W4A16 output comparison, and multi-expert dequantization consistency. |
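
The MoE wrapper row is the least obvious piece of the table, so here is a minimal sketch of the interception pattern. The dict-style weights argument and the `"_scale"` key suffix are assumptions; the real `_wrap_moe_load_weights_for_dequant()` may differ in detail:

```python
import torch
from functools import wraps

def wrap_moe_load_weights_for_dequant(experts, dequantize_nvfp4):
    """Dequantize packed NVFP4 expert weights to BF16 before the
    original (unquantized) loader ever sees them."""
    original_load = experts.load_weights

    @wraps(original_load)
    def load_weights(weights: dict):
        for name in list(weights):
            if name not in weights:
                continue                      # already consumed as a scale
            tensor = weights[name]
            if tensor.dtype == torch.uint8:   # packed NVFP4 expert weight
                scale = weights.pop(name + "_scale")  # scale key assumed
                weights[name] = dequantize_nvfp4(tensor, scale)
        return original_load(weights)

    experts.load_weights = load_weights
```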

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main feature: adding NVFP4 W4A16 inference support on Hopper GPUs for Nemotron H models, matching the changeset's core purpose. |
| Description check | ✅ Passed | The PR description provides a clear explanation of the changes and includes test coverage details, but the PR Checklist section uses the template's boilerplate without explicit verification of all items. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |
| Linked Issues check | ✅ Passed | Skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@tijyojwad tijyojwad marked this pull request as draft May 11, 2026 21:33