NVFP4 forward + MXFP8 backward Recipe #5
This PR introduces a proof-of-concept implementation of NVFP4 forward + MXFP8 backward training.
The work is intentionally scoped as a local PoC and serves as a foundation for subsequent iterations.
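For illustration only, a minimal fake-quantization sketch of the recipe follows: the forward GEMM consumes NVFP4-quantized operands (E2M1 elements, 16-element blocks with FP8 E4M3 scales) and the backward GEMMs consume MXFP8-quantized operands (E4M3 elements, 32-element blocks with power-of-two scales). The helper names `fake_quant_nvfp4` / `fake_quant_mxfp8` and the simulation details are assumptions for exposition, not this PR's code, which would dispatch to fused quantize + GEMM kernels instead of emulating the formats in PyTorch.

```python
import torch

_E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes

def fake_quant_nvfp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Simulate NVFP4: E2M1 elements, FP8 (E4M3) scale per 16-element block.
    (The real format adds a second per-tensor FP32 scale, omitted here.)"""
    xb = x.reshape(-1, block)
    amax = xb.abs().amax(dim=1, keepdim=True)
    # Block scale maps the block max onto the E2M1 max (6.0); keep it inside
    # E4M3's representable range before round-tripping through FP8.
    scale = (amax / 6.0).clamp(min=2.0 ** -9, max=448.0)
    scale = scale.to(torch.float8_e4m3fn).to(x.dtype)
    grid = _E2M1.to(x.device, x.dtype)
    q = (xb.abs() / scale).clamp(max=6.0)
    snapped = grid[(q.unsqueeze(-1) - grid).abs().argmin(dim=-1)]  # nearest E2M1
    return (xb.sign() * snapped * scale).reshape(x.shape)

def fake_quant_mxfp8(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Simulate MXFP8: E4M3 elements, power-of-two (E8M0) scale per 32 elements."""
    xb = x.reshape(-1, block)
    amax = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))  # 448 = E4M3 max
    q = (xb / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(x.dtype)
    return (q * scale).reshape(x.shape)

class NVFP4FwdMXFP8BwdLinear(torch.autograd.Function):
    """Linear layer with NVFP4 fprop and MXFP8 dgrad/wgrad (fake-quant sketch)."""

    @staticmethod
    def forward(ctx, x, w):                      # x: (M, K), w: (N, K)
        ctx.save_for_backward(x, w)
        return fake_quant_nvfp4(x) @ fake_quant_nvfp4(w).t()

    @staticmethod
    def backward(ctx, grad_out):                 # grad_out: (M, N)
        x, w = ctx.saved_tensors
        g = fake_quant_mxfp8(grad_out)
        dgrad = g @ fake_quant_mxfp8(w)          # (M, K)
        wgrad = g.t() @ fake_quant_mxfp8(x)      # (N, K)
        return dgrad, wgrad

# Smoke test (shapes must be divisible by the 16/32 block sizes):
x = torch.randn(64, 128, requires_grad=True)
w = torch.randn(256, 128, requires_grad=True)
NVFP4FwdMXFP8BwdLinear.apply(x, w).sum().backward()
```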
Motivations
Implementation challenge in full NVFP4 training: The initial goal was end-to-end NVFP4 (forward + backward). However, NVFP4 matmuls in cuBLASLt currently support TN-only layouts, which would require an additional transpose kernel for the backward pass; we defer this to follow-up work. (NVIDIA's official v2.8 release already has full NVFP4 support.)
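For context on why the constraint bites only in the backward pass, the three GEMMs of a linear layer with activations $X \in \mathbb{R}^{M \times K}$ and weights $W \in \mathbb{R}^{N \times K}$ are

$$
Y = X W^{\top} \quad (\text{fprop}), \qquad
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W \quad (\text{dgrad}), \qquad
\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial Y}\right)^{\top} X \quad (\text{wgrad}).
$$

With a TN-only FP4 GEMM, dgrad and wgrad need operands in layouts transposed relative to what fprop quantization produces; since the quantized tensors are sub-byte, that means either quantizing each tensor twice or adding a dedicated FP4 transpose kernel, which is the work deferred above.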
Use-case 1: More efficient than NVFP4-QAT
MXFP8 backward is substantially more efficient than NVFP4-QAT pipelines, which still rely on 16-/32-bit backward passes.
Use-case 2: Practicality of full NVFP4 training
While NVFP4 training has advanced significantly, it still requires several supporting techniques: (1) Hadamard transforms, (2) selective higher-precision layers, and (3) switching back to higher precision for the last fraction of training, as also seen in the recent MLPerf v5.1 NVFP4 submission. MXFP8 backward can therefore be valuable, either for last-mile convergence or from the get-go.
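For concreteness, a minimal sketch of technique (1) is below; the block size, Sylvester construction, and sign randomization are generic assumptions about how such transforms are typically used, not this PR's code. Rotating each block spreads outliers across its elements so values land better on FP4's coarse 16-point grid.

```python
import torch

def random_hadamard_rotate(x: torch.Tensor, block: int = 16,
                           seed: int = 0) -> torch.Tensor:
    """Rotate each contiguous `block` of values by a sign-randomized
    orthonormal Hadamard matrix before quantization."""
    # Sylvester construction: H_{2n} = [[H, H], [H, -H]]; block must be 2^k.
    H = torch.ones(1, 1)
    while H.shape[0] < block:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    H = (H / block ** 0.5).to(x.device, x.dtype)  # orthonormal: H @ H.T = I
    gen = torch.Generator().manual_seed(seed)     # reproducible random signs
    d = (torch.randint(0, 2, (block,), generator=gen) * 2.0 - 1.0)
    return ((x.reshape(-1, block) * d.to(x.device, x.dtype)) @ H).reshape(x.shape)
```

Applied with the same seed along the contraction dimension of both GEMM operands, the rotations cancel in the product ($(XR)(R^{\top}W^{\top}) = XW^{\top}$ for the orthogonal $R = \mathrm{diag}(d)\,H$), so only the quantization error changes, not the mathematical result.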
Quick Summary of the Implementation:
Use-case studies here.