[Feat] Fused qknorm + quant for dpsk v2 model #963
Open
qichu-yun wants to merge 2 commits into
Force-pushed from a984baa to 968577e
valarLip requested changes on May 29, 2026
)
return y

q_scale = fp4_utils.e8m0_shuffle(q_scale.view(torch.float8_e8m0fnu))
Collaborator
We should remove this e8m0_shuffle by fusing it into another op.
Contributor
Author
This e8m0_shuffle() is only on the fallback path. The native SGLang-style FP4 GEMM path consumes raw q_scale directly and returns before this line, so the optimized path has no post-shuffle. It is only used when we fall back to ATOM q_b_proj, which still requires shuffled scale layout. Removing it without removing/replacing the fallback would reintroduce the scale-layout correctness issue.
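The control flow described in this reply can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `QuantizedActivation`, `shuffle_scale`, and `q_b_proj_dispatch` are hypothetical names, the GEMMs are passed in as plain callables, and the list reversal merely stands in for `fp4_utils.e8m0_shuffle`, whose real layout is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical GEMM signature: (packed data, scales) -> result.
Gemm = Callable[[List[int], List[int]], object]


@dataclass
class QuantizedActivation:
    data: List[int]   # stand-in for packed FP4 values
    scale: List[int]  # stand-in for raw (unshuffled) E8M0 scales


def shuffle_scale(scale: List[int]) -> List[int]:
    # Placeholder for fp4_utils.e8m0_shuffle. Reversal is NOT the real
    # layout; it only makes the reordering visible in this sketch.
    return list(reversed(scale))


def q_b_proj_dispatch(
    act: QuantizedActivation,
    native_gemm: Optional[Gemm],
    fallback_gemm: Gemm,
) -> object:
    # Optimized path: the native FP4 GEMM consumes the raw scale and
    # returns early, so no shuffle is executed.
    if native_gemm is not None:
        return native_gemm(act.data, act.scale)

    # Fallback path: the fallback GEMM is assumed to require the shuffled
    # scale layout, so the shuffle happens only here.
    return fallback_gemm(act.data, shuffle_scale(act.scale))
```

Under this structure, deleting the shuffle without also replacing the fallback GEMM would hand that GEMM a scale layout it does not expect, which is the correctness concern raised above.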
valarLip previously approved these changes on Jun 1, 2026
valarLip approved these changes on Jun 2, 2026
Motivation
Improve DeepSeek V2 performance by enabling fused q/k RMSNorm + quantization for both FP8 and MXFP4 paths.
Technical Details
This commit enables fuse_qknorm_quant for MXFP4, preserves unshuffled MXFP4 attention projection weights for raw-scale GEMM, and routes SGLang q_b_proj through the proper FP4 GEMM path when fused quantization is used. Existing non-fused and fallback paths remain unchanged.
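For readers unfamiliar with what the fused kernel computes, here is an unfused, pure-Python reference of RMSNorm followed by MXFP4-style block quantization. It is a sketch under stated assumptions, not the kernel in this PR: it assumes the OCP Microscaling convention of 32-element blocks, a power-of-two shared scale, and E2M1 element values. All function names are hypothetical. It returns the unbiased scale exponent rather than the biased E8M0 byte, and it breaks rounding ties toward the smaller magnitude rather than to even.

```python
import math
from typing import List, Tuple

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes
E2M1_EMAX = 2   # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK = 32      # assumed MX block size


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def quantize_block(block: List[float]) -> Tuple[List[float], int]:
    # Returns (E2M1 values as floats, unbiased shared exponent).
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 0
    exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** exp
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda c: abs(c - mag))
        quantized.append(math.copysign(nearest, v))
    return quantized, exp


def norm_then_quant(
    x: List[float], weight: List[float], block: int = BLOCK
) -> Tuple[List[float], List[int]]:
    # Unfused reference: normalize first, then quantize block by block.
    y = rms_norm(x, weight)
    values: List[float] = []
    exps: List[int] = []
    for i in range(0, len(y), block):
        q, e = quantize_block(y[i:i + block])
        values.extend(q)
        exps.append(e)
    return values, exps
```

A fused kernel would produce the quantized values and per-block scales in a single pass over the input instead of materializing the normalized tensor; the per-block scales are what the discussion above calls `q_scale`.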
Test Result
before:

after:

Submission Checklist