Extend DQ→MatMulNBits fusion to support Gemm + per-tensor/per-channel quantization (#27769)
Conversation
Pull request overview
Extends the existing DQ→MatMulNBits fusion to cover Gemm and non-blockwise quantization (per-tensor/per-channel), adding a new session option to control the effective block size when expanding scales/zero-points.
Changes:
- Add selector validation for Gemm and for per-tensor/per-channel DQ patterns (with optional bias for Gemm).
- Add action logic to expand non-blockwise scales/zero-points into MatMulNBits’ expected blockwise format, wire Gemm bias into MatMulNBits, and plumb a new block-size session option.
- Add extensive new unit tests covering newly supported and negative patterns.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc | Adds tests for per-tensor/per-channel DQ→MatMul, DQ→Gemm (with/without bias), block size option, and negative cases. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h | Updates selector contract and adds an override to keep bias DQ out of removal set. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc | Implements generalized DQ validation, Gemm attribute validation, and selection updates for Gemm/bias DQ. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h | Extends transformer ctor to accept new block-size option. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc | Registers Gemm for the fusion and threads the block-size session option through the registry/action. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h | Extends fusion action to carry block-size-for-non-blockwise configuration. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc | Adds block-size computation, expansion for per-tensor/per-channel scale/zp, and Gemm attribute/bias handling. |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Reads new session option and passes it into the QDQ selector/action transformer. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Documents the new session.qdq_matmulnbits_block_size option. |
Force-pushed from 19181dd to f3594ed.
`MakeInitializer(shape, T(1, 0), T(1, 0))` calls `Uniform` with `min == max`, which constructs `uniform_int_distribution(1, 0)` (since `Uniform` samples from `[min, max)`). This triggers an assertion failure on both MSVC and GCC. Use the explicit-data overload instead: `MakeInitializer<T>({}, {T(1, 0)})`.
Shrink per-tensor test dimensions from 768x768/768x3072 to 96x96/96x384. The same code paths are exercised regardless of matrix size.
Remove symmetric type/zp combos where one signed+zp and one unsigned-no-zp call sufficiently covers the code paths. Reduces Run* calls from 172 to 153 (38 fewer inference sessions).
Trim redundant type/zp/accuracy_level permutations in negative tests (NonConstDQ, FirstDQInput, ShapeMismatch) where the rejection logic doesn't depend on these parameters. Also trim new per-tensor, per-channel, Gemm, and uint8 tests to representative combinations. Session-creating invocations: 153 -> 63 (59% reduction). All 34 DQMatMul/DQGemm tests still pass.
Pull request overview
Extends the existing DQ→MatMulNBits fusion to cover more real-world patterns by supporting Gemm and non-blockwise (per-tensor/per-channel) DequantizeLinear quantization parameters.
Changes:
- Generalized DQ validation to accept blockwise, per-tensor, and per-channel quantization; added Gemm attribute validation.
- Updated selector/action wiring to fuse DQ→(MatMul|Gemm) into MatMulNBits, including bias passthrough for Gemm and session-configured block size expansion.
- Added extensive new tests for per-tensor/per-channel and Gemm variants plus block-size option behavior.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc | Adds new fusion tests for per-tensor/per-channel and Gemm (bias variants), and trims redundant negative test permutations. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h | Updates selector contract/comments and adds an UpdateBuilder override to avoid removing Gemm bias DQ. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc | Implements generalized DQ validation, Gemm attribute checks, Gemm support in DQ→MatMulNBits selection, and UpdateBuilder trimming. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h | Extends transformer constructor API to accept the MatMulNBits block-size session option. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc | Threads block-size option through rule registration and registers Gemm alongside MatMul for this fusion. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h | Extends action API to accept block-size for non-blockwise DQ expansion. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc | Adds block-size computation, expands per-tensor/per-channel scale/zp to blockwise format, wires Gemm bias to MatMulNBits, and updates transpose logic. |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Parses and passes the new session.qdq_matmulnbits_block_size config into the QDQ transformer. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Documents the new session option key for configuring MatMulNBits block size. |
Comments suppressed due to low confidence (1)
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc:624
`ExtraAttributes` also assumes the weight `NodeArg` has a non-null shape with concrete dim values. That assumption is not guaranteed and can cause crashes during transformation. Recommended fix: read `K`/`N` from the weight constant initializer's TensorProto dims (which the selector already requires), or otherwise fall back to `NodeArg::Shape()` only when it is present and fully defined.
```cpp
const auto* weight_shape = dq_node->InputDefs()[0]->Shape();
utils::SetNodeAttribute(utils::MakeAttribute("K", weight_shape->dim(0).dim_value()), extra_attributes);
utils::SetNodeAttribute(utils::MakeAttribute("N", weight_shape->dim(1).dim_value()), extra_attributes);
```
Address Copilot review: derive K and N from the already-loaded `weight_tensor_proto->dims()` rather than `weight_arg->Shape()`, which avoids an unnecessary `NodeArg::Shape()` dereference. Also adds an explicit rank >= 2 check on the tensor proto.
vraspar
left a comment
Review assisted by GitHub Copilot CLI.
Thanks for the well-structured PR — clean separation between selector validation and action transformation, solid sub-byte zp handling, and good test reduction rationale.
Remaining Issues
Bug: Null-pointer dereference in `ExtraAttributes` (qdq_actions.cc:623)
The recent commit 61491e5 fixed the same class of issue in `TransposeDQWeightsForMatMulNBits` (nice!), but `ExtraAttributes` still dereferences `weight_arg->Shape()` without a null guard (lines 623-626). `NodeArg::Shape()` can return null even for constant initializers. It should get the same treatment: derive `K`/`N` from the weight initializer proto, or add a null guard.
Lint: Missing `#include <algorithm>` (qdq_actions.cc:70)
CI already flagged this — `std::min(K, kMaxBlockSize)` needs `<algorithm>`.
Test Coverage Gaps
- `block_size=-1` (heuristic) has no end-to-end test. The `ComputeEffectiveBlockSize` heuristic logic and session option exist but are never exercised. Suggest adding a test that sets `block_size=-1` and verifies the chosen `block_size` attribute on the fused MatMulNBits node (similar to the existing `DQMatMulPerTensorWithBlockSizeOption` test).
- Gemm + per-tensor/per-channel are not tested together. All three Gemm tests use blockwise DQ, so the scale/zp expansion code and the Gemm bias wiring are never exercised in combination. A `DQGemmPerTensorWithBias` test would close this gap.
- FP16 scales with per-tensor/per-channel are untested. The `MLFloat16` expansion branch exists, but only float32 scales are tested for the non-blockwise paths.
Minor nits
- `ComputeEffectiveBlockSize` with `session_block_size=-1` and `K < 16` returns `block_size=16 > K`. This works (`quant_num=1`, padding via `memset`), but a brief comment documenting this edge case would help readers.
- Trailing whitespace on the `ValidateGemmForDQMatMulNBits` function signature line (qdq_selectors.cc).
vraspar
left a comment
Minor nits, but the PR looks good.
Extends the QDQ `DQMatMulToMatMulNBits` fusion to handle additional quantization patterns beyond the existing blockwise DQ→MatMul case.

New support
- Gemm, with validation of its attributes (`alpha`, `beta`, `transB`) and optional bias passthrough.
- Per-tensor and per-channel quantization, with the effective block size configurable via `session.qdq_matmulnbits_block_size` (default: 32).

Changes
- Replaced `ValidateBlockwiseDQForMatMulNBits` with `ValidateDQForMatMulNBits`, supporting all three quantization modes. Added Gemm-specific validation.
- Registered `Gemm` alongside `MatMul`; threaded `qdq_matmulnbits_block_size` from session config.