Extend DQ→MatMulNBits fusion to support Gemm + per-tensor/per-channel quantization#27769

Merged: jambayk merged 7 commits into main from jambayk/mnb-tensor-channel-gemm on Mar 24, 2026

Conversation

@jambayk (Contributor) commented Mar 19, 2026

Extends the QDQ DQMatMulToMatMulNBits fusion to handle additional quantization patterns beyond the existing blockwise DQ→MatMul case.

New support

  • Gemm: Fuses DQ→Gemm (with optional bias, including DQ bias) into MatMulNBits, stripping Gemm-specific attributes (alpha, beta, transB).
  • Per-tensor & per-channel quantization: Expands scalar/1D scales and zero-points into block-quantized format expected by MatMulNBits. Block size is configurable via session.qdq_matmulnbits_block_size (default: 32).

Changes

  • Selectors (qdq_selectors.cc): Replaced ValidateBlockwiseDQForMatMulNBits with ValidateDQForMatMulNBits supporting all three quantization modes. Added Gemm-specific validation.
  • Actions (qdq_actions.cc): Added scale/zp expansion for non-blockwise cases, Gemm attribute cleanup, and bias wiring to MatMulNBits input 5.
  • Registration (qdq_selector_action_transformer.cc): Registered Gemm alongside MatMul; threaded qdq_matmulnbits_block_size from session config.
  • Tests (qdq_matmulnbits_transformer_test.cc): Added tests for per-tensor, per-channel, Gemm (no bias, constant bias, DQ bias), block size options, and negative cases.

@jambayk jambayk requested a review from Copilot March 19, 2026 17:40
Copilot AI left a comment

Pull request overview

Extends the existing DQ→MatMulNBits fusion to cover Gemm and non-blockwise quantization (per-tensor/per-channel), adding a new session option to control the effective block size when expanding scales/zero-points.

Changes:

  • Add selector validation for Gemm and for per-tensor/per-channel DQ patterns (with optional bias for Gemm).
  • Add action logic to expand non-blockwise scales/zero-points into MatMulNBits’ expected blockwise format, wire Gemm bias into MatMulNBits, and plumb a new block-size session option.
  • Add extensive new unit tests covering newly supported and negative patterns.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Per-file summary:

  • onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc: Adds tests for per-tensor/per-channel DQ→MatMul, DQ→Gemm (with/without bias), the block size option, and negative cases.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h: Updates the selector contract and adds an override to keep the bias DQ out of the removal set.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc: Implements generalized DQ validation, Gemm attribute validation, and selection updates for Gemm/bias DQ.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h: Extends the transformer constructor to accept the new block-size option.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc: Registers Gemm for the fusion and threads the block-size session option through the registry/action.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h: Extends the fusion action to carry the block-size-for-non-blockwise configuration.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc: Adds block-size computation, expansion for per-tensor/per-channel scale/zp, and Gemm attribute/bias handling.
  • onnxruntime/core/optimizer/graph_transformer_utils.cc: Reads the new session option and passes it into the QDQ selector/action transformer.
  • include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h: Documents the new session.qdq_matmulnbits_block_size option.

jambayk added 4 commits March 19, 2026 19:32

MakeInitializer(shape, T(1,0), T(1,0)) calls Uniform with min == max,
which constructs uniform_int_distribution(1, 0) (since Uniform uses the
half-open range [min, max)). This triggers an assertion failure on both
MSVC and GCC. Use the explicit data overload instead:
MakeInitializer<T>({}, {T(1,0)}).
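The failure mode described in this commit can be reproduced in miniature. The helper below is illustrative (UniformHalfOpen stands in for the test utility's behavior; it is not ORT code): sampling from the half-open range [min, max) via uniform_int_distribution(min, max - 1) violates the distribution's a <= b precondition whenever min == max.

```cpp
#include <cassert>
#include <random>

// Illustrative sketch: a half-open-range sampler built on
// std::uniform_int_distribution, whose constructor requires a <= b.
// With min == max the naive construction would be dist(min, min - 1),
// e.g. dist(1, 0), which asserts on MSVC/GCC debug builds; guarding
// the precondition up front makes the failure explicit.
int UniformHalfOpen(int min, int max, std::mt19937& gen) {
  assert(min < max && "half-open range [min, max) must be non-empty");
  std::uniform_int_distribution<int> dist(min, max - 1);  // inclusive bounds
  return dist(gen);
}
```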
Shrink per-tensor test dimensions from 768x768/768x3072 to 96x96/96x384;
the same code paths are exercised regardless of matrix size.

Remove symmetric type/zp combos where one signed+zp and one unsigned-no-zp
call sufficiently covers the code paths. Reduces Run* calls from 172 to 153
(38 fewer inference sessions).

Trim redundant type/zp/accuracy_level permutations in negative tests
(NonConstDQ, FirstDQInput, ShapeMismatch) where the rejection logic
doesn't depend on these parameters. Also trim new per-tensor,
per-channel, Gemm, and uint8 tests to representative combinations.
Session-creating invocations: 153 -> 63 (a 59% reduction); all 34
DQMatMul/DQGemm tests still pass.
Copilot AI left a comment


Pull request overview

Extends the existing DQ→MatMulNBits fusion to cover more real-world patterns by supporting Gemm and non-blockwise (per-tensor/per-channel) DequantizeLinear quantization parameters.

Changes:

  • Generalized DQ validation to accept blockwise, per-tensor, and per-channel quantization; added Gemm attribute validation.
  • Updated selector/action wiring to fuse DQ→(MatMul|Gemm) into MatMulNBits, including bias passthrough for Gemm and session-configured block size expansion.
  • Added extensive new tests for per-tensor/per-channel and Gemm variants plus block-size option behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Per-file summary:

  • onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc: Adds new fusion tests for per-tensor/per-channel and Gemm (bias variants), and trims redundant negative test permutations.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h: Updates the selector contract/comments and adds an UpdateBuilder override to avoid removing the Gemm bias DQ.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc: Implements generalized DQ validation, Gemm attribute checks, Gemm support in DQ→MatMulNBits selection, and UpdateBuilder trimming.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h: Extends the transformer constructor API to accept the MatMulNBits block-size session option.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc: Threads the block-size option through rule registration and registers Gemm alongside MatMul for this fusion.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h: Extends the action API to accept the block size for non-blockwise DQ expansion.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc: Adds block-size computation, expands per-tensor/per-channel scale/zp to the blockwise format, wires the Gemm bias to MatMulNBits, and updates the transpose logic.
  • onnxruntime/core/optimizer/graph_transformer_utils.cc: Parses and passes the new session.qdq_matmulnbits_block_size config into the QDQ transformer.
  • include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h: Documents the new session option key for configuring the MatMulNBits block size.
Comments suppressed due to low confidence (1)

onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc:624

  • ExtraAttributes also assumes the weight NodeArg has a non-null shape with concrete dim values. That assumption is not guaranteed and can cause crashes during transformation. Recommended fix: read K/N from the weight constant initializer’s TensorProto dims (which the selector already requires), or otherwise safely fall back to NodeArg::Shape() only when present and fully-defined.
  const auto* weight_shape = dq_node->InputDefs()[0]->Shape();

  utils::SetNodeAttribute(utils::MakeAttribute("K", weight_shape->dim(0).dim_value()), extra_attributes);
  utils::SetNodeAttribute(utils::MakeAttribute("N", weight_shape->dim(1).dim_value()), extra_attributes);

Address Copilot review: derive K and N from the already-loaded
weight_tensor_proto->dims() rather than weight_arg->Shape(), which
avoids an unnecessary NodeArg::Shape() dereference. Also adds an
explicit rank >= 2 check on the tensor proto.
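The shape-reading pattern from that fix can be sketched as follows. This is a simplified illustration (GetKN is a hypothetical helper, and the dims vector stands in for weight_tensor_proto->dims()); the point is to take dimensions from data the selector already validated, with an explicit rank guard instead of an unchecked NodeArg::Shape() dereference.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: derive K and N from the weight initializer's dims
// (already loaded when the action runs) rather than from NodeArg::Shape(),
// which may be null even for constant initializers. Returns false instead
// of crashing when the weight is not at least 2-D.
bool GetKN(const std::vector<int64_t>& dims, int64_t& K, int64_t& N) {
  if (dims.size() < 2) return false;  // explicit rank >= 2 guard
  K = dims[0];  // rows of the [K, N] weight
  N = dims[1];  // columns of the [K, N] weight
  return true;
}
```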
@jambayk jambayk enabled auto-merge (squash) March 23, 2026 21:59
@vraspar (Contributor) left a comment

Review assisted by GitHub Copilot CLI.

Thanks for the well-structured PR — clean separation between selector validation and action transformation, solid sub-byte zp handling, and good test reduction rationale.

Remaining Issues

Bug: Null-pointer dereference in ExtraAttributes (qdq_actions.cc:623)

The recent commit 61491e5 fixed the same class of issue in TransposeDQWeightsForMatMulNBits (nice!), but ExtraAttributes still dereferences weight_arg->Shape() without a null guard (line 623-626). NodeArg::Shape() can return null even for constant initializers. Should get the same treatment — derive K/N from the weight initializer proto, or add a null guard.

Lint: Missing #include <algorithm> (qdq_actions.cc:70)

CI already flagged this — std::min(K, kMaxBlockSize) needs <algorithm>.

Test Coverage Gaps

  1. block_size=-1 (heuristic) has no end-to-end test. The ComputeEffectiveBlockSize heuristic logic and session option exist but are never exercised. Suggest adding a test that sets block_size=-1 and verifies the chosen block_size attribute on the fused MatMulNBits node (similar to the existing DQMatMulPerTensorWithBlockSizeOption test).

  2. Gemm + per-tensor/per-channel not tested together. All 3 Gemm tests use blockwise DQ. The scale/zp expansion code + Gemm bias wiring are never exercised in combination. A DQGemmPerTensorWithBias test would close this gap.

  3. FP16 scales with per-tensor/per-channel untested. The MLFloat16 expansion branch exists but only float32 scales are tested for non-blockwise paths.

Minor nits

  • ComputeEffectiveBlockSize with session_block_size=-1 and K < 16: returns block_size=16 > K. This works (quant_num=1, padding via memset), but a brief comment documenting this edge case would help readers.
  • Trailing whitespace on ValidateGemmForDQMatMulNBits function signature line (qdq_selectors.cc).
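One plausible shape for the heuristic discussed in the first nit above, written out so the K < 16 edge case is visible. This is an assumption-laden sketch, not ORT's actual ComputeEffectiveBlockSize: the kMaxBlockSize value and the exact clamp are hypothetical, chosen only to match the behavior the review describes (explicit option wins; otherwise clamp to at least 16, so K < 16 yields block_size = 16 > K, which still works because quant_num becomes 1 and the tail is zero-padded).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int64_t kMaxBlockSize = 128;  // assumed cap, for illustration only

// Illustrative sketch: with session_block_size = -1, derive the effective
// block size from K, clamped to [16, kMaxBlockSize]. For K < 16 this
// intentionally returns 16 > K; downstream code handles that by using a
// single block per column and zero-padding the remainder.
int64_t ComputeEffectiveBlockSize(int64_t session_block_size, int64_t K) {
  if (session_block_size > 0) return session_block_size;  // explicit option wins
  return std::max<int64_t>(16, std::min(K, kMaxBlockSize));
}
```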

vraspar previously approved these changes Mar 23, 2026

@vraspar (Contributor) left a comment
Minor nits, but the PR looks good.

@jambayk jambayk merged commit 37b863c into main Mar 24, 2026
94 of 96 checks passed
@jambayk jambayk deleted the jambayk/mnb-tensor-channel-gemm branch March 24, 2026 10:16