Extend DQ→MatMulNBits fusion to support Gemm + per-tensor/per-channel quantization#27769

Merged: jambayk merged 7 commits into main from jambayk/mnb-tensor-channel-gemm on Mar 24, 2026

Conversation

@jambayk (Contributor) commented Mar 19, 2026

Extends the QDQ DQMatMulToMatMulNBits fusion to handle additional quantization patterns beyond the existing blockwise DQ→MatMul case.

New support

  • Gemm: Fuses DQ→Gemm (with optional bias, including DQ bias) into MatMulNBits, stripping Gemm-specific attributes (alpha, beta, transB).
  • Per-tensor & per-channel quantization: Expands scalar/1D scales and zero-points into block-quantized format expected by MatMulNBits. Block size is configurable via session.qdq_matmulnbits_block_size (default: 32).

Changes

  • Selectors (qdq_selectors.cc): Replaced ValidateBlockwiseDQForMatMulNBits with ValidateDQForMatMulNBits supporting all three quantization modes. Added Gemm-specific validation.
  • Actions (qdq_actions.cc): Added scale/zp expansion for non-blockwise cases, Gemm attribute cleanup, and bias wiring to MatMulNBits input 5.
  • Registration (qdq_selector_action_transformer.cc): Registered Gemm alongside MatMul; threaded qdq_matmulnbits_block_size from session config.
  • Tests (qdq_matmulnbits_transformer_test.cc): Added tests for per-tensor, per-channel, Gemm (no bias, constant bias, DQ bias), block size options, and negative cases.

@jambayk jambayk requested a review from Copilot March 19, 2026 17:40
Copilot AI left a comment

Pull request overview

Extends the existing DQ→MatMulNBits fusion to cover Gemm and non-blockwise quantization (per-tensor/per-channel), adding a new session option to control the effective block size when expanding scales/zero-points.

Changes:

  • Add selector validation for Gemm and for per-tensor/per-channel DQ patterns (with optional bias for Gemm).
  • Add action logic to expand non-blockwise scales/zero-points into MatMulNBits’ expected blockwise format, wire Gemm bias into MatMulNBits, and plumb a new block-size session option.
  • Add extensive new unit tests covering newly supported and negative patterns.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Per-file summary:

  • onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc: Adds tests for per-tensor/per-channel DQ→MatMul, DQ→Gemm (with/without bias), the block size option, and negative cases.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h: Updates the selector contract and adds an override to keep the bias DQ out of the removal set.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc: Implements generalized DQ validation, Gemm attribute validation, and selection updates for Gemm/bias DQ.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h: Extends the transformer constructor to accept the new block-size option.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc: Registers Gemm for the fusion and threads the block-size session option through the registry/action.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h: Extends the fusion action to carry the block-size-for-non-blockwise configuration.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc: Adds block-size computation, expansion for per-tensor/per-channel scale/zp, and Gemm attribute/bias handling.
  • onnxruntime/core/optimizer/graph_transformer_utils.cc: Reads the new session option and passes it into the QDQ selector/action transformer.
  • include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h: Documents the new session.qdq_matmulnbits_block_size option.

jambayk added 4 commits March 19, 2026 19:32

MakeInitializer(shape, T(1,0), T(1,0)) calls Uniform with min == max,
which constructs uniform_int_distribution(1, 0) (since Uniform uses the
half-open range [min, max)). This triggers an assertion failure on both
MSVC and GCC. Use the explicit data overload instead:
MakeInitializer<T>({}, {T(1,0)}).
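The failure mode described in this commit can be reproduced in miniature. The helper below is illustrative (UniformHalfOpen stands in for the test utility's behavior; it is not ORT code): sampling from the half-open range [min, max) via uniform_int_distribution(min, max - 1) violates the distribution's a <= b precondition whenever min == max.

```cpp
#include <cassert>
#include <random>

// Illustrative sketch: a half-open-range sampler built on
// std::uniform_int_distribution, whose constructor requires a <= b.
// With min == max the naive construction would be dist(min, min - 1),
// e.g. dist(1, 0), which asserts on MSVC/GCC debug builds; guarding
// the precondition up front makes the failure explicit.
int UniformHalfOpen(int min, int max, std::mt19937& gen) {
  assert(min < max && "half-open range [min, max) must be non-empty");
  std::uniform_int_distribution<int> dist(min, max - 1);  // inclusive bounds
  return dist(gen);
}
```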
Shrink per-tensor test dimensions from 768x768/768x3072 to 96x96/96x384;
the same code paths are exercised regardless of matrix size.

Remove symmetric type/zp combos where one signed+zp and one unsigned-no-zp
call sufficiently covers the code paths. Reduces Run* calls from 172 to 153
(38 fewer inference sessions).

Trim redundant type/zp/accuracy_level permutations in negative tests
(NonConstDQ, FirstDQInput, ShapeMismatch) where the rejection logic
doesn't depend on these parameters. Also trim new per-tensor,
per-channel, Gemm, and uint8 tests to representative combinations.
Session-creating invocations: 153 -> 63 (a 59% reduction); all 34
DQMatMul/DQGemm tests still pass.
Copilot AI left a comment


Pull request overview

Extends the existing DQ→MatMulNBits fusion to cover more real-world patterns by supporting Gemm and non-blockwise (per-tensor/per-channel) DequantizeLinear quantization parameters.

Changes:

  • Generalized DQ validation to accept blockwise, per-tensor, and per-channel quantization; added Gemm attribute validation.
  • Updated selector/action wiring to fuse DQ→(MatMul|Gemm) into MatMulNBits, including bias passthrough for Gemm and session-configured block size expansion.
  • Added extensive new tests for per-tensor/per-channel and Gemm variants plus block-size option behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Per-file summary:

  • onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc: Adds new fusion tests for per-tensor/per-channel and Gemm (bias variants), and trims redundant negative test permutations.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h: Updates the selector contract/comments and adds an UpdateBuilder override to avoid removing the Gemm bias DQ.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc: Implements generalized DQ validation, Gemm attribute checks, Gemm support in DQ→MatMulNBits selection, and UpdateBuilder trimming.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h: Extends the transformer constructor API to accept the MatMulNBits block-size session option.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc: Threads the block-size option through rule registration and registers Gemm alongside MatMul for this fusion.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h: Extends the action API to accept the block size for non-blockwise DQ expansion.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc: Adds block-size computation, expands per-tensor/per-channel scale/zp to the blockwise format, wires the Gemm bias to MatMulNBits, and updates the transpose logic.
  • onnxruntime/core/optimizer/graph_transformer_utils.cc: Parses and passes the new session.qdq_matmulnbits_block_size config into the QDQ transformer.
  • include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h: Documents the new session option key for configuring the MatMulNBits block size.
Comments suppressed due to low confidence (1)

onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc:624

  • ExtraAttributes also assumes the weight NodeArg has a non-null shape with concrete dim values. That assumption is not guaranteed and can cause crashes during transformation. Recommended fix: read K/N from the weight constant initializer’s TensorProto dims (which the selector already requires), or otherwise safely fall back to NodeArg::Shape() only when present and fully-defined.
  const auto* weight_shape = dq_node->InputDefs()[0]->Shape();

  utils::SetNodeAttribute(utils::MakeAttribute("K", weight_shape->dim(0).dim_value()), extra_attributes);
  utils::SetNodeAttribute(utils::MakeAttribute("N", weight_shape->dim(1).dim_value()), extra_attributes);

Address Copilot review: derive K and N from the already-loaded
weight_tensor_proto->dims() rather than weight_arg->Shape(), which
avoids an unnecessary NodeArg::Shape() dereference. Also adds an
explicit rank >= 2 check on the tensor proto.
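The shape-reading pattern from that fix can be sketched as follows. This is a simplified illustration (GetKN is a hypothetical helper, and the dims vector stands in for weight_tensor_proto->dims()); the point is to take dimensions from data the selector already validated, with an explicit rank guard instead of an unchecked NodeArg::Shape() dereference.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: derive K and N from the weight initializer's dims
// (already loaded when the action runs) rather than from NodeArg::Shape(),
// which may be null even for constant initializers. Returns false instead
// of crashing when the weight is not at least 2-D.
bool GetKN(const std::vector<int64_t>& dims, int64_t& K, int64_t& N) {
  if (dims.size() < 2) return false;  // explicit rank >= 2 guard
  K = dims[0];  // rows of the [K, N] weight
  N = dims[1];  // columns of the [K, N] weight
  return true;
}
```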
@jambayk jambayk enabled auto-merge (squash) March 23, 2026 21:59
@vraspar (Contributor) left a comment

Review assisted by GitHub Copilot CLI.

Thanks for the well-structured PR — clean separation between selector validation and action transformation, solid sub-byte zp handling, and good test reduction rationale.

Remaining Issues

Bug: Null-pointer dereference in ExtraAttributes (qdq_actions.cc:623)

The recent commit 61491e5 fixed the same class of issue in TransposeDQWeightsForMatMulNBits (nice!), but ExtraAttributes still dereferences weight_arg->Shape() without a null guard (line 623-626). NodeArg::Shape() can return null even for constant initializers. Should get the same treatment — derive K/N from the weight initializer proto, or add a null guard.

Lint: Missing #include <algorithm> (qdq_actions.cc:70)

CI already flagged this — std::min(K, kMaxBlockSize) needs <algorithm>.

Test Coverage Gaps

  1. block_size=-1 (heuristic) has no end-to-end test. The ComputeEffectiveBlockSize heuristic logic and session option exist but are never exercised. Suggest adding a test that sets block_size=-1 and verifies the chosen block_size attribute on the fused MatMulNBits node (similar to the existing DQMatMulPerTensorWithBlockSizeOption test).

  2. Gemm + per-tensor/per-channel not tested together. All 3 Gemm tests use blockwise DQ. The scale/zp expansion code + Gemm bias wiring are never exercised in combination. A DQGemmPerTensorWithBias test would close this gap.

  3. FP16 scales with per-tensor/per-channel untested. The MLFloat16 expansion branch exists but only float32 scales are tested for non-blockwise paths.

Minor nits

  • ComputeEffectiveBlockSize with session_block_size=-1 and K < 16: returns block_size=16 > K. This works (quant_num=1, padding via memset), but a brief comment documenting this edge case would help readers.
  • Trailing whitespace on ValidateGemmForDQMatMulNBits function signature line (qdq_selectors.cc).
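One plausible shape for the heuristic discussed in the first nit above, written out so the K < 16 edge case is visible. This is an assumption-laden sketch, not ORT's actual ComputeEffectiveBlockSize: the kMaxBlockSize value and the exact clamp are hypothetical, chosen only to match the behavior the review describes (explicit option wins; otherwise clamp to at least 16, so K < 16 yields block_size = 16 > K, which still works because quant_num becomes 1 and the tail is zero-padded).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int64_t kMaxBlockSize = 128;  // assumed cap, for illustration only

// Illustrative sketch: with session_block_size = -1, derive the effective
// block size from K, clamped to [16, kMaxBlockSize]. For K < 16 this
// intentionally returns 16 > K; downstream code handles that by using a
// single block per column and zero-padding the remainder.
int64_t ComputeEffectiveBlockSize(int64_t session_block_size, int64_t K) {
  if (session_block_size > 0) return session_block_size;  // explicit option wins
  return std::max<int64_t>(16, std::min(K, kMaxBlockSize));
}
```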

vraspar previously approved these changes Mar 23, 2026

@vraspar (Contributor) left a comment
Minor nits, but the PR looks good.

@jambayk jambayk merged commit 37b863c into main Mar 24, 2026
94 of 96 checks passed
@jambayk jambayk deleted the jambayk/mnb-tensor-channel-gemm branch March 24, 2026 10:16