[Metal] Reject tensor-scale nvfp4 in qqmm #3551
Open
Brooooooklyn wants to merge 1 commit.
Conversation
Force-pushed 20f4211 to 0964646
zcbenz (Collaborator) reviewed on May 17, 2026:
This check should be done in QQMatmul::eval_gpu; generally we should not check backend types in ops.cpp.
QQMatmul::eval_gpu on Metal silently dropped global_scale_x / global_scale_w in the gemv special case (pre-quantized w, M==1), producing numerically incorrect results when tensor-scale nvfp4 weights were in use. The general case already throws NYI. Add a backend-level guard at the top of QQMatmul::eval_gpu that rejects nvfp4 with global scales packed into inputs, matching the local throw style and keeping the check out of the backend-agnostic ops.cpp. Fixes ml-explore#3550.
Force-pushed 0964646 to eb9a4d0
Brooooooklyn (Contributor, Author): Updated.
Summary
QQMatmul::eval_gpu on Metal silently dropped global_scale_x / global_scale_w in its gemv special case (pre-quantized w, x.shape(-2) == 1), producing numerically incorrect results when tensor-scale nvfp4 weights were in use. The general case already throws [QQMatmul] NYI for the general case. qqmm() was missed when tensor-scale nvfp4 landed in #3022.

This change adds a backend-level guard at the top of QQMatmul::eval_gpu in mlx/backend/metal/quantized.cpp: when mode_ == Nvfp4 and extra inputs beyond (x, w[, scales_w]) are present (i.e. global scales were packed by ops.cpp), it throws std::runtime_error (surfaced as a Python RuntimeError), matching the local throw style used by the existing NYI path and keeping the device-specific check out of the backend-agnostic ops.cpp. CUDA is unaffected.

Per-group nvfp4 (no global_scale) on Metal continues to work; that path is exercised by the existing test_qqmv and is unchanged.

Fixes #3550.
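For reference, here is a minimal standalone sketch of the shape of the guard, not the actual MLX source: the QuantizationMode enum, the Nvfp4 name, and the (x, w[, scales_w]) input layout are assumptions taken from the description above.

```cpp
// Standalone sketch of the backend-level nvfp4 guard described in this PR.
// Names and the input layout are assumptions, not the real MLX definitions.
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

enum class QuantizationMode { Affine, Nvfp4 };

// Stand-in for an input array; only the input count matters for the guard.
struct Array {};

// ops.cpp packs global_scale_x / global_scale_w as extra inputs, so on Metal
// any nvfp4 call with more than (x, w[, scales_w]) inputs must be carrying
// tensor-wide scales and is rejected up front instead of silently dropping them.
void check_metal_nvfp4_inputs(QuantizationMode mode,
                              const std::vector<Array>& inputs) {
  constexpr std::size_t max_inputs_without_global_scales = 3;  // x, w, scales_w
  if (mode == QuantizationMode::Nvfp4 &&
      inputs.size() > max_inputs_without_global_scales) {
    throw std::runtime_error(
        "[QQMatmul] Tensor-scale (global_scale) nvfp4 is not yet supported "
        "on the Metal backend.");
  }
}

int main() {
  std::vector<Array> per_group = {Array{}, Array{}, Array{}};  // x, w, scales_w
  std::vector<Array> tensor_scale = {Array{}, Array{}, Array{}, Array{}, Array{}};

  check_metal_nvfp4_inputs(QuantizationMode::Nvfp4, per_group);       // passes
  try {
    check_metal_nvfp4_inputs(QuantizationMode::Nvfp4, tensor_scale);  // throws
  } catch (const std::runtime_error& e) {
    std::cout << e.what() << "\n";
  }
  return 0;
}
```

Placing the check at the top of eval_gpu keeps it local to the Metal backend, in the same style as the existing NYI throw, rather than branching on backend types in the shared ops.cpp.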
Test plan
Added test_qqmm_metal_global_scale_rejected in python/tests/test_quantized.py; it asserts RuntimeError when mx.qqmm is evaluated on Metal with both global scales set. Verified the test fails on main (it silently runs to completion) and passes with this fix. python/tests/test_quantized.py (29 tests) still passes locally on Apple Silicon, including test_qqmv, which exercises the gemv branch being guarded.