
[Metal] Reject tensor-scale nvfp4 in qqmm #3551

Open

Brooooooklyn wants to merge 1 commit into ml-explore:main from mlx-node:fix/metal-qqmm-global-scale-guard

Conversation

@Brooooooklyn (Contributor) commented on May 14, 2026

Summary

QQMatmul::eval_gpu on Metal silently dropped global_scale_x / global_scale_w in its gemv special case (pre-quantized w, x.shape(-2) == 1), producing numerically incorrect results when tensor-scale nvfp4 weights were in use. The general (non-gemv) case already throws [QQMatmul] NYI. qqmm() was missed when tensor-scale nvfp4 landed in #3022.

This change adds a backend-level guard at the top of QQMatmul::eval_gpu in mlx/backend/metal/quantized.cpp: when mode_ == Nvfp4 and extra inputs beyond (x, w[, scales_w]) are present (i.e. global scales were packed by ops.cpp), it throws std::runtime_error (a Python RuntimeError). This matches the local throw style used by the existing NYI path and keeps the device-specific check out of the backend-agnostic ops.cpp; CUDA is unaffected.
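
For illustration, below is a minimal, self-contained sketch of the guard. It is not the actual contents of mlx/backend/metal/quantized.cpp: the mode_ member, the Nvfp4 mode, and the (x, w[, scales_w]) input layout come from the description above, while the stand-in types, the other enumerator name, and the exact arity check are assumptions made for the sketch.

```cpp
#include <stdexcept>
#include <vector>

// Stand-ins so the sketch compiles on its own; the real code uses
// mlx::core::array and MLX's own quantization-mode enum.
struct Array {};
enum class QuantizationMode { PerGroup, Nvfp4 };

struct QQMatmul {
  QuantizationMode mode_;

  void eval_gpu(const std::vector<Array>& inputs, Array& out) {
    // ops.cpp packs global_scale_x / global_scale_w as extra inputs after
    // (x, w[, scales_w]); the Metal gemv special case would silently drop
    // them, so reject tensor-scale nvfp4 up front. The arity threshold of 3
    // is an assumption about how many "base" inputs can be present.
    if (mode_ == QuantizationMode::Nvfp4 && inputs.size() > 3) {
      throw std::runtime_error(
          "[QQMatmul] Tensor-scale (global_scale) nvfp4 is not supported "
          "on the Metal backend.");
    }
    (void)out;  // gemv special case / general-case NYI throw would follow
  }
};
```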

Per-group nvfp4 (no global_scale) on Metal continues to work — that path is exercised by the existing test_qqmv and is unchanged.

Fixes #3550.

Test plan

  • Added test_qqmm_metal_global_scale_rejected in python/tests/test_quantized.py, which asserts that RuntimeError is raised when mx.qqmm is evaluated on Metal with both global scales set. Verified the test fails on main (the call silently runs to completion) and passes with this fix.
  • The full python/tests/test_quantized.py suite (29 tests) still passes locally on Apple Silicon, including test_qqmv, which exercises the gemv branch being guarded.

@Brooooooklyn force-pushed the fix/metal-qqmm-global-scale-guard branch from 20f4211 to 0964646 on May 14, 2026 07:26

@zcbenz (Collaborator) left a comment

This check should be done in QQMatmul::eval_gpu; generally we should not check backend types in ops.cpp.

Commit message:

QQMatmul::eval_gpu on Metal silently dropped global_scale_x /
global_scale_w in the gemv special case (pre-quantized w, M==1),
producing numerically incorrect results when tensor-scale nvfp4
weights were in use. The general case already throws NYI.

Add a backend-level guard at the top of QQMatmul::eval_gpu that
rejects nvfp4 with global scales packed into inputs, matching the
local throw style and keeping the check out of the backend-agnostic
ops.cpp.

Fixes ml-explore#3550.

@Brooooooklyn force-pushed the fix/metal-qqmm-global-scale-guard branch from 0964646 to eb9a4d0 on May 17, 2026 02:55

@Brooooooklyn (Contributor, Author) commented:

Updated

@Brooooooklyn requested a review from zcbenz on May 17, 2026 08:38

Development

Successfully merging this pull request may close this issue:

Metal: QQMatmul::eval_gpu gemv path silently drops global_scale_x/global_scale_w for nvfp4 (#3550)

2 participants