[CUDA] Quantized GEMV by zcbenz · Pull Request #3180 · ml-explore/mlx

zcbenz · 2026-02-27T04:19:29Z

Implements a qmv kernel using CUTLASS to do vectorized dequantization and fma, which works for all types of quants.

This kernel is fast for small problems for FP32xINT8, measured on A100:

M	N	K	QMV (TFlop/s)	CUBLAS (TFlop/s)	QMV (GiB/s)	CUBLAS (GiB/s)	Speedup (x)
1	4096	4096	2.40	0.71	1351.7	1414.2	3.39
1	8192	8192	2.26	0.85	1272.7	1694.4	2.67
1	16384	16384	2.47	0.87	1389.7	1749.5	2.82

The memory bandwidth is somehow lower for FP16xINT8:

M	N	K	QMV (TFlop/s)	CUBLAS (TFlop/s)	QMV (GiB/s)	CUBLAS (GiB/s)	Speedup (x)
1	4096	4096	2.10	1.47	1118.1	1466.7	1.43
1	8192	8192	1.95	1.49	1039.1	1494.6	1.31

Independent C++ source code for profiling the kernel

The memory bandwidth drops to half for FP8/FP4/INT4 quants unfortunately, which is likely because CUTLASS does not implement fast vectorized conversions for them. We can fix it by writing specializations of dequant_fma and I'll continue in followup PRs.

This PR also does some refactoring to dispatch quantized_mamtul to the fastest kernel depending on the problems size. For now we still prefer fp_qmv over qmv for FP8/FP4 quants but eventually I will merge fp_qmv into qmv.

mlx/backend/cuda/quantized/qmm/qmm.cpp

jagrit06 · 2026-02-27T19:46:49Z

Just so I understand the comparison - in table 1, the cublas is doing FP32xFP32 and in table 2 cublas is doing FP16xFP16 ?

zcbenz · 2026-02-27T22:10:50Z

Just so I understand the comparison - in table 1, the cublas is doing FP32xFP32 and in table 2 cublas is doing FP16xFP16 ?

Yeah cublas was measured with activation dtype.

jagrit06 reviewed Feb 27, 2026

View reviewed changes

mlx/backend/cuda/quantized/qmm/qmm.cpp Outdated Show resolved Hide resolved

[CUDA] Quantized GEMV

91b0ff4

zcbenz force-pushed the qmv branch from 126500f to 91b0ff4 Compare February 27, 2026 22:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CUDA] Quantized GEMV#3180

[CUDA] Quantized GEMV#3180
zcbenz wants to merge 1 commit intoml-explore:mainfrom
zcbenz:qmv

zcbenz commented Feb 27, 2026

Uh oh!

Uh oh!

jagrit06 commented Feb 27, 2026

Uh oh!

zcbenz commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zcbenz commented Feb 27, 2026

Uh oh!

Uh oh!

jagrit06 commented Feb 27, 2026

Uh oh!

zcbenz commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants