
[CUDA] Quantized GEMV #3180

Open

zcbenz wants to merge 1 commit into ml-explore:main from zcbenz:qmv

Conversation

@zcbenz
Collaborator

@zcbenz zcbenz commented Feb 27, 2026

Refs #2536.

Implements a qmv kernel that uses CUTLASS to do vectorized dequantization and FMA, and works for all quantization types.
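For readers unfamiliar with the dequantize-then-FMA pattern, here is a CPU reference sketch of what the kernel computes for affine-quantized weights. The names (`quantized_gemv`, `group_size`) and the per-group scale/bias layout are illustrative assumptions, not the actual kernel's API; the CUDA kernel performs the same math with vectorized CUTLASS conversions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// CPU reference for an affine-quantized GEMV: y = dequant(W_q) * x,
// where each group of `group_size` INT8 weights in a row shares one
// scale and one bias. Layout and names are illustrative only.
std::vector<float> quantized_gemv(
    const std::vector<int8_t>& w_q,    // N x K quantized weights, row-major
    const std::vector<float>& scales,  // N x (K / group_size)
    const std::vector<float>& biases,  // N x (K / group_size)
    const std::vector<float>& x,       // K activations
    int N, int K, int group_size) {
  std::vector<float> y(N, 0.0f);
  int groups = K / group_size;
  for (int n = 0; n < N; ++n) {
    float acc = 0.0f;
    for (int g = 0; g < groups; ++g) {
      float s = scales[n * groups + g];
      float b = biases[n * groups + g];
      for (int i = 0; i < group_size; ++i) {
        int k = g * group_size + i;
        // Dequantize the weight, then a fused multiply-add into the
        // accumulator; the CUDA kernel vectorizes both steps.
        acc += (s * static_cast<float>(w_q[n * K + k]) + b) * x[k];
      }
    }
    y[n] = acc;
  }
  return y;
}
```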

This kernel is fast for small problems with FP32xINT8, measured on an A100:

| M | N | K | QMV (TFlop/s) | CUBLAS (TFlop/s) | QMV (GiB/s) | CUBLAS (GiB/s) | Speedup (x) |
|---|------|------|------|------|--------|--------|------|
| 1 | 4096 | 4096 | 2.40 | 0.71 | 1351.7 | 1414.2 | 3.39 |
| 1 | 8192 | 8192 | 2.26 | 0.85 | 1272.7 | 1694.4 | 2.67 |
| 1 | 16384 | 16384 | 2.47 | 0.87 | 1389.7 | 1749.5 | 2.82 |
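For an M=1 GEMV the TFlop/s and GiB/s columns can be derived from the problem size and elapsed time roughly as below. This sketch assumes the bytes counted are the quantized weight matrix plus the input/output vectors; the benchmark's exact byte accounting (e.g. whether scales/biases are included) may differ.

```cpp
#include <cassert>
#include <cmath>

// Rough throughput metrics for an M=1 quantized GEMV.
struct Throughput {
  double tflops;  // arithmetic throughput, TFlop/s
  double gibs;    // effective memory bandwidth, GiB/s
};

Throughput gemv_throughput(long N, long K, int weight_bytes,
                           int act_bytes, double seconds) {
  // One multiply and one add per weight element.
  double flops = 2.0 * static_cast<double>(N) * K;
  // Assumed traffic: quantized weights + input vector + output vector.
  double bytes = static_cast<double>(N) * K * weight_bytes +
                 static_cast<double>(K + N) * act_bytes;
  return {flops / seconds / 1e12,
          bytes / seconds / (1024.0 * 1024.0 * 1024.0)};
}
```

With INT8 weights the weight matrix dominates the traffic, which is why a GEMV this shape is memory-bound and GiB/s is the more meaningful column.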

For FP16xINT8 the memory bandwidth is lower, for reasons not yet understood:

| M | N | K | QMV (TFlop/s) | CUBLAS (TFlop/s) | QMV (GiB/s) | CUBLAS (GiB/s) | Speedup (x) |
|---|------|------|------|------|--------|--------|------|
| 1 | 4096 | 4096 | 2.10 | 1.47 | 1118.1 | 1466.7 | 1.43 |
| 1 | 8192 | 8192 | 1.95 | 1.49 | 1039.1 | 1494.6 | 1.31 |

Independent C++ source code for profiling the kernel

Unfortunately, the memory bandwidth drops to half for FP8/FP4/INT4 quants, likely because CUTLASS does not implement fast vectorized conversions for them. We can fix this by writing specializations of dequant_fma; I'll continue that in follow-up PRs.

This PR also does some refactoring to dispatch quantized_matmul to the fastest kernel depending on the problem size. For now we still prefer fp_qmv over qmv for FP8/FP4 quants, but eventually I will merge fp_qmv into qmv.
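A minimal sketch of what such a dispatch could look like, under stated assumptions: the predicate names, the small-M threshold, and the `"qmm"` fallback name are all hypothetical, and the real dispatch in MLX considers more conditions (dtype, alignment, transposition) than shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical quantization-type tag for illustration.
enum class Quant { INT8, INT4, FP8, FP4 };

// Illustrative dispatch: pick a kernel name from the problem shape and
// quantization type. Threshold and names are assumptions, not MLX's code.
std::string pick_kernel(int M, Quant q) {
  bool is_fp_quant = (q == Quant::FP8 || q == Quant::FP4);
  if (M <= 8) {
    // Matrix-vector shaped problems: fp_qmv is still preferred for
    // FP8/FP4 quants until it is merged into qmv (per the PR text).
    return is_fp_quant ? "fp_qmv" : "qmv";
  }
  // Larger problems go to a GEMM-shaped path.
  return "qmm";
}
```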

@jagrit06
Member

Just so I understand the comparison - in table 1, the cublas is doing FP32xFP32 and in table 2 cublas is doing FP16xFP16 ?

@zcbenz
Collaborator Author

zcbenz commented Feb 27, 2026

Just so I understand the comparison - in table 1, the cublas is doing FP32xFP32 and in table 2 cublas is doing FP16xFP16 ?

Yeah, cuBLAS was measured with the activation dtype.
