@nastya236 nastya236 commented Jan 12, 2026

Column-wise quantization for tensors stored in a column-major layout. This is used when one or both inputs are passed transposed (in qqmm backward):

  • nt layout: used in the VJP to compute dL/dx (the second argument is transposed)
  • tt layout: used in the VJP to compute dL/dw (both arguments are transposed)
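
As a worked sketch of why the backward pass needs a different quantization axis, assume the forward pass computes $y = x w^\top$ with $x \in \mathbb{R}^{M \times K}$, $w \in \mathbb{R}^{N \times K}$ and upstream gradient $g = \partial L / \partial y \in \mathbb{R}^{M \times N}$. This convention is an assumption for illustration and is not stated in the PR:

$$
\frac{\partial L}{\partial x} = g\,w \quad (\text{contraction over } N),
\qquad
\frac{\partial L}{\partial w} = g^\top x \quad (\text{contraction over } M).
$$

Under that assumption the forward product contracts over $K$, while the two gradients contract over $N$ and $M$. If the quantization groups must lie along the contraction dimension, the operands of the backward products have to be quantized along an axis that is not contiguous in their stored layout, which is the case column-wise quantization covers.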

Overview:
Input [M, K] M-major.

  • Each thread processes group_size elements in a column
  • Load to registers
  • Compute scale -> store to shared memory (pad to avoid bank conflicts)
  • Quantize -> store to shared memory (pad to avoid bank conflicts)
  • Write scales to [M, K/group_size] K-major
  • Write quantized values to [M, K/elements_per_byte] K-major
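
The steps above can be sketched as a small pure-Python reference. This is an illustration only, not the kernel from this PR: the function name `quantize_columnwise` is hypothetical, the loops stand in for threads, the shared-memory staging and padding are not modeled, and a simplified signed 4-bit integer code with an absmax scale is swapped in for the real nvfp4 encoding.

```python
def quantize_columnwise(x_m_major, M, K, group_size=16):
    """Quantize an [M, K] matrix stored M-major (element (m, k) at m + k * M).

    Returns (scales, packed), both K-major:
      scales: [M, K / group_size] floats
      packed: [M, K / 2] bytes, two 4-bit codes per byte
    Simplified signed 4-bit integer code (assumption), NOT the nvfp4 format.
    """
    assert K % group_size == 0 and group_size % 2 == 0
    groups = K // group_size
    scales = [0.0] * (M * groups)
    packed = bytearray(M * K // 2)

    def code(v, scale):
        q = max(-7, min(7, round(v / scale)))  # clamp to the 4-bit range
        return q & 0xF                         # two's-complement nibble

    for m in range(M):
        for g in range(groups):                # one (m, g) pair ~ one thread
            k0 = g * group_size
            # Strided read: consecutive k are M elements apart in memory.
            vals = [x_m_major[m + (k0 + i) * M] for i in range(group_size)]
            amax = max(abs(v) for v in vals)
            scale = amax / 7.0 if amax > 0 else 1.0
            scales[m * groups + g] = scale     # K-major scale write
            for i in range(0, group_size, 2):
                lo = code(vals[i], scale)
                hi = code(vals[i + 1], scale)
                packed[(m * K + k0 + i) // 2] = lo | (hi << 4)
    return scales, packed


# Logical rows: [7, -7, 3, 0] and [14, 0, -14, 2], stored M-major.
M, K = 2, 4
x = [7, 14, -7, 0, 3, -14, 0, 2]
scales, packed = quantize_columnwise(x, M, K, group_size=4)
```

The reads are strided by M while the writes are contiguous along K. A GPU kernel would typically assign neighbouring threads to neighbouring values of m so the strided reads coalesce, which is presumably why the results are staged through shared memory before the K-major writes.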

nvfp4 qqmm:

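| M | N | K | layout | diff % |
|---|---|---|---|---|
| 16384 | 11008 | 4096 | nn | 6.3 |
| 16384 | 11008 | 4096 | tn | 15.9 |
| 32768 | 11008 | 4096 | nn | 0.9 |
| 32768 | 11008 | 4096 | tn | 7.8 |
| 16384 | 4096 | 11008 | nn | 6.3 |
| 16384 | 4096 | 11008 | tn | 20.1 |
| 32768 | 4096 | 11008 | nn | 14.5 |
| 32768 | 4096 | 11008 | tn | 11.4 |
| 16384 | 12288 | 4096 | nn | 2.4 |
| 16384 | 12288 | 4096 | tn | 23.8 |
| 32768 | 12288 | 4096 | nn | 1.3 |
| 32768 | 12288 | 4096 | tn | 15.3 |
| 16384 | 4096 | 12288 | nn | 6.4 |
| 16384 | 4096 | 12288 | tn | 20.0 |
| 32768 | 4096 | 12288 | nn | 8.9 |
| 32768 | 4096 | 12288 | tn | 24.9 |
| 16384 | 27648 | 5120 | nn | 11.7 |
| 16384 | 27648 | 5120 | tn | 15.9 |
| 32768 | 27648 | 5120 | nn | 0.7 |
| 32768 | 27648 | 5120 | tn | 9.4 |
| 16384 | 5120 | 27648 | nn | 10.6 |
| 16384 | 5120 | 27648 | tn | 14.2 |
| 32768 | 5120 | 27648 | nn | 2.3 |
| 32768 | 5120 | 27648 | tn | 21.7 |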

The kernel can probably be optimized further.
Note: this PR also fixes a small bug in QQMatmul::output_shape and removes an unused reorder.

@nastya236 nastya236 changed the title from "Columnwise quantize transpose" to "Columnwise quantize" Jan 12, 2026
@nastya236 nastya236 marked this pull request as ready for review January 14, 2026 23:13