Quantization
These recommendations are constantly evolving as techniques improve and more testing is done. Check back often, or ask in the Discord, to see what is happening now.
For the best accuracy, bfloat16 or GGUF with a Q8 .gguf file is recommended. Everything else has a trade-off between quality and speed.
For the best training speed, use either int W8A8 or GGUF A8 int. int W8A8 is faster than GGUF A8 int. However, if using a smaller GGUF (for example Q4_K_S of a larger model) reduces CPU offloading, GGUF A8 int can be faster overall.
Some models might show visible artifacts in samples when using int W8A8. Of the tested models, only Qwen seems to be affected, and it does not seem to impact training performance**. It can be fixed by enabling SVDQuant, or by using float W8A8 instead. float W8A8 is more accurate but slower than int W8A8; int W8A8 with SVDQuant is still faster than float W8A8 and seems to be accurate enough.
Image: Qwen INT 8 vs INT 8 SVDQuant vs BF16 comparison
All options with A8 are faster only if Compile transformer blocks is ALSO enabled. An Nvidia RTX 40xx or higher card is required to use the float A8 types. The int A8 types require an RTX 30xx or higher card.
Use a Quantization Layer Filter with preset blocks for most models. Some models might not suffer from full quantization, but others do.
** statements based on limited tests / more tests required
bfloat16
Baseline.
float8
Similar speed to bfloat16, but less VRAM. Accurate enough for most models, but a Quantization Layer Filter is still recommended.
Technical: Tensor-wise quantization of weights in float8 e4m3. Dequantized to your Train Data Type during training.
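A minimal sketch of this weight-only quantization, assuming simple absmax scaling (illustration only, not the exact implementation; the function names are made up):

```python
import torch

FP8_MAX = 448.0  # largest value representable in float8 e4m3

def quantize_weight_fp8(weight: torch.Tensor):
    # One scale for the whole tensor: map the largest absolute weight
    # onto the float8 e4m3 maximum.
    scale = weight.abs().max().clamp(min=1e-12) / FP8_MAX
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_weight(w_fp8: torch.Tensor, scale: torch.Tensor, train_dtype=torch.bfloat16):
    # Dequantized back to the Train Data Type before each forward pass.
    return (w_fp8.to(torch.float32) * scale).to(train_dtype)

weight = torch.randn(4096, 4096)
w_fp8, scale = quantize_weight_fp8(weight)    # stored: ~1 byte per weight
w_train = dequantize_weight(w_fp8, scale)     # used in normal bfloat16 matmuls
```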
int8
Similar speed to bfloat16.
Not accurate enough for most models. Accuracy can be improved with SVDQuant and could then be on par with the other data types, but this has not been tested because GGUF models are preferred for low-bit quantization.
Technical: Block-wise quantization of weights. Dequantized to the Train Data Type during training.
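The difference from the tensor-wise sketch above is only the granularity of the scales: each fixed-size block of weights gets its own scale. A rough sketch, assuming int8 values and an arbitrary block size of 64:

```python
import torch

def quantize_blockwise_int8(weight: torch.Tensor, block_size: int = 64):
    # Each block of 64 consecutive values gets its own scale, so a single
    # outlier only hurts the precision of its own block.
    blocks = weight.reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    q = torch.round(blocks / scales).clamp(-127, 127).to(torch.int8)
    return q, scales

def dequantize_blockwise(q, scales, shape, train_dtype=torch.bfloat16):
    return (q.to(torch.float32) * scales).reshape(shape).to(train_dtype)
```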
float W8A8
Much faster. In theory less accurate than float8, but tests show similar training behavior.
Technical: Tensor-wise quantization of weights in float8 e4m3. Activations are quantized token-wise during training, and Linear layer matrix multiplications are performed in float8.
Nvidia RTX 40xx card or newer required.
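A simplified sketch of the W8A8 idea: weights carry one scale per tensor, activations one scale per token, and the Linear matmul runs on the 8-bit values. The function names and absmax scaling are assumptions, and the float8 matmul is emulated in bfloat16 so the sketch runs anywhere; real kernels keep it in float8 on supported GPUs.

```python
import torch

FP8_MAX = 448.0

def quantize_tokenwise_fp8(x: torch.Tensor):
    # One scale per token (per row of the flattened activations).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def quantize_tensorwise_fp8(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def linear_w8a8_float(x, w_fp8, w_scale):
    x_fp8, x_scale = quantize_tokenwise_fp8(x)
    # Emulated: cast back to bfloat16 for the matmul. On an RTX 40xx or
    # newer the multiplication itself is done in float8.
    y = x_fp8.to(torch.bfloat16) @ w_fp8.to(torch.bfloat16).t()
    return y * (x_scale * w_scale)

x = torch.randn(8, 4096, dtype=torch.bfloat16)           # 8 tokens
w_fp8, w_scale = quantize_tensorwise_fp8(torch.randn(4096, 4096))
y = linear_w8a8_float(x, w_fp8, w_scale)
```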
int W8A8
Even faster than float W8A8.
No issues on some models, but visible artifacts with Qwen. The artifacts can be removed with SVDQuant at very little performance cost (still faster than float W8A8). Otherwise similar training behavior.
Technical: Tensor-wise quantization of weights in int8. Activations are quantized token-wise during training, and Linear layer matrix multiplications are performed in int8.
Nvidia RTX 30xx card or newer required.
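The int path mirrors the float sketch above, with values rounded into [-127, 127] instead of cast to float8. The distinctive part is the integer matmul, which accumulates in int32 (emulated here on plain tensors rather than int8 tensor cores; names are illustrative):

```python
import torch

def linear_w8a8_int(x_q: torch.Tensor, x_scale: torch.Tensor,
                    w_q: torch.Tensor, w_scale: torch.Tensor):
    # x_q: int8 activations, quantized token-wise into [-127, 127]
    # w_q: int8 weights, quantized tensor-wise
    # The multiply-accumulate is done entirely in integers; on an
    # RTX 30xx or newer this maps to int8 tensor core instructions.
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
    return (acc.to(torch.float32) * (x_scale * w_scale)).to(torch.bfloat16)
```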
GGUF
Loads a quantized model as-is from a GGUF file. Slower than bfloat16, but with Compile transformer blocks it runs at a similar speed**.
The Quantization Layer Filter has no effect: the GGUF file decides which layers are quantized. SVDQuant cannot be used.
Technical: Weights are quantized according to the GGUF file and dequantized to the Train Data Type during training.
GGUF A8 int
Loads a quantized model as-is from a GGUF file. Much faster at similar accuracy.
The Quantization Layer Filter only affects which layers use activation quantization, not which layers are quantized at all - that is still determined by the GGUF file.
Technical: Weights are quantized according to the GGUF file, then re-quantized from GGUF to int8 axis-wise. Activations are also quantized axis-wise (token-wise) during training, and Linear layer matrix multiplications are performed in int8.
GGUF A8 float
Like GGUF A8 int, but slower and with worse training behavior**. Generally not recommended, but it could be useful for some models.
Technical: Weights are quantized according to the GGUF file, then re-quantized from GGUF to float8 e4m3 axis-wise. Activations are also quantized axis-wise (token-wise) during training, and Linear layer matrix multiplications are performed in float8.
SVDQuant
Improves quantization quality. Can be combined with all quantized data types except GGUF. Recommended for int W8A8 with Qwen, or for other models and data types if you see visible artifacts in samples.
An SVDQuant data type of bfloat16 and a Rank of 16 seem to be enough.
Technical: SVDQuant implementation according to https://arxiv.org/abs/2411.05007
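The core idea, sketched on a single weight matrix: split the weight into a small high-precision low-rank part plus a quantized residual, so the directions that are hardest to quantize stay in 16 bit. This is a simplification of the paper (it omits the smoothing step that migrates activation outliers into the weights), and all names are illustrative:

```python
import torch

def svdquant_weight(weight: torch.Tensor, rank: int = 16, dtype=torch.bfloat16):
    # W ~= L1 @ L2 + Q(R): L1/L2 hold the top singular directions in
    # 16 bit, the residual R is quantized (int8 here for illustration).
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    L1 = (U[:, :rank] * S[:rank]).to(dtype)          # (out_features, rank)
    L2 = Vh[:rank, :].to(dtype)                      # (rank, in_features)
    residual = weight.float() - L1.float() @ L2.float()
    scale = residual.abs().max().clamp(min=1e-12) / 127.0
    r_q = torch.round(residual / scale).clamp(-127, 127).to(torch.int8)
    return L1, L2, r_q, scale

def forward(x, L1, L2, r_q, scale):
    # Low-rank branch in bfloat16 plus the quantized residual branch.
    low_rank = (x @ L2.t()) @ L1.t()
    residual = x @ (r_q.to(x.dtype) * scale.to(x.dtype)).t()
    return low_rank + residual

weight = torch.randn(1024, 1024)
L1, L2, r_q, scale = svdquant_weight(weight)
x = torch.randn(4, 1024, dtype=torch.bfloat16)
y = forward(x, L1, L2, r_q, scale)
```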