Quantization
These recommendations are constantly evolving as techniques improve and more testing is done. Check back often, or ask in the Discord, to see what is happening now.
For the best accuracy, bfloat16 or GGUF with a Q8 .gguf file is recommended. Everything else has a trade-off between quality and speed.
For the best training speed, use either int W8A8 or GGUF A8 int. int W8A8 is faster than GGUF A8 int. However, if using a smaller GGUF (for example Q4_K_S of a larger model) reduces CPU offloading, GGUF A8 int can be faster overall.
Some models might show visible artifacts in samples when using int W8A8. Of the tested models, only Qwen seems to be affected, and it does not seem to impact training performance**. It can be fixed by enabling SVDQuant, or by using float W8A8 instead. float W8A8 is more accurate but slower than int W8A8; int W8A8 with SVDQuant is still faster than float W8A8 and seems to be accurate enough.
Image: Qwen INT 8 vs INT 8 SVDQuant vs BF16 comparison
All options with A8 are faster only if Compile transformer blocks is ALSO enabled. An Nvidia RTX 40xx or higher card is required to use the float A8 types. The int A8 types require an RTX 30xx or higher card.
Use a Quantization Layer Filter with preset blocks for most models. Some models might not suffer from full quantization, but others do.
** statements based on limited tests / more tests required
bfloat16
Baseline.
float8
Similar speed to bfloat16, but less VRAM. Accurate enough for most models, but a Quantization Layer Filter is still recommended.
Technical: Tensor-wise quantization of weights in float8 e4m3. Dequantized to your Train Data Type during training.
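A minimal sketch of this weight-only quantization, assuming simple absmax scaling (illustration only, not the exact implementation; the function names are made up):

```python
import torch

FP8_MAX = 448.0  # largest value representable in float8 e4m3

def quantize_weight_fp8(weight: torch.Tensor):
    # One scale for the whole tensor: map the largest absolute weight
    # onto the float8 e4m3 maximum.
    scale = weight.abs().max().clamp(min=1e-12) / FP8_MAX
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_weight(w_fp8: torch.Tensor, scale: torch.Tensor, train_dtype=torch.bfloat16):
    # Dequantized back to the Train Data Type before each forward pass.
    return (w_fp8.to(torch.float32) * scale).to(train_dtype)

weight = torch.randn(4096, 4096)
w_fp8, scale = quantize_weight_fp8(weight)    # stored: ~1 byte per weight
w_train = dequantize_weight(w_fp8, scale)     # used in normal bfloat16 matmuls
```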
int8
Similar speed to bfloat16.
Not accurate enough for most models. Accuracy can be improved with SVDQuant and could then be on par with the other data types, but this has not been tested because GGUF models are preferred for low-bit quantization.
Technical: Block-wise quantization of weights. Dequantized to the Train Data Type during training.
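The difference from the tensor-wise sketch above is only the granularity of the scales: each fixed-size block of weights gets its own scale. A rough sketch, assuming int8 values and an arbitrary block size of 64:

```python
import torch

def quantize_blockwise_int8(weight: torch.Tensor, block_size: int = 64):
    # Each block of 64 consecutive values gets its own scale, so a single
    # outlier only hurts the precision of its own block.
    blocks = weight.reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    q = torch.round(blocks / scales).clamp(-127, 127).to(torch.int8)
    return q, scales

def dequantize_blockwise(q, scales, shape, train_dtype=torch.bfloat16):
    return (q.to(torch.float32) * scales).reshape(shape).to(train_dtype)
```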
float W8A8
Much faster. In theory less accurate than float8, but tests show similar training behavior.
Technical: Tensor-wise quantization of weights in float8 e4m3. Activations are quantized token-wise during training, and Linear layer matrix multiplications are performed in float8.
Nvidia RTX 40xx card or newer required.
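A simplified sketch of the W8A8 idea: weights carry one scale per tensor, activations one scale per token, and the Linear matmul runs on the 8-bit values. The function names and absmax scaling are assumptions, and the float8 matmul is emulated in bfloat16 so the sketch runs anywhere; real kernels keep it in float8 on supported GPUs.

```python
import torch

FP8_MAX = 448.0

def quantize_tokenwise_fp8(x: torch.Tensor):
    # One scale per token (per row of the flattened activations).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def quantize_tensorwise_fp8(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def linear_w8a8_float(x, w_fp8, w_scale):
    x_fp8, x_scale = quantize_tokenwise_fp8(x)
    # Emulated: cast back to bfloat16 for the matmul. On an RTX 40xx or
    # newer the multiplication itself is done in float8.
    y = x_fp8.to(torch.bfloat16) @ w_fp8.to(torch.bfloat16).t()
    return y * (x_scale * w_scale)

x = torch.randn(8, 4096, dtype=torch.bfloat16)           # 8 tokens
w_fp8, w_scale = quantize_tensorwise_fp8(torch.randn(4096, 4096))
y = linear_w8a8_float(x, w_fp8, w_scale)
```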
int W8A8
Even faster than float W8A8.
No issues on some models, but visible artifacts with Qwen. The artifacts can be removed with SVDQuant at very little performance cost (still faster than float W8A8). Otherwise similar training behavior.
Technical: Tensor-wise quantization of weights in int8. Activations are quantized token-wise during training, and Linear layer matrix multiplications are performed in int8.
Nvidia RTX 30xx card or newer required.
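The int path mirrors the float sketch above, with values rounded into [-127, 127] instead of cast to float8. The distinctive part is the integer matmul, which accumulates in int32 (emulated here on plain tensors rather than int8 tensor cores; names are illustrative):

```python
import torch

def linear_w8a8_int(x_q: torch.Tensor, x_scale: torch.Tensor,
                    w_q: torch.Tensor, w_scale: torch.Tensor):
    # x_q: int8 activations, quantized token-wise into [-127, 127]
    # w_q: int8 weights, quantized tensor-wise
    # The multiply-accumulate is done entirely in integers; on an
    # RTX 30xx or newer this maps to int8 tensor core instructions.
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
    return (acc.to(torch.float32) * (x_scale * w_scale)).to(torch.bfloat16)
```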
GGUF
Loads a quantized model as-is from a GGUF file. Slower than bfloat16, but with Compile transformer blocks it runs at a similar speed**.
The Quantization Layer Filter has no effect: the GGUF file decides which layers are quantized. SVDQuant cannot be used.
Technical: Weights are quantized according to the GGUF file and dequantized to the Train Data Type during training.
GGUF A8 int
Loads a quantized model as-is from a GGUF file. Much faster at similar accuracy.
The Quantization Layer Filter only affects which layers use activation quantization, not which layers are quantized at all - that is still determined by the GGUF file.
Technical: Weights are quantized according to the GGUF file, then re-quantized from GGUF to int8 axis-wise. Activations are also quantized axis-wise (token-wise) during training, and Linear layer matrix multiplications are performed in int8.
GGUF A8 float
Like GGUF A8 int, but slower and with worse training behavior**. Generally not recommended, but it could be useful for some models.
Technical: Weights are quantized according to the GGUF file, then re-quantized from GGUF to float8 e4m3 axis-wise. Activations are also quantized axis-wise (token-wise) during training, and Linear layer matrix multiplications are performed in float8.
SVDQuant
Improves quantization quality. Can be combined with all quantized data types except GGUF. Recommended for int W8A8 with Qwen, or for other models and data types if you see visible artifacts in samples.
An SVDQuant data type of bfloat16 and a Rank of 16 seem to be enough.
Technical: SVDQuant implementation according to https://arxiv.org/abs/2411.05007
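The core idea, sketched on a single weight matrix: split the weight into a small high-precision low-rank part plus a quantized residual, so the directions that are hardest to quantize stay in 16 bit. This is a simplification of the paper (it omits the smoothing step that migrates activation outliers into the weights), and all names are illustrative:

```python
import torch

def svdquant_weight(weight: torch.Tensor, rank: int = 16, dtype=torch.bfloat16):
    # W ~= L1 @ L2 + Q(R): L1/L2 hold the top singular directions in
    # 16 bit, the residual R is quantized (int8 here for illustration).
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    L1 = (U[:, :rank] * S[:rank]).to(dtype)          # (out_features, rank)
    L2 = Vh[:rank, :].to(dtype)                      # (rank, in_features)
    residual = weight.float() - L1.float() @ L2.float()
    scale = residual.abs().max().clamp(min=1e-12) / 127.0
    r_q = torch.round(residual / scale).clamp(-127, 127).to(torch.int8)
    return L1, L2, r_q, scale

def forward(x, L1, L2, r_q, scale):
    # Low-rank branch in bfloat16 plus the quantized residual branch.
    low_rank = (x @ L2.t()) @ L1.t()
    residual = x @ (r_q.to(x.dtype) * scale.to(x.dtype)).t()
    return low_rank + residual

weight = torch.randn(1024, 1024)
L1, L2, r_q, scale = svdquant_weight(weight)
x = torch.randn(4, 1024, dtype=torch.bfloat16)
y = forward(x, L1, L2, r_q, scale)
```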