On Colab Tesla T4, should bnb_4bit_compute_dtype be FP16 rather than BF16 for 4-bit inference? #1903
Hi, I tested 4-bit inference on free Google Colab with a Tesla T4 and compared FP16 and BF16 as the `bnb_4bit_compute_dtype`. In my benchmark, FP16 was faster and used less GPU memory than BF16, even though BF16 also loaded and ran correctly. Is this expected on a T4? More generally, even when BF16 works, should users usually prefer FP16 for `bnb_4bit_compute_dtype` on this type of GPU?

```python
import time
import gc
import torch
import transformers
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bnb.__version__)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain in one paragraph why quantization can reduce memory usage."

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()

def bench(compute_dtype):
    cleanup()
    torch.cuda.reset_peak_memory_stats()
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model.eval()
    # Confirm the compute dtype actually set on the first 4-bit layer.
    for name, module in model.named_modules():
        if "Linear4bit" in type(module).__name__:
            print(f"{name}: compute_dtype={module.compute_dtype}, quant_storage={module.quant_storage}")
            break
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Warm-up generation so one-time CUDA setup cost is not timed.
    with torch.inference_mode():
        _ = model.generate(
            **inputs,
            max_new_tokens=32,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    torch.cuda.synchronize()
    start = time.time()
    with torch.inference_mode():
        out = model.generate(
            **inputs,
            max_new_tokens=96,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    torch.cuda.synchronize()
    elapsed = time.time() - start
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
    generated_ids = out[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print("requested dtype:", compute_dtype)
    print("elapsed_sec:", round(elapsed, 3))
    print("peak_mem_gb:", round(peak_mem_gb, 3))
    print("generated text:", text[:200])
    print("-" * 80)

bench(torch.float16)
bench(torch.bfloat16)
```

Thanks!
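One small refinement to the benchmark above: reporting tokens per second instead of raw elapsed time makes runs with different `max_new_tokens` directly comparable. A minimal helper (hypothetical, not part of the original script):

```python
def tokens_per_second(num_new_tokens, elapsed_sec):
    """Throughput of a generate() call; guards against a zero/negative timer."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return num_new_tokens / elapsed_sec

# e.g. 96 new tokens generated in 3.2 s
print(tokens_per_second(96, 3.2))  # prints 30.0
```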
Replies: 2 comments
@neetscience Yes, I think this is expected. Your benchmark suggests that bitsandbytes is respecting the dtype you pass in, but on a T4, FP16 is simply the better option in practice. Even if BF16 loads and runs fine, it can still be slower and use more memory on this GPU. So on a Tesla T4 I'd recommend starting with `bnb_4bit_compute_dtype=torch.float16` and only trying BF16 if you specifically want to benchmark it.
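If you want to pick the dtype automatically rather than hard-code it, you can key off the CUDA compute capability. A small sketch (the helper name is mine, not from any library; the capability tuples come from `torch.cuda.get_device_capability()`):

```python
def pick_compute_dtype_name(capability):
    """Map a CUDA (major, minor) compute capability to a compute dtype name.

    Ampere (8.x) and newer GPUs have native BF16 units; Turing cards like
    the T4, which reports (7, 5), do not, so FP16 is the safer default there.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"

# Hypothetical usage on a live GPU:
#   name = pick_compute_dtype_name(torch.cuda.get_device_capability(0))
#   compute_dtype = getattr(torch, name)  # torch.float16 on a T4
print(pick_compute_dtype_name((7, 5)))  # prints float16
```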
Yes, that is expected on a T4. The T4 is a Turing GPU and does not have the native BF16 throughput of Ampere and newer cards. So for 4-bit inference on a Colab T4, prefer `bnb_4bit_compute_dtype=torch.float16`.
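Putting that together, a T4-friendly load configuration would look like the following sketch (model ID and quantization settings taken from the question; only the compute dtype choice is the point here):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # FP16 on Turing (T4); BF16 only on Ampere+
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```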