On Colab Tesla T4, should bnb_4bit_compute_dtype be FP16 rather than BF16 for 4-bit inference? #1903
Hi, I tested 4-bit inference on free Google Colab with a Tesla T4 and compared FP16 and BF16 as the `bnb_4bit_compute_dtype`. In my benchmark, FP16 was faster and used less GPU memory than BF16, even though BF16 also loaded and ran correctly. Is this expected on a T4? More generally, even when BF16 works, should users usually prefer FP16 for `bnb_4bit_compute_dtype` on this type of GPU?

```python
import time
import gc
import torch
import transformers
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bnb.__version__)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain in one paragraph why quantization can reduce memory usage."

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()

def bench(compute_dtype):
    cleanup()
    torch.cuda.reset_peak_memory_stats()
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model.eval()
    # Confirm the compute dtype actually set on the first 4-bit layer.
    for name, module in model.named_modules():
        if "Linear4bit" in type(module).__name__:
            print(f"{name}: compute_dtype={module.compute_dtype}, quant_storage={module.quant_storage}")
            break
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Warm-up generation so one-time CUDA setup cost is not timed.
    with torch.inference_mode():
        _ = model.generate(
            **inputs,
            max_new_tokens=32,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    torch.cuda.synchronize()
    start = time.time()
    with torch.inference_mode():
        out = model.generate(
            **inputs,
            max_new_tokens=96,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    torch.cuda.synchronize()
    elapsed = time.time() - start
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
    generated_ids = out[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print("requested dtype:", compute_dtype)
    print("elapsed_sec:", round(elapsed, 3))
    print("peak_mem_gb:", round(peak_mem_gb, 3))
    print("generated text:", text[:200])
    print("-" * 80)

bench(torch.float16)
bench(torch.bfloat16)
```

Thanks!
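One small refinement to the benchmark above: reporting tokens per second instead of raw elapsed time makes runs with different `max_new_tokens` directly comparable. A minimal helper (hypothetical, not part of the original script):

```python
def tokens_per_second(num_new_tokens, elapsed_sec):
    """Throughput of a generate() call; guards against a zero/negative timer."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return num_new_tokens / elapsed_sec

# e.g. 96 new tokens generated in 3.2 s
print(tokens_per_second(96, 3.2))  # prints 30.0
```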
Replies: 2 comments
@neetscience Yes, I think this is expected. Your benchmark suggests that bitsandbytes is respecting the dtype you pass in, but on a T4, FP16 is simply the better option in practice. Even if BF16 loads and runs fine, it can still be slower and use more memory on this GPU. So on a Tesla T4 I'd recommend starting with `bnb_4bit_compute_dtype=torch.float16` and only trying BF16 if you specifically want to benchmark it.
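If you want to pick the dtype automatically rather than hard-code it, you can key off the CUDA compute capability. A small sketch (the helper name is mine, not from any library; the capability tuples come from `torch.cuda.get_device_capability()`):

```python
def pick_compute_dtype_name(capability):
    """Map a CUDA (major, minor) compute capability to a compute dtype name.

    Ampere (8.x) and newer GPUs have native BF16 units; Turing cards like
    the T4, which reports (7, 5), do not, so FP16 is the safer default there.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"

# Hypothetical usage on a live GPU:
#   name = pick_compute_dtype_name(torch.cuda.get_device_capability(0))
#   compute_dtype = getattr(torch, name)  # torch.float16 on a T4
print(pick_compute_dtype_name((7, 5)))  # prints float16
```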
Yes, that is expected on a T4. The T4 is a Turing GPU and does not have the native BF16 throughput of Ampere and newer cards. So for 4-bit inference on a Colab T4, prefer `bnb_4bit_compute_dtype=torch.float16`.
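Putting that together, a T4-friendly load configuration would look like the following sketch (model ID and quantization settings taken from the question; only the compute dtype choice is the point here):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # FP16 on Turing (T4); BF16 only on Ampere+
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```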