Description
Hi, I am running inference with code based on the "Quickstart with HuggingFace" example.
Setup
torch dtype: auto (bfloat16)
GPU: NVIDIA RTX 6000
CUDA: 12.8
Expected Behavior
Model should generate text normally without CUDA asserts.
Actual Behavior
Generation crashes with device-side assert shortly after starting. Message indicates probability tensor contains invalid values (inf/nan < 0), pointing to sampling instability in generate().
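For reference, the same check can be reproduced on CPU, where PyTorch raises it as an ordinary Python exception instead of a device-side assert: torch.multinomial rejects any probability tensor containing inf, nan, or negative entries, which is exactly what the sampling step in generate() calls into. This is a standalone reproduction of the assertion message, not code from the repository:

```python
import torch

# A probability tensor containing NaN, as would result from unstable
# logits passing through softmax during sampling.
probs = torch.tensor([0.5, float("nan"), 0.25])

try:
    # On CUDA this check fires as a device-side assert; on CPU it raises
    # a RuntimeError with the same "inf, nan or element < 0" message.
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```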
Could you please advise on recommended generation settings (e.g., forcing dtype=torch.bfloat16 explicitly, disabling sampling with do_sample=False, wrapping the call in torch.inference_mode(), or any required preprocessing)?
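Concretely, the combination of workarounds I am asking about would look roughly like the sketch below. The model id and processor call are placeholders, not the exact contents of ds/inference_hf.py, and running it requires downloading the actual checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder: substitute the actual LLaVA-OneVision-1.5 checkpoint id.
model_id = "<llava-onevision-1.5-checkpoint>"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # explicit dtype instead of the deprecated torch_dtype="auto"
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id)

inputs = processor(text="...", return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,  # greedy decoding skips multinomial sampling entirely
    )
```

With do_sample=False the multinomial sampling step that triggers the assert is bypassed, so even if this is only a workaround it would help confirm whether the instability is in the logits themselves.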
Error Trace
(LLaVA-OneVision-1.5) sxxx@yyyyy:~/LLaVA-OneVision-1.5$ python ds/inference_hf.py
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████| 2/2 [00:01<00:00, 1.42it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either inf, nan or element < 0` failed.
Traceback (most recent call last):
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/ds/inference_hf.py", line 41, in <module>
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2564, in generate
    result = decoding_method(
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2779, in _sample
    while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2597, in _has_unfinished_sequences
    elif this_peer_finished:
torch.AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.