Description
Hi, I am running inference with code based on the "Quickstart with HuggingFace" example.
Setup
torch dtype: auto (bfloat16)
GPU: NVIDIA RTX 6000
CUDA: 12.8
Expected Behavior
Model should generate text normally without CUDA asserts.
Actual Behavior
Generation crashes with device-side assert shortly after starting. Message indicates probability tensor contains invalid values (inf/nan < 0), pointing to sampling instability in generate().
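For reference, the same check can be reproduced on CPU, where PyTorch raises it as an ordinary Python exception instead of a device-side assert: torch.multinomial rejects any probability tensor containing inf, nan, or negative entries, which is exactly what the sampling step in generate() calls into. This is a standalone reproduction of the assertion message, not code from the repository:

```python
import torch

# A probability tensor containing NaN, as would result from unstable
# logits passing through softmax during sampling.
probs = torch.tensor([0.5, float("nan"), 0.25])

try:
    # On CUDA this check fires as a device-side assert; on CPU it raises
    # a RuntimeError with the same "inf, nan or element < 0" message.
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```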
Could you please advise on recommended generation settings (e.g., forcing dtype=torch.bfloat16 explicitly, disabling sampling with do_sample=False, wrapping the call in torch.inference_mode(), or any required preprocessing)?
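Concretely, the combination of workarounds I am asking about would look roughly like the sketch below. The model id and processor call are placeholders, not the exact contents of ds/inference_hf.py, and running it requires downloading the actual checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder: substitute the actual LLaVA-OneVision-1.5 checkpoint id.
model_id = "<llava-onevision-1.5-checkpoint>"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # explicit dtype instead of the deprecated torch_dtype="auto"
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id)

inputs = processor(text="...", return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,  # greedy decoding skips multinomial sampling entirely
    )
```

With do_sample=False the multinomial sampling step that triggers the assert is bypassed, so even if this is only a workaround it would help confirm whether the instability is in the logits themselves.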
Error Trace
(LLaVA-OneVision-1.5) sxxx@yyyyy:~/LLaVA-OneVision-1.5$ python ds/inference_hf.py
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████| 2/2 [00:01<00:00, 1.42it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either inf, nan or element < 0` failed.
Traceback (most recent call last):
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/ds/inference_hf.py", line 41, in <module>
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2564, in generate
    result = decoding_method(
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2779, in _sample
    while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
  File "/mnt/hdd/sda/samus/LLaVA-OneVision-1.5/.venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2597, in _has_unfinished_sequences
    elif this_peer_finished:
torch.AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.