System Info
Ubuntu 22.04

```
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ nvidia-smi
Fri Aug  1 08:41:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070        Off |   00000000:3B:00.0 Off |                  N/A |
|  0%   31C    P8              2W / 250W  |     944MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5070        Off |   00000000:86:00.0 Off |                  N/A |
|  0%   31C    P8              1W / 250W  |       4MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           35476      C   /usr/bin/python3                        934MiB |
+-----------------------------------------------------------------------------------------+
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ python --version
Python 3.10.12
```
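The launcher log in the Reproduction section below also prints `WARN ... Unkown compute for card nvidia-geforce-rtx-5070`, i.e. TGI does not recognize this GPU in its compute table. As a hedged diagnostic (not part of the original report; it assumes PyTorch is installed in the same venv), the following dumps what the runtime sees for each card:

```python
# Hypothetical diagnostic: print each visible GPU's name, compute capability,
# and total memory. The RTX 5070 is a Blackwell-generation card (reported as
# sm_120 by recent CUDA builds), which may postdate TGI's lookup tables.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(props.name,
          f"sm_{props.major}{props.minor}",
          f"{props.total_memory / 2**30:.2f} GiB")
```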
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
```
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ text-generation-launcher --model-id Qwen/Qwen2.5-7B-Instruct --max-total-tokens=32768 --max-input-tokens=32767 --port 8000
2025-08-01T08:42:28.742245Z INFO text_generation_launcher: Args {
    model_id: "Qwen/Qwen2.5-7B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        32767,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        32768,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 8000,
    prometheus_port: 9000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-08-01T08:42:30.444175Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-08-01T08:42:30.519689Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-5070
2025-08-01T08:42:30.591351Z WARN text_generation_launcher: Not enough VRAM to run the model: Available: 12.18GB - Model 14.31GB.
2025-08-01T08:42:30.591368Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2025-08-01T08:42:30.591380Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-08-01T08:42:30.591587Z INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2.5-7B-Instruct
2025-08-01T08:42:39.323100Z ERROR download: text_generation_launcher: Download encountered an error:
2025-08-01 08:42:32.368 | INFO | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
Traceback (most recent call last):
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/kernels.py", line 15, in load_kernel
    m = importlib.import_module(module)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'punica_sgmv'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/bin/text-generation-server", line 4, in <module>
    from text_generation_server.cli import app
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/cli.py", line 10, in <module>
    from text_generation_server.utils.adapter import parse_lora_adapters
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/adapter.py", line 17, in <module>
    from text_generation_server.adapters.lora import LoraConfig
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/adapters/lora.py", line 24, in <module>
    punica_sgmv = load_kernel(
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/kernels.py", line 19, in load_kernel
    return hf_load_kernel(repo_id=repo_id)
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/kernels/utils.py", line 171, in load_kernel
    snapshot_download(
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py", line 219, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
Error: DownloadError
```
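The failure chain above: `punica_sgmv` is not importable locally, so TGI's `load_kernel` wrapper falls back to the `kernels` package, whose `load_kernel` calls `huggingface_hub.snapshot_download`; that raises `LocalEntryNotFoundError` because outgoing traffic is disabled (typically `HF_HUB_OFFLINE=1` in the environment, or no network access). A minimal workaround sketch, assuming the LoRA kernel lives in a Hub repo such as `kernels-community/punica-sgmv` (the exact `repo_id` is whatever `adapters/lora.py` passes to `load_kernel`; verify it before running):

```python
# Hedged sketch: while online, pre-populate the local Hugging Face cache so
# that load_kernel() can resolve a cached snapshot even in offline mode.
# NOTE: "kernels-community/punica-sgmv" is an assumption; check the repo_id
# actually used in server/text_generation_server/adapters/lora.py.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="kernels-community/punica-sgmv")
```

Once the snapshot is cached, the download step should succeed even with `HF_HUB_OFFLINE=1`.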
Expected behavior
The model downloads (or resolves from the local cache) and the LLM server starts serving requests.
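Note that independently of the download error, the launcher already warned `Not enough VRAM to run the model: Available: 12.18GB - Model 14.31GB`, and neither `--num-shard` nor `--quantize` was set. A rough sanity check of that number (my arithmetic, not from the log):

```python
# Back-of-the-envelope VRAM estimate: Qwen2.5-7B-Instruct has roughly
# 7.6e9 parameters; in bf16 (2 bytes per parameter) the weights alone
# take ~14.2 GiB, which exceeds a single 12 GiB RTX 5070.
params = 7.6e9        # approximate parameter count
bytes_per_param = 2   # bfloat16
print(f"{params * bytes_per_param / 2**30:.2f} GiB")  # ~14.16 GiB
```

So even with the kernel download fixed, serving this model unquantized on one card is unlikely to fit; sharding across both GPUs (`--num-shard 2`) or quantizing would probably be needed.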