
ModuleNotFoundError: No module named 'punica_sgmv' #3306

@xxz7909

Description

System Info

Ubuntu 22.04
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ nvidia-smi
Fri Aug  1 08:41:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070        Off |   00000000:3B:00.0 Off |                  N/A |
|  0%   31C    P8              2W /  250W |     944MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5070        Off |   00000000:86:00.0 Off |                  N/A |
|  0%   31C    P8              1W /  250W |       4MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           35476      C   /usr/bin/python3                        934MiB |
+-----------------------------------------------------------------------------------------+

(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ python --version
Python 3.10.12
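
For what it's worth, the "Unkown compute for card nvidia-geforce-rtx-5070" warning in the launcher log below suggests TGI does not recognize this card's compute capability. A small diagnostic sketch for recording what the venv's torch build actually reports (assuming torch is importable in the same .venv):

```python
# Diagnostic sketch: print the compute capability torch reports for each GPU.
# Context for the "Unkown compute for card" warning in the launcher log below.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
```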

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ text-generation-launcher --model-id Qwen/Qwen2.5-7B-Instruct --max-total-tokens=32768 --max-input-tokens=32767 --port 8000
2025-08-01T08:42:28.742245Z INFO text_generation_launcher: Args {
    model_id: "Qwen/Qwen2.5-7B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        32767,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        32768,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 8000,
    prometheus_port: 9000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-08-01T08:42:30.444175Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-08-01T08:42:30.519689Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-5070
2025-08-01T08:42:30.591351Z WARN text_generation_launcher: Not enough VRAM to run the model: Available: 12.18GB - Model 14.31GB.
2025-08-01T08:42:30.591368Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2025-08-01T08:42:30.591380Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-08-01T08:42:30.591587Z INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2.5-7B-Instruct
2025-08-01T08:42:39.323100Z ERROR download: text_generation_launcher: Download encountered an error:
2025-08-01 08:42:32.368 | INFO | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
Traceback (most recent call last):
File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/kernels.py", line 15, in load_kernel
m = importlib.import_module(module)
File "/usr/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'punica_sgmv'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/bin/text-generation-server", line 4, in
from text_generation_server.cli import app
File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/cli.py", line 10, in
from text_generation_server.utils.adapter import parse_lora_adapters
File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/adapter.py", line 17, in
from text_generation_server.adapters.lora import LoraConfig
File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/adapters/lora.py", line 24, in
punica_sgmv = load_kernel(
File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/kernels.py", line 19, in load_kernel
return hf_load_kernel(repo_id=repo_id)
File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/kernels/utils.py", line 171, in load_kernel
snapshot_download(
File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py", line 219, in snapshot_download
raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
Error: DownloadError
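
The DownloadError is a symptom rather than the root cause: at import time, text_generation_server/adapters/lora.py asks the kernels package to load punica_sgmv from the Hugging Face Hub, the snapshot is not in the local cache, and outgoing traffic is disabled, so the server crashes before the model download is even attempted. A minimal sketch of warming the cache from a machine with network access; the repo id below is an assumption, so verify it against the repo_id actually passed to load_kernel in adapters/lora.py:

```python
# Sketch: warm the local HF cache with the punica_sgmv kernel so a later
# offline run of text-generation-launcher can resolve it.
import os

# huggingface_hub reads HF_HUB_OFFLINE at import time, so clear it before
# importing, for this one-off download only.
os.environ.pop("HF_HUB_OFFLINE", None)

from huggingface_hub import snapshot_download

# Assumption: kernels-community/punica-sgmv is the repo_id TGI requests;
# check adapters/lora.py before relying on this.
snapshot_download(repo_id="kernels-community/punica-sgmv")
```

Alternatively, running the launcher once with offline mode unset should let the kernels package fetch the snapshot itself.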

Expected behavior

The model should download (or resolve from the local cache), load, and serve requests.
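
Two things block that here. First, the kernel import has to succeed; a quick isolated check mirroring the call path in the traceback (same repo-id assumption as above):

```python
# Sketch: reproduce the failing load in isolation to confirm the cache is warm.
from kernels import load_kernel  # the entry point the traceback goes through

punica_sgmv = load_kernel(repo_id="kernels-community/punica-sgmv")  # assumed repo id
print(punica_sgmv)  # should print a module object once the snapshot resolves
```

Second, the launcher already warns "Not enough VRAM to run the model: Available: 12.18GB - Model 14.31GB", so even with the kernel fixed, this model will likely need to be sharded across both cards (e.g. --num-shard 2) or quantized (--quantize) to fit.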
