System Info
Ubuntu 22.04

```
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ nvidia-smi
Fri Aug  1 08:41:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070        Off |   00000000:3B:00.0 Off |                  N/A |
|  0%   31C    P8              2W / 250W  |     944MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5070        Off |   00000000:86:00.0 Off |                  N/A |
|  0%   31C    P8              1W / 250W  |       4MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           35476      C   /usr/bin/python3                        934MiB |
+-----------------------------------------------------------------------------------------+
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ python --version
Python 3.10.12
```
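The launcher log in the Reproduction section below also prints `WARN ... Unkown compute for card nvidia-geforce-rtx-5070`, i.e. TGI does not recognize this GPU in its compute table. As a hedged diagnostic (not part of the original report; it assumes PyTorch is installed in the same venv), the following dumps what the runtime sees for each card:

```python
# Hypothetical diagnostic: print each visible GPU's name, compute capability,
# and total memory. The RTX 5070 is a Blackwell-generation card (reported as
# sm_120 by recent CUDA builds), which may postdate TGI's lookup tables.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(props.name,
          f"sm_{props.major}{props.minor}",
          f"{props.total_memory / 2**30:.2f} GiB")
```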
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
```
(.venv) hbyb@hbyb:~/.cache/huggingface/hub$ text-generation-launcher --model-id Qwen/Qwen2.5-7B-Instruct --max-total-tokens=32768 --max-input-tokens=32767 --port 8000
2025-08-01T08:42:28.742245Z INFO text_generation_launcher: Args {
    model_id: "Qwen/Qwen2.5-7B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        32767,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        32768,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 8000,
    prometheus_port: 9000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-08-01T08:42:30.444175Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-08-01T08:42:30.519689Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-5070
2025-08-01T08:42:30.591351Z WARN text_generation_launcher: Not enough VRAM to run the model: Available: 12.18GB - Model 14.31GB.
2025-08-01T08:42:30.591368Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2025-08-01T08:42:30.591380Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-08-01T08:42:30.591587Z INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2.5-7B-Instruct
2025-08-01T08:42:39.323100Z ERROR download: text_generation_launcher: Download encountered an error:
2025-08-01 08:42:32.368 | INFO | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
Traceback (most recent call last):
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/kernels.py", line 15, in load_kernel
    m = importlib.import_module(module)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'punica_sgmv'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/bin/text-generation-server", line 4, in <module>
    from text_generation_server.cli import app
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/cli.py", line 10, in <module>
    from text_generation_server.utils.adapter import parse_lora_adapters
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/adapter.py", line 17, in <module>
    from text_generation_server.adapters.lora import LoraConfig
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/adapters/lora.py", line 24, in <module>
    punica_sgmv = load_kernel(
  File "/home/hbyb/TGI-RAG/text-generation-inference/server/text_generation_server/utils/kernels.py", line 19, in load_kernel
    return hf_load_kernel(repo_id=repo_id)
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/kernels/utils.py", line 171, in load_kernel
    snapshot_download(
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hbyb/TGI-RAG/text-generation-inference/.venv/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py", line 219, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
Error: DownloadError
```
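The failure chain above: `punica_sgmv` is not importable locally, so TGI's `load_kernel` wrapper falls back to the `kernels` package, whose `load_kernel` calls `huggingface_hub.snapshot_download`; that raises `LocalEntryNotFoundError` because outgoing traffic is disabled (typically `HF_HUB_OFFLINE=1` in the environment, or no network access). A minimal workaround sketch, assuming the LoRA kernel lives in a Hub repo such as `kernels-community/punica-sgmv` (the exact `repo_id` is whatever `adapters/lora.py` passes to `load_kernel`; verify it before running):

```python
# Hedged sketch: while online, pre-populate the local Hugging Face cache so
# that load_kernel() can resolve a cached snapshot even in offline mode.
# NOTE: "kernels-community/punica-sgmv" is an assumption; check the repo_id
# actually used in server/text_generation_server/adapters/lora.py.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="kernels-community/punica-sgmv")
```

Once the snapshot is cached, the download step should succeed even with `HF_HUB_OFFLINE=1`.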
Expected behavior
The model downloads (or resolves from the local cache) and the LLM server starts serving requests.
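Note that independently of the download error, the launcher already warned `Not enough VRAM to run the model: Available: 12.18GB - Model 14.31GB`, and neither `--num-shard` nor `--quantize` was set. A rough sanity check of that number (my arithmetic, not from the log):

```python
# Back-of-the-envelope VRAM estimate: Qwen2.5-7B-Instruct has roughly
# 7.6e9 parameters; in bf16 (2 bytes per parameter) the weights alone
# take ~14.2 GiB, which exceeds a single 12 GiB RTX 5070.
params = 7.6e9        # approximate parameter count
bytes_per_param = 2   # bfloat16
print(f"{params * bytes_per_param / 2**30:.2f} GiB")  # ~14.16 GiB
```

So even with the kernel download fixed, serving this model unquantized on one card is unlikely to fit; sharding across both GPUs (`--num-shard 2`) or quantizing would probably be needed.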