Text completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
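For example, a minimal completion call might look like the following sketch (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf")

# Calling the Llama object directly is shorthand for create_completion
output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=32,
    stop=["Q:"],
    echo=True,
)
print(output["choices"][0]["text"])
```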
### Pulling models from [Hugging Face Hub](https://huggingface.co/models)
You can download `Llama` models in `gguf` format directly from Hugging Face using the [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) method.
You'll need to install the `huggingface_hub` package to use this feature (`pip install --upgrade huggingface_hub`).
```python
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct-GGUF",
    filename="qwen2.5-0.5b-instruct-q4_k_m.gguf",
    verbose=False
)
```
By default, [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) downloads the model to the Hugging Face cache directory; you can then manage installed model files with the [`hf`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) CLI tool.
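The same cache can also be inspected from Python; a small sketch using `huggingface_hub`'s `scan_cache_dir` (available in recent versions of the package):

```python
from huggingface_hub import scan_cache_dir

# List cached repos and their on-disk size
cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(repo.repo_id, repo.size_on_disk_str)
```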
### Chat Completion
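Chat completion is available through the high-level [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method; a minimal sketch, again with a placeholder model path:

```python
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf")

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ]
)
print(response["choices"][0]["message"]["content"])
```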
---
## Continuing Assistant Responses (Prefill)
`llama-cpp-python` supports native **Assistant Prefill** for continuing a partial assistant message. Pass `assistant_prefill=True` to `create_chat_completion` and the model resumes the final assistant message instead of starting a new turn.
This renders the first `N-1` messages of the conversation with the model's standard Jinja chat template (preserving its exact control tokens) and appends the text of the final, partial assistant message directly to the prompt.
```python
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf")

# An interrupted/partial conversation
messages = [
    {"role": "user", "content": "What are the first 5 planets in the solar system?"},
    {"role": "assistant", "content": "The first 5 planets in our solar system are:\n1. Mercury\n2."},
]

# Continue the partial assistant message rather than starting a new turn
response = llm.create_chat_completion(messages=messages, assistant_prefill=True)
```
## Dynamic LoRA Routing & Control Vectors (Multi-Tenant Serving)
Historically, `llama-cpp-python` supported only "static loading", where a LoRA adapter was permanently baked into the context at initialization; switching personas required reloading the entire model or duplicating it in VRAM.