Commit f9b5313

docs(readme): add documentation for Assistant Prefill features

- Also slightly updated the `huggingface_hub` installation instructions for accuracy.

Signed-off-by: JamePeng <jame_peng@sina.com>

Parent: 3ead090

3 files changed: 46 additions & 10 deletions

README.md (41 additions & 5 deletions)
@@ -344,20 +344,23 @@ By default `llama-cpp-python` generates completions in an OpenAI compatible form
 
 Text completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
 
-### Pulling models from Hugging Face Hub
+### Pulling models from [Hugging Face Hub](https://huggingface.co/models)
 
 You can download `Llama` models in `gguf` format directly from Hugging Face using the [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) method.
-You'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).
+
+You'll need to install the `huggingface_hub` package to use this feature (`pip install --upgrade huggingface_hub`).
+
 
 ```python
 llm = Llama.from_pretrained(
-    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
-    filename="*q8_0.gguf",
+    repo_id="Qwen/Qwen2.5-0.5B-Instruct-GGUF",
+    filename="qwen2.5-0.5b-instruct-q4_k_m.gguf",
     verbose=False
 )
 ```
 
-By default [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) will download the model to the huggingface cache directory, you can then manage installed model files with the [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) tool.
+By default [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) will download the model to the huggingface cache directory, you can then manage installed model files with the [`hf`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) tool.
 
 ### Chat Completion
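The hunk above replaces a wildcard `filename` with an exact one. For wildcard filenames, `from_pretrained` resolves the pattern against the repo's file listing with glob-style matching, roughly as in this sketch (the file list below is illustrative, not a real repo listing):

```python
from fnmatch import fnmatch

# Hypothetical file listing of a GGUF repo (illustrative only)
repo_files = [
    "qwen2.5-0.5b-instruct-q4_k_m.gguf",
    "qwen2.5-0.5b-instruct-q8_0.gguf",
    "README.md",
]

def resolve_filename(files, pattern):
    """Return the files matching a glob-style pattern."""
    return [f for f in files if fnmatch(f, pattern)]

print(resolve_filename(repo_files, "*q8_0.gguf"))
# ['qwen2.5-0.5b-instruct-q8_0.gguf']
```

An exact filename, as in the updated example, simply matches itself, which avoids surprises when a repo later gains more files that happen to match a pattern.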

@@ -521,6 +524,39 @@ llm = Llama.from_pretrained(
 
 ---
 
+## Continuing Assistant Responses (Prefill)
+
+`llama-cpp-python` supports native **Assistant Prefill** for seamless message continuation: pass the `assistant_prefill=True` parameter to `create_chat_completion`.
+
+This renders the first `N-1` messages of the conversation using the model's standard Jinja chat template (preserving exact control tokens) and appends your partial assistant text directly to the prompt.
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(model_path="path/to/model.gguf")
+
+# An interrupted/partial conversation
+messages = [
+    {"role": "user", "content": "What are the first 5 planets in the solar system?"},
+    {"role": "assistant", "content": "The first 5 planets in our solar system are:\n1. Mercury\n2."}
+]
+
+# Continue the generation from the partial assistant message
+response = llm.create_chat_completion(
+    messages=messages,
+    max_tokens=50,
+    assistant_prefill=True  # <--- Enables seamless continuation
+)
+
+prefilled_text = messages[-1]["content"]
+# The model continues from the prefill, e.g. " Venus\n3. Earth..."
+generated_text = response["choices"][0]["message"]["content"]
+
+print(prefilled_text + generated_text)
+```
+
+---
+
 ## Dynamic LoRA Routing & Control Vectors (Multi-Tenant Serving)
 
 Historically, `llama-cpp-python` only supported "static loading" where a LoRA was permanently baked into the context during initialization. Switching personas required reloading the entire model or duplicating it in VRAM.
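The prefill mechanism the new README section describes can be pictured with a small self-contained sketch: render all but the last message with full control tokens, then append the partial assistant text *without* its end-of-turn token, so generation resumes mid-message. The ChatML-style template here is illustrative only, not the library's actual Jinja renderer:

```python
def build_prefill_prompt(messages):
    """Render a conversation ChatML-style, leaving the trailing
    partial assistant message open so generation continues it."""
    prompt = ""
    # Render the first N-1 messages with full control tokens
    for m in messages[:-1]:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Append the partial last message with no end-of-turn token
    last = messages[-1]
    prompt += f"<|im_start|>{last['role']}\n{last['content']}"
    return prompt

messages = [
    {"role": "user", "content": "What are the first 5 planets in the solar system?"},
    {"role": "assistant", "content": "The first 5 planets in our solar system are:\n1. Mercury\n2."},
]
print(build_prefill_prompt(messages))
```

Because the prompt ends exactly at the partial text, the model's next tokens continue the list rather than starting a fresh assistant turn.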

llama_cpp/llama.py (3 additions & 3 deletions)
@@ -3141,8 +3141,8 @@ def from_pretrained(
     **kwargs: Any,
 ) -> "Llama":
     """Create a Llama model from a pretrained model name or path.
-    This method requires the huggingface-hub package.
-    You can install it with `pip install huggingface-hub`.
+    This method requires the huggingface_hub package.
+    You can install it with `pip install --upgrade huggingface_hub`.
 
     Args:
         repo_id: The model repo id.

@@ -3160,7 +3160,7 @@ def from_pretrained(
     except ImportError:
         raise ImportError(
             "Llama.from_pretrained requires the huggingface-hub package. "
-            "You can install it with `pip install huggingface-hub`."
+            "You can install it with `pip install --upgrade huggingface_hub`."
         )
 
     validate_repo_id(repo_id)
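The pattern this hunk touches, importing an optional dependency and re-raising with install instructions, can be sketched generically. The helper name below is hypothetical, not part of llama-cpp-python:

```python
import importlib

def require_optional(module_name, pip_name):
    """Import an optional dependency, or raise an ImportError
    that tells the user how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            f"This feature requires the {module_name} package. "
            f"You can install it with `pip install --upgrade {pip_name}`."
        )

json_mod = require_optional("json", "json")  # stdlib module: import succeeds
```

Deferring the import into the method body keeps `huggingface_hub` optional: users who never call `from_pretrained` never need it installed.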

llama_cpp/llama_chat_format.py (2 additions & 2 deletions)
@@ -3758,8 +3758,8 @@ def from_pretrained(
         from huggingface_hub.utils import validate_repo_id  # type: ignore
     except ImportError:
         raise ImportError(
-            "Llama.from_pretrained requires the huggingface-hub package. "
-            "You can install it with `pip install huggingface-hub`."
+            "Llama.from_pretrained requires the huggingface_hub package. "
+            "You can install it with `pip install --upgrade huggingface_hub`."
         )
 
     validate_repo_id(repo_id)
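After the import succeeds, `validate_repo_id` rejects malformed `"namespace/name"` ids before any network call. A rough stand-in for that check might look like the following; the regex is illustrative only and does not reproduce `huggingface_hub`'s actual rules:

```python
import re

def check_repo_id(repo_id):
    """Rough sanity check for a Hugging Face repo id ("namespace/name").
    Illustrative only -- huggingface_hub's validate_repo_id enforces
    the real rules."""
    if not re.fullmatch(r"[\w.\-]+/[\w.\-]+", repo_id):
        raise ValueError(f"Not a valid repo id: {repo_id!r}")
    return repo_id

check_repo_id("Qwen/Qwen2.5-0.5B-Instruct-GGUF")  # passes
```

Failing fast on a bad id gives a clear error instead of a confusing 404 from the Hub later on.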
