Commit 852a2e9

Update /docs/wiki/core/Llama.md
Signed-off-by: JamePeng <jame_peng@sina.com>
1 parent afd038f commit 852a2e9

1 file changed

Lines changed: 205 additions & 0 deletions

docs/wiki/core/Llama.md

@@ -0,0 +1,205 @@
1+
```yaml
---
title: Llama Class
class_name: Llama
last_updated: 2026-04-23
version_target: "latest"
---
```
9+
10+
## Overview
11+
The `Llama` class is the core, high-level Python wrapper for a `llama.cpp` model. It handles model loading, memory management (KV cache), tokenization, and generation (both base text completion and chat formatting). It includes advanced features like dynamic LoRA routing, hybrid model checkpointing, speculative decoding, and context shifting.
12+
13+
## Constructor (`__init__`)
14+
15+
Initialize the model and context. Note that model loading will immediately allocate RAM/VRAM based on the selected offloading parameters.
16+
17+
### Core Model & Hardware Parameters
18+
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `model_path` | `str` | **Required** | Path to the `.gguf` model file. |
| `n_gpu_layers` | `int` | `0` | Number of layers to offload to GPU. Set to `-1` for all layers. |
| `split_mode` | `int` | `LLAMA_SPLIT_MODE_LAYER` | How to split the model across GPUs (e.g., `LLAMA_SPLIT_MODE_ROW`). |
| `main_gpu` | `int` | `0` | The primary GPU to use for intermediate results or the entire model. |
| `tensor_split` | `List[float]` | `None` | Proportional split of tensors across GPUs (max `LLAMA_MAX_DEVICES`). |
| `use_mmap` | `bool` | `True` | Whether to use memory mapping (mmap) if possible. |
| `use_mlock` | `bool` | `False` | Force the system to keep the model in RAM, preventing swapping. |
| `kv_overrides` | `Dict` | `None` | Key-value overrides for the model metadata (supports bool, int, float, str). |
| `numa` | `Union[bool, int]` | `False` | NUMA strategy (e.g., `GGML_NUMA_STRATEGY_DISTRIBUTE`). |
29+
30+
### Context & Performance Parameters
31+
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `n_ctx` | `int` | `512` | Text context size. Set to `0` to load from model metadata. |
| `n_batch` | `int` | `2048` | Maximum batch size for prompt processing. |
| `n_ubatch` | `int` | `512` | Physical batch size. |
| `n_threads` | `int` | `None` | Number of threads for generation (defaults to CPU count // 2). |
| `n_threads_batch` | `int` | `None` | Number of threads for batch processing (defaults to CPU count). |
| `flash_attn_type` | `int` | `AUTO` | Controls Flash Attention activation (`LLAMA_FLASH_ATTN_TYPE_AUTO`). |
| `swa_full` | `bool` | `None` | Whether to use a full-size SWA cache. |
| `kv_unified` | `bool` | `None` | Use a single unified KV buffer for the KV cache of all sequences. |
| `type_k` / `type_v` | `int` | `None` | KV cache data type for K and V (defaults to `f16`). |
| `offload_kqv` | `bool` | `True` | Whether to offload K, Q, V tensors to GPU. |
43+
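To get a feel for how `n_ctx` and the `f16` default of `type_k` / `type_v` affect memory, a rough KV cache estimate can help. This is a back-of-the-envelope sketch for plain full-attention layers only; the layer count, KV head count, and head dimension below are illustrative assumptions, and SWA, hybrid, or quantized caches will differ.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V tensor per layer.

    bytes_per_elem=2 corresponds to f16. Illustrative formula, not library code.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Assumed model shape: 32 layers, 8 KV heads of dimension 128, 4096-token context.
size = kv_cache_bytes(n_layers=32, n_ctx=4096, n_kv_heads=8, head_dim=128)
print(size / 2**20)  # 512.0 (MiB)
```

Under these assumptions the estimate scales linearly with `n_ctx`, so doubling the context doubles the cache.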
44+
### Advanced & Chat Parameters
45+
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `chat_format` | `str` | `None` | String specifying the chat template (e.g., `"llama-2"`, `"chatml"`). Guessed from GGUF metadata if `None`. |
| `chat_handler` | `LlamaChatCompletionHandler` | `None` | Optional custom handler. See [[ChatHandlers]]. |
| `draft_model` | `LlamaDraftModel` | `None` | Optional draft model for speculative decoding. |
| `ctx_checkpoints` | `int` | `32` | Max context checkpoints per slot (Hybrid/SWA models). |
| `checkpoint_interval` | `int` | `4096` | Token interval for saving Hybrid model checkpoints. |
52+
53+
*(Note: There are numerous additional RoPE/YaRN scaling parameters available for specialized context extension. Refer to the source code for the full list).*
54+
55+
---
56+
57+
## Core Methods
58+
59+
### `create_chat_completion`
60+
Generates a chat response using the configured `chat_format` or `chat_handler`.
61+
```python
import llama_cpp

model = llama_cpp.Llama(model_path="models/qwen2.5-7b-instruct.gguf", n_gpu_layers=-1)

response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV caching."}
    ],
    temperature=0.7,
    max_tokens=2048
)
print(response["choices"][0]["message"]["content"])
```
76+
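When streaming is enabled, the reply arrives in pieces rather than as one response dictionary. The sketch below assumes that streamed chat chunks follow the OpenAI-style shape (`choices[0]["delta"]` with an optional `content` key); that shape is not shown in this document, so treat it as an assumption. The helper name `collect_stream` is hypothetical, and fake chunks stand in for a real model.

```python
from typing import Any, Dict, Iterable


def collect_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Join the text pieces of an OpenAI-style streamed chat reply."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:  # role-only and final chunks carry no content
            parts.append(piece)
    return "".join(parts)


# Fake chunks standing in for create_chat_completion(..., stream=True).
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "KV caching "}}]},
    {"choices": [{"delta": {"content": "reuses past keys and values."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(collect_stream(fake_chunks))
```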
77+
### `create_completion` / `__call__`
78+
Generates standard text completion from a raw string prompt.
79+
```python
import llama_cpp

model = llama_cpp.Llama(model_path="models/llama-3-8b.gguf")
output = model("The capital of Japan is", max_tokens=10, stop=["\n"])
print(output["choices"][0]["text"])
```
86+
87+
### `generate`
88+
A low-level generator yielding token IDs one by one. Highly customizable with sampling parameters, dynamic LoRA mounting, and control vectors.
89+
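```python
import llama_cpp

model = llama_cpp.Llama(model_path="models/llama-3-8b.gguf")
tokens = model.tokenize(b"def fibonacci(n):")

# generate() keeps yielding tokens, so stop explicitly at end-of-sequence.
for token in model.generate(tokens, top_k=40, top_p=0.95, temp=0.2):
    if token == model.token_eos():
        break
    # errors="ignore" avoids a UnicodeDecodeError on partial multi-byte tokens.
    print(model.detokenize([token]).decode("utf-8", errors="ignore"), end="", flush=True)
```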
98+
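Because `generate` yields one token ID at a time, a single character can be split across two tokens, and decoding each token on its own may fail or drop bytes. The sketch below shows one way to handle this with an incremental UTF-8 decoder plus an explicit token budget. The helper `stream_text` and the tiny fake vocabulary are hypothetical and only stand in for `model.generate(...)` and `model.detokenize(...)`.

```python
import codecs
from typing import Callable, Iterable, Iterator, List


def stream_text(token_ids: Iterable[int],
                detokenize: Callable[[List[int]], bytes],
                eos_id: int,
                max_tokens: int) -> Iterator[str]:
    """Yield decoded text, buffering incomplete multi-byte sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for count, token in enumerate(token_ids):
        if token == eos_id or count >= max_tokens:
            break
        text = decoder.decode(detokenize([token]))
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush any dangling bytes
    if tail:
        yield tail


# Fake vocabulary: the two bytes of "é" arrive as separate tokens.
vocab = {0: b"", 1: b"caf", 2: b"\xc3", 3: b"\xa9"}
fake_detokenize = lambda ids: b"".join(vocab[i] for i in ids)

print("".join(stream_text([1, 2, 3, 0, 1], fake_detokenize, eos_id=0, max_tokens=16)))
```

With a real model, the token iterator would be the generator returned by `generate` and `eos_id` the model's end-of-sequence token.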
99+
### `eval`
100+
Low-level method to ingest and evaluate a sequence of tokens. Used internally to update the KV cache and logits. Handles **Context Shifting** automatically to prevent OOM when the token count exceeds `n_ctx`.
101+
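```python
# Evaluates a chunk of tokens and updates internal state.
# Assumes `model` is an existing Llama instance and that "coding_adapter"
# was registered earlier with load_lora().
model.eval(tokens=[1, 453, 234, 987], active_loras=[{"name": "coding_adapter", "scale": 1.0}])
```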
105+
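The context shift mentioned above can be pictured on a plain token list. The sketch below uses the policy found in llama.cpp's example programs (keep the first `n_keep` tokens, drop the older half of the remainder); whether this wrapper applies exactly the same policy is an assumption, and `context_shift` is a hypothetical name.

```python
from typing import List


def context_shift(tokens: List[int], n_keep: int) -> List[int]:
    """Drop the oldest half of the tokens that follow the protected prefix."""
    if not 0 <= n_keep <= len(tokens):
        raise ValueError("n_keep must be between 0 and len(tokens)")
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2
    return tokens[:n_keep] + tokens[n_keep + n_discard:]


# Ten tokens in a full context, with the first two protected.
print(context_shift(list(range(10)), n_keep=2))  # [0, 1, 6, 7, 8, 9]
```

In the real engine the KV cache entries are shifted in place rather than rebuilt, so this only illustrates which tokens survive.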
106+
### Dynamic LoRA Management
107+
The `Llama` class allows you to load multiple LoRAs into VRAM and apply them dynamically per-generation or per-eval.
108+
* `load_lora(name: str, path: str)`: Loads an adapter into VRAM (does not apply it yet).
* `unload_lora(name: str)`: Releases the specific LoRA from VRAM.
* `list_loras() -> List[str]`: Returns names of all registered LoRAs.
* `unload_all_loras()`: Forces VRAM release for all loaded adapters.
112+
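Since an adapter has to be loaded before it can be applied, it can be useful to check an `active_loras` list against the names returned by `list_loras()` before calling the model. The helper below is a hypothetical sketch, not a library function, and the default scale of `1.0` for entries without a `scale` key is an assumption.

```python
from typing import Any, Dict, Iterable, List


def validate_active_loras(active: Iterable[Dict[str, Any]],
                          registered: Iterable[str]) -> List[Dict[str, Any]]:
    """Return a normalized copy of `active`, rejecting unknown adapter names."""
    known = set(registered)
    checked = []
    for entry in active:
        name = entry["name"]
        if name not in known:
            raise KeyError(f"LoRA {name!r} is not loaded; call load_lora() first")
        # Assumed default: apply at full strength when no scale is given.
        checked.append({"name": name, "scale": float(entry.get("scale", 1.0))})
    return checked


# `registered` would normally come from llm.list_loras().
print(validate_active_loras([{"name": "coding"}], registered=["coding", "story"]))
```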
113+
---
114+
115+
## Best Practices & Common Patterns
116+
117+
1. **Context Shifting & Prompt Caching**:
118+
119+
By default, when calling `.generate()` or `.create_completion(reset=True)`, the engine checks for the longest matching prefix in the existing KV cache. To maximize speed, keep system prompts static and only append new dialogue to avoid re-evaluating the entire history. If the context limit is reached during `eval`, the model will automatically trigger a Context Shift (discarding older tokens while attempting to keep `n_keep` tokens, usually the system prompt).
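The prefix check described here can be illustrated on plain token lists. This is a minimal sketch of longest-common-prefix matching, not the engine's actual code; the function name and the token values are made up for the example.

```python
from typing import Sequence


def longest_common_prefix(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Count how many leading tokens of `prompt` are already in the cache."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# The first four tokens (e.g. a static system prompt) match the cache.
cached = [1, 10, 11, 12, 20, 21]
prompt = [1, 10, 11, 12, 30, 31, 32]
reused = longest_common_prefix(cached, prompt)
print(reused, prompt[reused:])  # 4 [30, 31, 32]
```

Only the tokens after the shared prefix need to be evaluated again, which is why a changing system prompt defeats the cache.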
120+
121+
2. **Basic Chat with JSON Mode**:
122+
Forces the model to output valid JSON by using the `response_format` parameter.
123+
```python
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf", n_gpu_layers=-1)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Extract name and age from: John is 30."}],
    response_format={"type": "json_object"},
    temperature=0.0
)
print(response["choices"][0]["message"]["content"])
```
135+
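The message content is still a string, so it needs to be parsed before use. The sketch below shows a defensive parse with the standard `json` module; the helper name `parse_json_reply` is hypothetical, and the fallback to `None` covers cases such as a reply cut off by the token limit. A hand-written response dictionary stands in for a real model call.

```python
import json
from typing import Any, Dict, Optional


def parse_json_reply(response: Dict[str, Any]) -> Optional[Any]:
    """Parse the first choice's content as JSON, or return None on failure."""
    content = response["choices"][0]["message"]["content"]
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return None


# Fake response with the same shape as the one printed above.
fake_response = {"choices": [{"message": {"content": '{"name": "John", "age": 30}'}}]}
data = parse_json_reply(fake_response)
print(data["name"], data["age"])  # John 30
```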
136+
3. **Speculative Decoding**:
137+
138+
Accelerates generation by using a small "draft" model to predict tokens, which the larger model then validates in parallel.
139+
```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel

draft = LlamaDraftModel.from_model(Llama(model_path="tiny_draft.gguf", n_gpu_layers=-1))
main_llm = Llama(model_path="large_model.gguf", n_gpu_layers=-1, draft_model=draft)

for chunk in main_llm.create_completion("Explain quantum physics", stream=True):
    print(chunk["choices"][0]["text"], end="")
```
149+
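The validation step can be sketched independently of any model. The function below illustrates greedy draft verification: draft tokens are accepted while they match what the large model would have chosen, and the first disagreement is replaced by the large model's token. It is a conceptual sketch under the assumption of greedy sampling, not the library's implementation, and `verify_draft` is a hypothetical name.

```python
from typing import List


def verify_draft(draft: List[int], target: List[int]) -> List[int]:
    """Return the tokens to keep after one speculative step.

    `target` holds the large model's choice at each draft position plus one
    extra token, so it must be exactly one element longer than `draft`.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain len(draft) + 1 tokens")
    n_accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n_accepted += 1
    # Accepted draft tokens, then the large model's own next token.
    return target[: n_accepted + 1]


print(verify_draft([5, 6, 7], [5, 6, 8, 9]))  # [5, 6, 8]
print(verify_draft([5, 6, 7], [5, 6, 7, 9]))  # [5, 6, 7, 9]
```

Each step therefore yields at least one token, and up to `len(draft) + 1` when the draft is fully accepted.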
150+
4. **Dynamic LoRA Routing**:
151+
152+
You can load multiple LoRAs using `load_lora()` at startup. Then, pass the `active_loras` parameter to `.generate()`, `.create_completion()`, or `.create_chat_completion()` to dynamically apply them to specific queries without reloading the base model.
153+
154+
Multi-LoRA dynamic switching example: load several adapters once, then select one per request.
157+
```python
from llama_cpp import Llama

llm = Llama(model_path="base_model.gguf")
llm.load_lora("coding", "codellama_adapter.gguf")
llm.load_lora("story", "storywriter_adapter.gguf")
llm.load_lora("sql_expert", "adapters/sql_lora.gguf")

# Use coding adapter
llm.create_completion("def sort:", active_loras=[{"name": "coding", "scale": 1.0}])

# Use story adapter
llm.create_completion("Once upon a time", active_loras=[{"name": "story", "scale": 0.9}])

# Use sql adapter
llm.create_completion("SELECT *", active_loras=[{"name": "sql_expert", "scale": 0.8}])
```
172+
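In an application, the choice of adapter is often driven by a task label instead of being hard-coded at each call site. The routing table below is a hypothetical sketch that reuses the adapter names and scales from the example above; the task labels and the `completion_kwargs` helper are made up, and omitting `active_loras` for unknown tasks (to fall back to the base model) is an assumption.

```python
from typing import Any, Dict, List

# Task label -> adapters to activate (names and scales from the example above).
ROUTES: Dict[str, List[Dict[str, Any]]] = {
    "code": [{"name": "coding", "scale": 1.0}],
    "story": [{"name": "story", "scale": 0.9}],
    "sql": [{"name": "sql_expert", "scale": 0.8}],
}


def completion_kwargs(task: str) -> Dict[str, Any]:
    """Build extra keyword arguments for a completion call."""
    loras = ROUTES.get(task)
    # Unknown task: pass nothing extra so the base model is used as-is.
    return {"active_loras": loras} if loras else {}


print(completion_kwargs("sql"))
print(completion_kwargs("chitchat"))
```

A call would then look like `llm.create_completion(prompt, **completion_kwargs(task))`.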
173+
5. **Hybrid & Recurrent Architectures**:
174+
175+
The class natively detects Hybrid/Recurrent models (such as LFM2VL/LFM2.5VL, Qwen3.5/3.6, Mamba, or specialized SWA models like Gemma3/4) and automatically enables the `HybridCheckpointCache`. This creates periodic save-states during large context pre-filling, allowing the model to roll back seamlessly if a generation is rejected (e.g., speculative decoding mismatches) without corrupting the recurrent state.
176+
177+
* Tip: If you use a hybrid multimodal model to build ComfyUI nodes or run single-turn API wrappers, and you do not need multi-turn state rollbacks, initialize your `Llama` instance with `ctx_checkpoints=0`:
178+
179+
```python
from llama_cpp import Llama
# MTMDChatHandler must also be imported; its module path is not shown in this document.

llm = Llama(
    model_path="./Qwen3.5-VL-9B.gguf",
    chat_handler=MTMDChatHandler(clip_model_path="./mmproj.gguf"),
    n_ctx=4096,
    ctx_checkpoints=0,  # set to 0 to enable the zero-latency fast path
)
```
187+
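To see how `checkpoint_interval` and `ctx_checkpoints` might interact during a long pre-fill, the sketch below computes which token positions would hold a save-state. The semantics are assumptions for illustration only: one checkpoint at every multiple of the interval, with the oldest ones dropped once the limit is reached. The actual `HybridCheckpointCache` may place and evict checkpoints differently.

```python
from typing import List


def checkpoint_positions(n_tokens: int,
                         checkpoint_interval: int = 4096,
                         ctx_checkpoints: int = 32) -> List[int]:
    """Token positions that would hold a checkpoint under the assumed policy."""
    if ctx_checkpoints <= 0 or checkpoint_interval <= 0:
        return []  # checkpointing disabled
    positions = list(range(checkpoint_interval, n_tokens + 1, checkpoint_interval))
    return positions[-ctx_checkpoints:]  # keep only the most recent ones


print(checkpoint_positions(10000))                     # [4096, 8192]
print(checkpoint_positions(10000, ctx_checkpoints=1))  # [8192]
print(checkpoint_positions(10000, ctx_checkpoints=0))  # []
```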
188+
---
189+
190+
## Deprecated / Changed APIs
191+
192+
> ⚠️ **Warning:** The internal embedding methods on the `Llama` class are deprecated and will be removed.
193+
194+
* `embed()`: **Deprecated.**
* `create_embedding()`: **Deprecated.**
196+
197+
**Migration Note:** Do not use `Llama(..., embeddings=True)` combined with `model.create_embedding(...)`. Instead, use the dedicated `LlamaEmbedding` class, which offers optimized batching and reranking support.
198+
*See: [[LlamaEmbedding]]*
199+
200+
---
201+
202+
## Related Links
203+
* [[LlamaEmbedding]] - Dedicated class for text embeddings and reranking.
204+
* [[ChatHandlers]] - Customizing `LlamaChatCompletionHandler` for function calling and vision/omni models (e.g., `[[Gemma4ChatHandler]]`, `[[Qwen35ChatHandler]]`).
205+
* [[LlamaCache]] - Implementing disk or RAM-based prompt caching (LlamaRAMCache, **TrieCache**, **HybridCheckpointCache**).
