and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
## [0.3.34] Dynamic LoRA Routing, Control Vectors, and Assistant Prefill
- **feat(chat_format): added `assistant_prefill` to seamlessly continue responses**
  - This introduces the `assistant_prefill` parameter to the chat completion API, satisfying the highly requested ability to continue interrupted or partially generated assistant messages.
  - Resolves #97 (Chat completion from unfinished response)
  - Usage: set `assistant_prefill=True` in `create_chat_completion` when the final item in your `messages` list is a partial `assistant` response. The engine will use it as a prompt base and continue generating seamlessly.
- docs(readme): add documentation for the Assistant Prefill feature
  - Also updated the `huggingface_hub` installation instructions for accuracy.
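
The changelog does not show the prompt plumbing behind assistant prefill, so here is a minimal, hypothetical sketch of the idea: when the last message is a partial assistant turn, that turn is left "open" (no end-of-turn marker), so the model continues it instead of starting a fresh reply. The `format_chat` helper and the `<|role|>` / `<|end|>` markers are invented for illustration and are not the library's real chat format.

```python
# Hypothetical sketch of the assistant-prefill idea. The format_chat helper and
# the <|role|> / <|end|> markers are illustrative only, not llama-cpp-python API.

def format_chat(messages, assistant_prefill=False):
    prefilling = (
        assistant_prefill
        and bool(messages)
        and messages[-1]["role"] == "assistant"
    )
    parts = []
    for i, msg in enumerate(messages):
        if prefilling and i == len(messages) - 1:
            # Leave the final assistant turn open; generation resumes here.
            parts.append(f"<|assistant|>\n{msg['content']}")
        else:
            parts.append(f"<|{msg['role']}|>\n{msg['content']}<|end|>\n")
    if not prefilling:
        parts.append("<|assistant|>\n")  # open a brand-new assistant turn
    return "".join(parts)


prompt = format_chat(
    [
        {"role": "user", "content": "Write a haiku."},
        {"role": "assistant", "content": "Crisp leaves drift"},
    ],
    assistant_prefill=True,
)
# The prompt now ends with the partial assistant text, ready to be continued.
```

With `assistant_prefill=False` (the default), the same messages would instead end with a fresh, empty assistant turn.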
- **feat(internals): implement dynamic LoRA routing and Control Vector support**
  - This overhauls the adapter management architecture in `_internals.py` to support **dynamic, per-request LoRA routing and Control Vector (CVec) injection** with strict C++ memory safety.
  - Key changes:
    - Secure memory management: introduced the `LlamaLoraAdapter` wrapper class to securely handle the lifecycle of `llama_adapter_lora_p` pointers, preventing VRAM leaks. Also added support for extracting ALoRA invocation tokens.
    - Model-level registry: added `_lora_registry` to `LlamaModel` with robust `load_lora`, `unload_lora`, and `unload_all_loras` methods to preload adapters into VRAM. Cleanup is integrated into the model's `ExitStack` and `close()` methods for deterministic memory release.
    - Context-level dynamic routing: implemented `apply_loras` and `clear_loras` in `LlamaContext` to dynamically swap compute-graph weights using contiguous C arrays, enabling low-latency multi-tenant LoRA switching.
    - Control Vector integration: added `apply_cvec` and `clear_cvec` to `LlamaContext` for representation engineering. Includes strict memory-layout validation (enforcing buffer zero-padding up to `n_embd * il_end`) to prevent silent write failures in the GGML backend.
    - Observability & docs: added verbose logging for adapter/CVec application and expanded docstrings for context utility methods (e.g., threading, causal attention, warmup).
- Update README.md for Dynamic LoRA Routing & Control Vectors
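
As an illustration of the control-vector layout rule described above, here is a standalone sketch of the kind of validation involved: the flat float buffer handed to the backend must cover `n_embd * il_end` entries, zero-padded wherever the vector supplies no data, so no backend write lands out of bounds. `build_cvec_buffer` is an invented helper, not the library's actual code.

```python
# Illustrative sketch (not the library's actual code) of zero-padding a
# control-vector buffer out to n_embd * il_end entries before handing it
# to native code. ctypes arrays are zero-initialized, so unwritten tail
# entries stay 0.0, which is exactly the required padding.
import ctypes

def build_cvec_buffer(values, n_embd, il_end):
    required = n_embd * il_end
    if len(values) > required:
        raise ValueError("control vector longer than n_embd * il_end")
    buf = (ctypes.c_float * required)()  # zero-initialized
    for i, v in enumerate(values):
        buf[i] = v
    return buf

buf = build_cvec_buffer([0.5, -0.25], n_embd=4, il_end=2)
# buf holds 8 floats: the two supplied values followed by six zeros.
```

A buffer sized only to the layers the vector actually covers would pass a length check for those layers but let the backend read or write past its end for the rest, which is the "silent write failure" the validation guards against.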
- fix(types): correct the `llama_adapter_get_alora_invocation_tokens` ctypes signature and use a pointer for `llama_token`
- fix(types): correct the `llama_set_adapters_lora` ctypes signature and use a pointer for scales
34
+
- change scale: float to float* (POINTER(c_float))
35
+
- make adapters and scales optional arrays to match C API
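
The exact binding signature lives in the fork's ctypes layer; as a hedged illustration of the "contiguous pointer array plus `float*` scales" shape this fix implies, here is a standalone sketch. `marshal_loras` is an invented helper, and the NULL `c_void_p` handles stand in for real `llama_adapter_lora_p` values.

```python
# Sketch (invented helper, not library API) of marshalling per-request LoRA
# adapters and their scales as the contiguous C arrays a signature like
# llama_set_adapters_lora(ctx, adapters, scales, n) would expect.
import ctypes

def marshal_loras(adapter_ptrs, scales):
    if len(adapter_ptrs) != len(scales):
        raise ValueError("each adapter needs exactly one scale")
    n = len(adapter_ptrs)
    # Contiguous arrays: one of adapter pointers, one of c_float scales.
    adapters_arr = (ctypes.c_void_p * n)(*adapter_ptrs)
    scales_arr = (ctypes.c_float * n)(*scales)
    return adapters_arr, scales_arr, n

# Dummy NULL handles; in the library these would come from loaded adapters.
adapters, scales, n = marshal_loras([None, None], [1.0, 0.5])
```

Passing a `(c_float * n)` array where the C side expects `float*` is the standard ctypes idiom, which is why the `scale: float` scalar in the old signature had to become a pointer.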
- Added explanatory comments detailing why `n_tokens` is used instead of `chunk_n_pos` for M-RoPE models (to prevent the system from skipping evaluation).
- Added verbose logging for hybrid cache clearance scenarios (when checkpoints are missing, restore fails, or `max_checkpoints` is 0).
- feat(core): add verbose debug logging to `longest_token_prefix` fast paths
  - Added an optional `verbose` parameter to `Llama.longest_token_prefix` to explicitly log early-exit conditions. This gives visibility into cache-miss behavior during debugging by reporting the specific reason for a fast exit (e.g., empty sequence vs. mismatched first token) along with the offending sequence lengths or token values.
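
A minimal sketch of assumed behavior (not the fork's actual implementation) shows the shape of such a helper and its verbose fast paths: each early exit reports why it fired and the offending values before returning 0.

```python
# Sketch of a longest-token-prefix helper with verbose fast-path logging,
# as described above. Assumed behavior, not the actual implementation.

def longest_token_prefix(a, b, verbose=False):
    # Fast path 1: either sequence is empty, so no prefix can match.
    if not a or not b:
        if verbose:
            print(f"fast exit: empty sequence (len(a)={len(a)}, len(b)={len(b)})")
        return 0
    # Fast path 2: first tokens differ, so the common prefix is empty.
    if a[0] != b[0]:
        if verbose:
            print(f"fast exit: first tokens differ ({a[0]} != {b[0]})")
        return 0
    # General case: count matching tokens from the start.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

longest_token_prefix([1, 2, 3, 9], [1, 2, 3, 4])  # common prefix length 3
```

During prompt-cache lookups, a result of 0 from either fast path is a full cache miss, and the verbose reason distinguishes "nothing cached" from "cache starts with different tokens".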
- Update MIT license copyright to collective authorship (2023-2026)
  - Changed the single-author copyright to `The llama-cpp-python authors` and applied standard multi-line formatting for better readability.
  - Every contributor who participates and makes an effort makes the project more reliable, efficient, and user-friendly, and they all deserve to be remembered.
  - You are welcome to join us in promoting the project and enriching the open-source community.
- Update CMakeLists.txt
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f](https://github.com/ggml-org/llama.cpp/commit/0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f)
- feat: Sync llama.cpp llama/mtmd API Binding 20260325
For more information, see: https://github.com/JamePeng/llama-cpp-python/compare/6bbc8d2306319c67c9f7d0d2d0576496f3587a3c...a8cec004466493db57d3cbc043cdc897b2b37f9b