
Commit a184583

Bump version to 0.3.34
Signed-off-by: JamePeng <jame_peng@sina.com>
1 parent a8cec00 commit a184583

2 files changed

Lines changed: 57 additions & 1 deletion


CHANGELOG.md

Lines changed: 56 additions & 0 deletions
@@ -7,6 +7,62 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.3.34] Dynamic LoRA Routing, Control Vectors, and Assistant Prefill
- **feat(chat_format): add assistant_prefill to seamlessly continue responses**
  - Introduces the `assistant_prefill` parameter to the chat completion API, addressing the frequently requested ability to continue an interrupted or partially generated assistant message.
  - Resolves #97 (Chat completion from unfinished response)
  - Usage: set `assistant_prefill=True` in `create_chat_completion` when the final item in your `messages` list is a partial `assistant` response. The engine will use it as the prompt base and continue generating seamlessly (see the sketch below).
- docs(readme): add documentation for the Assistant Prefill feature
  - Also slightly updated the `huggingface_hub` installation instructions for accuracy.
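
A minimal usage sketch of the new parameter; the model path is hypothetical, and any chat-capable GGUF model would do:

```python
from llama_cpp import Llama

# Hypothetical model path; any chat-capable GGUF model works.
llm = Llama(model_path="./models/model.gguf", n_ctx=4096)

messages = [
    {"role": "user", "content": "Write a haiku about the sea."},
    # A partial assistant response, e.g. saved from an interrupted stream.
    {"role": "assistant", "content": "Salt wind on the cliffs,"},
]

# With assistant_prefill=True the trailing assistant message becomes the
# prompt base, and generation continues from it instead of starting over.
result = llm.create_chat_completion(messages=messages, assistant_prefill=True)
print(result["choices"][0]["message"]["content"])
```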

- **feat(internals): implement dynamic LoRA routing and Control Vector support**
  - Overhauls the adapter-management architecture in `_internals.py` to support **dynamic, per-request LoRA routing and Control Vector (CVec) injection** with strict C++ memory safety.
  - Key changes:
    - Secure memory management: introduced the `LlamaLoraAdapter` wrapper class to safely manage the lifecycle of `llama_adapter_lora_p` pointers, preventing VRAM leaks. Also added support for extracting ALoRA invocation tokens.
    - Model-level registry: added `_lora_registry` to `LlamaModel`, with robust methods (`load_lora`, `unload_lora`, `unload_all_loras`) to preload adapters into VRAM. Cleanup is integrated into the model's `ExitStack` and `close()` methods for deterministic memory release.
    - Context-level dynamic routing: implemented `apply_loras` and `clear_loras` in `LlamaContext` to dynamically swap compute-graph weights using contiguous C arrays, enabling multi-tenant LoRA switching without reloading adapters (see the sketch below).
    - Control Vector integration: added `apply_cvec` and `clear_cvec` to `LlamaContext` for representation engineering, with strict C++ memory-layout validation (buffers must be zero-padded up to `n_embd * il_end`) to prevent silent write failures in the GGML backend.
    - Observability & docs: added verbose logging for adapter/CVec application and expanded the docstrings of context utility methods (e.g., threading, causal attention, warmup).
  - Update README.md for Dynamic LoRA Routing & Control Vectors
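
A rough sketch of the per-request routing flow using the internal classes named above; the constructor arguments and exact method signatures here are assumptions, not the verified API:

```python
from llama_cpp._internals import LlamaModel, LlamaContext

# Construction simplified; the real constructors also take params objects.
model = LlamaModel(path_model="./models/base.gguf")  # hypothetical path
ctx = LlamaContext(model=model)

# Preload adapters into VRAM once via the model-level registry.
adapter_a = model.load_lora("./loras/tenant_a.gguf")  # assumed signature
adapter_b = model.load_lora("./loras/tenant_b.gguf")

# Route one request through adapter A, the next through adapter B; the
# context-level calls only swap compute-graph weights, so nothing reloads.
ctx.apply_loras([adapter_a], [1.0])  # assumed (adapters, scales) form
# ... decode request 1 ...
ctx.clear_loras()

ctx.apply_loras([adapter_b], [0.8])
# ... decode request 2 ...
ctx.clear_loras()

# Control vectors follow the same pattern; per the changelog, apply_cvec
# validates that the buffer is zero-padded up to n_embd * il_end.
# ctx.apply_cvec(cvec_buffer, il_start, il_end)  # signature assumed

# Deterministic VRAM release at shutdown.
model.unload_all_loras()
model.close()
```

The design point worth noting is the split between the model-level registry (which owns VRAM residency) and the context-level routing calls (which only decide what the compute graph sees per request).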

- fix(types): correct the `llama_adapter_get_alora_invocation_tokens` ctypes signature and use a pointer type for `llama_token`

- fix(types): correct the `llama_set_adapters_lora` ctypes signature and use a pointer type for the scales
  - changed `scale: float` to `float*` (`POINTER(c_float)`)
  - made `adapters` and `scales` optional arrays to match the C API (see the sketch below)
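
A sketch of what the corrected declaration plausibly looks like; the exact C prototype is an assumption inferred from the bullets above (nullable parallel arrays of adapter handles and scales, plus a count):

```python
import ctypes

# Opaque handle standing in for the binding's llama_adapter_lora_p type.
llama_adapter_lora_p = ctypes.c_void_p

# Assumed prototype:
#   int32_t llama_set_adapters_lora(llama_context * ctx,
#                                   llama_adapter_lora ** adapters,  // nullable
#                                   const float * scales,            // nullable
#                                   size_t n_adapters);
argtypes = [
    ctypes.c_void_p,                       # llama_context *
    ctypes.POINTER(llama_adapter_lora_p),  # adapters: optional array
    ctypes.POINTER(ctypes.c_float),        # scales: optional array (was c_float)
    ctypes.c_size_t,                       # n_adapters
]

# Passing a contiguous scales array from Python:
scales = (ctypes.c_float * 2)(1.0, 0.8)  # accepted where POINTER(c_float) is expected
```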

- refactor: remove legacy static LoRA initialization
  - Removed `lora_base`, `lora_path`, and `lora_scale` from the `Llama` init parameters and state.
  - Dropped the outdated `llama_adapter_lora_init` and `llama_set_adapters_lora` calls from the constructor.
  - Restored the default `use_mmap` behavior (no longer forced to False when a LoRA is present).
  - This removes global context pollution and paves the way for the new dynamic, per-request LoRA routing architecture.

- chore: enhance hybrid cache logging and document M-RoPE token usage
  - Added explanatory comments detailing why `n_tokens` is used instead of `chunk_n_pos` for M-RoPE models (to prevent the system from skipping evaluation).
  - Added verbose logging for hybrid-cache clearance scenarios (when checkpoints are missing, a restore fails, or `max_checkpoints` is 0).

- feat(core): add verbose debug logging to `longest_token_prefix` fast paths
  - Added an optional `verbose` parameter to `Llama.longest_token_prefix` that explicitly logs early-exit conditions. This gives crucial visibility into cache-miss behavior during debugging by reporting the specific reason for a fast exit (e.g., an empty sequence vs. a mismatched first token) along with the offending sequence lengths or token values. A sketch of the fast paths follows.
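
A minimal sketch of the fast paths being instrumented, assuming the method compares a cached token sequence against an incoming prompt (the real implementation lives on `Llama` and may differ in detail):

```python
def longest_token_prefix(a, b, verbose=False):
    # Fast path 1: either sequence is empty.
    if not a or not b:
        if verbose:
            print(f"longest_token_prefix: empty sequence (len(a)={len(a)}, len(b)={len(b)})")
        return 0
    # Fast path 2: the very first tokens already differ.
    if a[0] != b[0]:
        if verbose:
            print(f"longest_token_prefix: first tokens differ ({a[0]} != {b[0]})")
        return 0
    # Slow path: walk both sequences until they diverge.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```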

- Update MIT license copyright to collective authorship (2023-2026)
  - Changed the single-author copyright to `The llama-cpp-python authors` and applied standard multi-line formatting for better readability.
  - Every contributor who participates and makes an effort makes the project more reliable, efficient, and user-friendly, and they all deserve to be remembered.
  - Everyone is welcome to join us in promoting the project and enriching the open-source community.

- Update CMakeLists.txt

- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f](https://github.com/ggml-org/llama.cpp/commit/0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f)

- feat: Sync llama.cpp llama/mtmd API Binding 20260325

For more information, see: https://github.com/JamePeng/llama-cpp-python/compare/6bbc8d2306319c67c9f7d0d2d0576496f3587a3c...a8cec004466493db57d3cbc043cdc897b2b37f9b
## [0.3.33] Fixing Multimodal Image Freezes, Stabilizing Logits, and Optimized Legacy Cache Logic

- perf(mtmd): optimize media_id masking with bitwise AND

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *

-__version__ = "0.3.33"
+__version__ = "0.3.34"
