Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

[0.3.35] Gemma 4 series & LFM 2.5-VL Support, OpenAI OpenAPI Alignment and Logging Architecture Migration

fix: expand stop sequences for Gemma4ChatHandler
- Add GEMMA4_EOS_TOKEN and GEMMA4_STR_TOKEN to the generation stop criteria.
- Align the stopping logic with the model's generation_config.json definitions.
- Prevent potential over-generation by ensuring the model halts correctly at standard EOS or when initiating a tool response.
feat(types): align with latest OpenAI OpenAPI spec (audio, structured outputs)
- Update llama_types.py OpenAI OpenAPI Link
- Add developer role.
- Replace Anyscale-specific JSON schema with official OpenAI json_schema response format for Structured Outputs.
- Add input_audio and file types to request message content parts.
- Add audio, refusal, and annotations (e.g., URL citations) fields to response messages.
- Add content_filter to finish reasons and strictly define global ChatCompletionRole.
docs: clarify enable_thinking compatibility for Gemma 4 models
- Update Gemma4ChatHandler class docstring and __init__ args documentation.
- Specify that the enable_thinking toggle is exclusively supported by Gemma4 31B and 26BA4B variants.
- Explicitly note that E2B and E4B models do not currently support this feature to prevent configuration errors.
feat(chat_format): Implemented Gemma4ChatHandler, add Gemma 4 chat handler with multimodal and tool support
- Implement Gemma4ChatHandler with Gemma 4 specific tokens (<|turn>, <|channel>, etc.).
- Add complex Jinja2 template for advanced nested tool/function schema formatting.
- Support multimodal content injection for image_url, audio_url, and input_audio (including base64 reconstruction).
- Integrate reasoning/thinking controls via enable_thinking toggle and <|channel>thought formatting.
- Configure <turn|> as the primary stop sequence for generation boundaries.
feat(chat_format) Implemented LFM25VLChatHandler for LFM2.5-VL (by @alcoftTAO)
fix Qwen3.5 chat template typos(reported by @abdullah-cod9)
refactor(logger): migrate from llama_log_callback to ggml_log_callback
- Remove the deprecated llama_log_callback typedef from llama_cpp.py.
- Update _logger.py to use ggml_log_callback from _ggml, aligning with the upstream GGML logging architecture.
- Rename the callback references across the codebase, including the MTMD context initialization in llama_chat_format.py.
feat(ggml): add support for ggml-base library and new function bindings
- Load the new ggml-base shared library alongside ggml.
- Add ctypes bindings for ggml_log_get, ggml_log_set, and ggml_set_zero using the ggml_base_function decorator.
Update README.md
feat: Update llama.cpp to ggml-org/llama.cpp/commit/58190cc84d846d8575ba26e8486bc29d9fd8ad55
feat: Sync llama.cpp llama/mtmd API Binding 20260402

More information see: https://github.com/JamePeng/llama-cpp-python/compare/a184583e908cc138fd15794986b3581521fb9b0c...232092e32b3563159a86aacb168da06c4937192b

[0.3.34] Dynamic LoRA Routing, Control Vectors, and Assistant Prefill

feat(chat_format): added assistant_prefill to seamlessly continue responses
- This commit introduces the assistant_prefill parameter to the chat completion API, satisfying the highly requested need to continue interrupted or partially generated assistant messages.
- Resolves #97 (Chat completion from unfinished response)
- Usage:
  - Simply set assistant_prefill=True in create_chat_completion when the final item in your messages list is a partial assistant response. The engine will use it as a prompt base and continue generating seamlessly.
- docs(readme): add documentation for Assistant Prefill features
  - Also slightly updated the huggingface_hub installation instructions for accuracy.
feat(internals): implement dynamic LoRA routing and Control Vector support
- This commit overhauls the adapter management architecture in _internals.py to support dynamic, per-request LoRA routing and Control Vector (CVec) injection with strict C++ memory safety.
- Key changes:
  - Secure Memory Management: Introduced the LlamaLoraAdapter wrapper class to securely handle the lifecycle of llama_adapter_lora_p pointers, preventing VRAM leaks. Also added support for extracting ALoRA invocation tokens.
  - Model-Level Registry: Added _lora_registry to LlamaModel with robust methods (load_lora, unload_lora, unload_all_loras) to preload adapters into VRAM. Integrated cleanup into the model's ExitStack and close() methods for deterministic memory release.
  - Context-Level Dynamic Routing: Implemented apply_loras and clear_loras in LlamaContext to dynamically swap compute graph weights using contiguous C arrays, enabling zero-delay multi-tenant LoRA switching.
  - Control Vector Integration: Added apply_cvec and clear_cvec to LlamaContext for representation engineering. Includes strict C++ memory layout validation (enforcing buffer zero-padding up to n_embd * il_end) to prevent silent write failures in the GGML backend.
  - Observability & Docs: Added verbose logging for adapter/CVec application and expanded docstrings for context utility methods (e.g., threading, causal attention, warmup).
  - Update README.md for Dynamic LoRA Routing & Control Vectors
fix(types): correct llama_adapter_get_alora_invocation_tokens ctypes signature and use pointer for llama_token
fix(types): correct llama_set_adapters_lora LoRA adapter ctypes signature and use pointer for scales
- change scale: float to float* (POINTER(c_float))
- make adapters and scales optional arrays to match C API
refactor: remove legacy static LoRA initialization
- Removed lora_base, lora_path, and lora_scale from Llama init parameters and state.
- Dropped outdated llama_adapter_lora_init and llama_set_adapters_lora bindings in the constructor.
- Restored default use_mmap behavior (no longer forced to False when LoRA is present).
- This removes the global context pollution and paves the way for the new dynamic, per-request LoRA routing architecture.
chore: enhance hybrid cache logging and document M-RoPE token usage
- Added explanatory comments detailing why n_tokens is used instead of chunk_n_pos for M-RoPE models (to prevent the system from skipping evaluation).
- Added verbose logging for hybrid cache clearance scenarios (when checkpoints are missing, restore fails, or max_checkpoints is 0).
feat(core): add verbose debug logging to longest_token_prefix fast paths
- Added an optional verbose parameter to Llama.longest_token_prefix to explicitly log early-exit conditions. This provides crucial visibility into cache-miss behaviors during debugging by outputting the specific reason for a fast exit (e.g., empty sequence vs. mismatched first token) along with the offending sequence lengths or token values.
Update MIT license copyright to collective authorship (2023-2026)
- Change single-author copyright to The llama-cpp-python authors and apply standard multi-line formatting for better readability.
- Every contributor who participates and makes an effort makes the project more reliable, efficient, and user-friendly, and they all deserve to be remembered.
- Welcome to join us in promoting the project and enriching the open-source community.
Update CMakeLists.txt
feat: Update llama.cpp to ggml-org/llama.cpp/commit/0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f
feat: Sync llama.cpp llama/mtmd API Binding 20260325

More information see: https://github.com/JamePeng/llama-cpp-python/compare/6bbc8d2306319c67c9f7d0d2d0576496f3587a3c...a8cec004466493db57d3cbc043cdc897b2b37f9b

[0.3.33] Fixing Multimodal Image Freezes, Stabilizing Logits, and Optimized Legacy Cache Logic

perf(mtmd): optimize media_id masking with bitwise AND
- Replaced the modulo operation (% (2**24)) with a bitwise AND mask (& 0xFFFFFF) when calculating the deterministic media_id from the CRC32 hash. This is a micro-optimization that leverages faster native CPU bitwise instructions instead of division, resulting in more idiomatic and performant low-level bit masking.
- When processing 1-2 images, the difference between % and & is only a few nanoseconds, imperceptible to the user. In future video processing, you might need to frantically calculate IDs within a for loop for 100 or even 300 frames of data. In this case, the extremely low CPU overhead of the bitwise operation & 0xFFFFFF ensures that the main thread will not experience any computational blockage at the Python layer when building a virtual ledger of tens of thousands of characters.
fix(mtmd): prevent multimodal image freeze by injecting deterministic media IDs
- Fixed a critical "image freeze" bug (report by @KLL535) where the model would continuously reuse the first cached image regardless of new inputs.
- The issue occurred because the C++ self._mtmd_cpp.mtmd_input_chunk_get_id(chunk) parser returns an empty ID (b'') for placeholder tokens like <__media__>, causing all media to fallback to the same magic number (-314159). This resulted in false-positive KV cache prefix matches.
- Replaced the C++ chunk ID extraction with a Python-side media_item_cursor that generates a deterministic 32-bit negative ID using zlib.crc32(real_media_url).
- This ensures longest_token_prefix correctly identifies and reuses identical images while instantly breaking the cache match when the image content changes.
- Chat Structure: [System Prompt] + [Image] + [Question]
- The computational cost should be significantly reduced for the same image but different questions.
fix(Llama.generate): add explicit fallback context reset and expand Llama.generate docstrings
- Added a fallback if reset: block in Llama.generate to ensure the KV cache and hybrid cache manager are explicitly cleared when reset=True is passed and no prefix match is found. This prevents potential context poisoning from previous runs.
- Added comprehensive docstrings to the generate method for all newly integrated sampler parameters (e.g., XTC, Mirostat, DRY penalties etc.).
- Added explicit verbose logging for cache resetting, rollback events, and speculative decoding behaviors to improve debuggability.
docs(README.md): add sampling parameter guide and strategic project tips to README
- Sampling Documentation: Added a comprehensive guide for LlamaSamplingParams. It covers core, advanced (XTC, Dynatemp, Adaptive-P), entropy (Mirostat), and DRY repetition penalty configurations with a clean Python usage example.
- Project Tips: Added a new "Quick tips" section to explicitly communicate the semi-deprecated status of llama_cpp.server in favor of the upstream llama-server.
- Backend Recommendations: Added practical advice for AMD and Intel GPU users, officially recommending the Vulkan backend for cross-platform stability and faster updates.
fix(sampling): prevent memory drift and hallucinations in logits view
- Previously, the numpy view for logits_ptr in LlamaSamplingContext.sample was only initialized once. If the underlying C++ buffer was reallocated or shifted (e.g., due to dynamic batch sizes or KV cache shifts), the numpy array would point to stale memory, leading to severe model hallucinations or segfaults.
- Added explicit _logits_ptr_addr tracking to monitor the physical C memory address.
- The zero-copy numpy view (_logits_view) is now safely recreated on-the-fly whenever the backend memory address changes.
- Added proper initialization and garbage collection for the new tracker in __init__ and close.
feat(_ggml): extend ctypes bindings with more ggml constants, enums, and structs
fix(chat_handler): fix tools and function calling in MTMDChatHandler.(Issue reported by @alcoftTAO)
perf(cache): optimize LlamaDiskCache I/O and fix LRU behavior
- Delegated LRU and size limits to native diskcache SQLite engine, removing the slow manual eviction loop.
- Added an O(1) early exit in _find_longest_prefix_key to prevent unnecessary full-table disk scans.
- Fixed a destructive read bug by replacing .pop() with standard access to properly update LRU timestamps.
- Added fast-path empty checks to bypass disk queries entirely when the cache is empty.
perf(cache): upgrade LlamaRAMCache to O(1) eviction and set LlamaTrieCache as default Addressed severe performance bottlenecks in legacy RAM caching components:
- Refactored LlamaRAMCache to use an O(1) _current_size tracker instead of an O(N) dynamic sum. This eliminates massive CPU spikes and O(N^2) complexity during LRU eviction cycles.
- Added strict OOM safeguards to LlamaRAMCache: The current size is explicitly clamped to 0 during evictions, and hard-reset to 0 if the cache empties, preventing catastrophic capacity drift.
- Introduced early-exit O(1) short-circuits in __getitem__ and __contains__ to bypass expensive prefix searches when the cache is empty.
- Updated the LlamaCache backward-compatibility alias to point to the highly optimized LlamaTrieCache instead of the legacy LlamaRAMCache.
fix(core): disable swa_full for non-SWA models (sync llama.cpp upstream #20291)
- Fallback context_params.swa_full to False if _n_swa == 0 and emit a warning.
- Updated is_hybrid validation to use the resolved self.context_params.swa_full state.
fix(chat_format): fix namespace and variable shadowing of llama modules
- Changed imports to use llama_cpp_lib and llama_core to avoid namespace collisions.
- Fixed severe variable shadowing where the llama module was being overshadowed by the llama parameter in function signatures.
- Updated associated type hints and C-API bindings to use the new isolated aliases.
- Corrected LlamaGrammar type definitions to point to the llama_grammar module.
fix(cache): fix namespace shadowing to prevent AttributeError (Issue reported by @kantan-kanto)
- Renamed llama_cpp.llama import to llama_core and llama_cpp.llama_cpp to llama_cpp_lib to prevent namespace collision.
- Fixed AttributeError thrown when accessing llama_cpp.llama.Llama.longest_token_prefix.
- Updated all associated type hints and C-API bindings in cache classes to use the new isolated aliases.
feat(MTMDChatHandler): support audio inputs and fix interleaved media ordering
- Refactored CHAT_FORMAT to use a single loop for message.content, preserving the exact chronological order of interleaved text, images, and audio.
- Added template routing for audio_url.
- Added template routing for OpenAI's input_audio format, properly formatting it as a Data URI.
feat: Update llama.cpp to ggml-org/llama.cpp/commit/d23355afc319f598d0e588a2d16a4da82e14ff41
feat: Sync llama.cpp llama/mtmd API Binding 20260313

More information see: https://github.com/JamePeng/llama-cpp-python/compare/e7e1d48065ba53846f290cfe563c8c839a062ebe...b0f00a96d3803863dd7d479f0cb1305f76f741b3

[0.3.32] Hybrid/Multimodal Model Single-Turn Optimizations & Fix Sampling Seed

perf(hybrid): optimize multimodal single-turn and fix KV clear bug
- Added a 100% match "FAST PATH" in Llama.generate to bypass N-1 truncation for hybrid models when caching is disabled.
- Fixed a bug where failed rollbacks on disabled caches would wipe the KV cache, causing multimodal pseudo-token crashes.
- Updated MTMDChatHandler to suppress cache-related logs and anchoring logic when max_checkpoints <= 0.
perf(hybrid): prevent expensive array slicing when cache is disabled
- Added a max_checkpoints > 0 check to the finally block of the generation loop.
- Previously, even though the underlying C++ state extraction was bypassed, the Python layer was still executing self._input_ids[:self.n_tokens].tolist(). For long contexts, slicing and converting this massive array to a Python list caused unnecessary CPU overhead and garbage collection (GC) pressure. This intercept acts as a double-layer isolation, ensuring absolute zero memory allocation and zero overhead for hybrid models running in single-turn mode.
perf(hybrid): bypass N-1 evaluation split if max_checkpoints is 0
- Prevent fragmenting the prompt evaluation into len(tokens)-1 and 1 when hybrid caching is disabled.
- Allows the underlying C++ engine to process the entire prompt in a single, efficient batch for single-turn workflows.
perf(hybrid): eliminate PCIe I/O latency for single-turn workflows
- This commit introduces critical performance optimizations and log tracing improvements for HybridCheckpointCache in single-turn workflows (e.g., ComfyUI or single-turn conversation mode):
  - Now support 0 HybridCheckpointCache for single-turn conversation.(set the ctx_checkpoints=0 when llama init )
  - Added early-exit intercepts for max_checkpoints <= 0 in save_checkpoint and find_best_checkpoint. This prevents massive (e.g., 150MB+) synchronous VRAM-to-RAM state extractions over the PCIe bus when rollback capabilities are disabled, eliminating a ~3-second blocking delay at the end of generation.
  - Added a non-empty check in clear() to prevent log spam when the cache is already empty or disabled.
  - Standardized logging prefixes (e.g., HybridCheckpointCache(save_checkpoint)) for better observability.
  - Fixed a potential UnicodeEncodeError hazard in warning logs by replacing a non-standard arrow character with standard ASCII (->).
fix(sampling): pass seed to sampling context and remove global mutation
- Add seed parameter to generate and sample method signatures.
- Pass the resolved seed directly to LlamaSamplingParams to ensure the underlying C++ sampler uses it.
- Remove thread-unsafe self.set_seed() calls in _create_completion to prevent global state pollution during concurrent requests.
docs(issue-template): modernize bug report for efficiency
- Completely revamped the legacy bug report template to streamline troubleshooting. Added an anti-AI-spam policy, a detailed OS/Hardware matrix, forced verbose=True logging requirements with code examples, and new sections for model parameters and AI-assisted brainstorming.
feat: Update llama.cpp to ggml-org/llama.cpp/commit/b283f6d5b3d2d079019ae5ed3cbbdb4b3be03b25

[0.3.31] Omni-Modal Media Pipeline, Hybrid 1-Token Rollback and Enhanced Logging

refactor(mtmd): introduce omni-modal media pipeline with experimental audio support This commit significantly overhauls the media parsing and loading pipeline in MTMDChatHandler to gracefully handle both vision and audio inputs, establishing a true omni-modal architecture.

Key structural changes:
- Hardware Capability Sniffing: _init_mtmd_context now actively probes the C++ backend for ctx_v (vision) and ctx_a (audio) encoders, enabling proactive fail-fast validation before media processing.
- Unified Media Extraction: Replaced get_image_urls and split_text_on_image_urls with a robust _get_media_items method. This safely parses image_url, input_audio, and audio_url while strictly maintaining the chronological order of user prompts and enforcing OpenAI format specs.
- Media Dispatcher & Magic Bytes: Introduced a unified load_media dispatcher. Added a new _load_audio method and a rigorous detect_audio_format static method that accurately mimics llama.cpp's C++ magic bytes sniffing (RIFF/WAVE, ID3/MPEG, fLaC) to prevent fatal backend crashes.
- Concurrent Omni-Decoding: The ThreadPoolExecutor in _process_mtmd_prompt has been upgraded to concurrently fetch and decode both image and audio payloads into unified mtmd_bitmap structures.
- Note: Audio processing capabilities in the underlying llama.cpp engine are currently in an experimental stage.
fix(hybrid): implement N-1 checkpointing to support 1-token rollbacks
- Forces an N-1 state snapshot during prompt prefilling for hybrid models. This ensures the engine can safely perform a 1-token rollback to refresh logits upon 100% cache matches (e.g., changing seeds on identical prompts), preventing RNN state desyncs and empty outputs.
- Note: For the Comfyui plugin developer, I recommend performing a reset operation before inputting the prompt word. This way, the seed will be included as one of the factors in the initial complete recalculation.
fix(mtmd): remove OS-level log suppression to expose critical C++ errors
- Removed the suppress_stdout_stderr context manager around critical C++ backend calls (_init_mtmd_context, _create_bitmap_from_bytes, and close).
- Previously, when verbose=False, this OS-level file descriptor redirection was swallowing fatal C++ backend errors (e.g., stb_image decoding failures, corrupted .mmproj model weights, or CUDA Out-Of-Memory aborts), resulting in silent crashes that were impossible to debug. The framework now correctly relies on the native C-API llama_log_callback to route logs to Python gracefully, ensuring that critical decoding and hardware exceptions remain visible to the developer.
feat: Update llama.cpp to ggml-org/llama.cpp/commit/f5ddcd1696eca5069dc7915f4d4c03c9a709afea

[0.3.30] Milestone Release

I will update the release notes for version 0.3.30 in the discussion.

refactor(mtmd): redesign multimodal pipeline for concurrent I/O and hybrid state management This commit fundamentally restructures the MTMDChatHandler pipeline, decoupling the prefill and evaluation stages to resolve previous I/O bottlenecks and state-sync issues. The new architecture fully supports hybrid/recurrent multimodal models (e.g., Qwen3.5s, LFM2-VL) with robust context management.

Key structural advantages and changes:
- Concurrent Media Decoding: Implemented ThreadPoolExecutor in _process_mtmd_prompt with pre-allocated arrays, allowing thread-safe parallel image/audio decoding while strictly preserving the chronological order of user inputs, and can be used in the future to process large numbers of video frames.
- O(1) Prefix Matching ("Negative Reverse Vocabulary"): Replaced slow dictionary lookups with a deterministic hash-to-negative-integer mapping for media IDs. This isolates media tokens from the LLM's positive vocabulary space, enabling native, ultra-fast longest_token_prefix array comparisons in Python.
- Hybrid Model State Management: Replaced aggressive mid-turn saving with highly efficient "End-of-Turn" checkpointing. This ensures multi-image prompts consume only a single LRU slot while allowing precise rollback to the nearest valid state upon cache misses.
- Robust Context Shift (OOM Defense): The __call__ loop now preemptively calculates token boundaries for upcoming multimodal chunks, safely discarding the oldest unpinned tokens from both the physical KV cache and the Python virtual ledger to prevent backend crashes.
- Qwen3.5 Support CONFRIMED, waiting Qwen35ChatHandler PR merge
merge: Implemented Qwen35ChatHandler for Qwen3.5(by @alcoftTAO)
fix: Correct the mtmd vision check condition bug
refactor(chat_handler): extract MTMDChatHandler base class and Simplify the complexity of subsequent multimodal adaptation
- Extracted the core multimodal processing pipeline from Llava15ChatHandler into a generic MTMDChatHandler base class, separating pipeline logic from model-specific prompt formats.
- Updated all multimodal subclass handlers (e.g., Gemma3, Granite-Docling, PaddleOCR, Qwen2.5vl, Qwen3-vl, MiniCPM, GLM4.xV, LFM2-VL) to inherit from the new base class MTMDChatHandler.
- Implemented strict **kwargs validation in the baseconstructor to gracefully intercept and report unsupported parameters, significantly improving Developer Experience (DX).
- Introduced dynamic self.log_prefix (self.__class__.__name__) for accurate and consistent logging across all subclasses.
- Cleaned up redundant state-clearing, image-count logic and hardcoded print statements across subclass __call__ implementations.
- To avoid exceptions occurring when the close method is called due to initialization failure and the call to exit_stack.
feat: Update llama.cpp to ggml-org/llama.cpp/commit/2afcdb9777b1bac79fa4bfe284b9bf23085b0469
feat: Sync llama.cpp llama/mtmd API Binding 20260301

Many thanks to @yamikumo-DSD and @roj234 for providing detailed testing and valuable suggestions.

More information see: https://github.com/JamePeng/llama-cpp-python/compare/e4861df5fd44bb83ec2b9063ca3375759416aead...3f8f0f89a2b72ec2f9494fa5f14206591a5cde49

[0.3.29]

perf(eval): implement adaptive checkpoint intervals for hybrid models
- Dynamically scale checkpoint frequency during large prompt pre-filling (max 3 triggers per eval) to minimize I/O bottlenecks and stuttering.
- Add success validation to save_checkpoint, ensuring the last_ckpt_pos tracker is only updated when the state is successfully saved to disk/memory.
- Enhance verbose logging to track dynamic interval calculations and save failures.
fix(eval): make context shift mathematically robust and architecture-safe
- Added a memory_can_shift() pre-flight check to proactively intercept and abort gracefully on architectures that physically forbid shifting (e.g., multimodal mmproj where n_pos_per_embd > 1 or incompatible M-RoPE), preventing fatal GGML_ASSERT C++ crashes.
- Implemented dynamic mathematical bounds for n_keep and n_discard to guarantee that enough space is always freed, completely eliminating the edge-case where n_discard evaluates to 0 (causing a dead-loop when n_ctx is extremely small).
- Wrapped underlying C++ memory shift operations in a try-except block for defense-in-depth against unexpected backend failures.
- Expanded in-code documentation to clarify the arithmetic constraints and architectural limitations of the KV shift mechanism.
Add the memory_can_shift API to class LlamaContext
feat(eval): enable native context shift for hybrid/recurrent models
- Removed the RuntimeError that previously blocked context shifting for hybrid and SWA architectures.
- Delegated the shift logic to the underlying C++ backend, which automatically handles Attention KV removal and RNN pos shifting.
- Added dynamic verbose logging to clearly identify the model type (Transformer vs. Hybrid/Recurrent/SWA) during a context shift event.
fix(eval): prevent batch size from halving below 1 during KV slot exhaustion
- Added an explicit guard to break the dynamic batch downgrade loop when current_batch_size is exactly 1 and a Code 1 (No KV slot) is returned.
- Prevents the engine from executing an invalid 1 // 2 operation and generating the confusing "Halving batch size from 1 to 0" verbose log.
- Ensures the evaluation process fails fast and aborts gracefully when physical VRAM is completely depleted and no further fallback is mathematically possible.
feat(hybrid): add periodic checkpointing and adaptive batch handling
- Increase default ctx_checkpoints from 16 to 32
- Add new parameter checkpoint_interval (default: 4096) for hybrid model state snapshots
- Implement robust dynamic batch downgrade on KV cache exhaustion (status=1)
- Introduce periodic checkpoint saves during eval in hybrid mode
- Improve error handling and logging around context shifts and decoding failures
Optimization (decode): treat KV slot exhaustion (code 1) as a recoverable return value
- Updated the decode wrapper to explicitly return 1 instead of raising a RuntimeError when llama_decode indicates no KV slots are available.
- Aligned Python API behavior with the underlying C++ contract, treating code 1 as a recoverable signal rather than a fatal crash.
- Enabled upper-level caller loops (like eval) to gracefully handle VRAM fragmentation via dynamic batch halving without relying on clumsy try-except block string parsing.
- Retained strict RuntimeError exceptions for truly fatal backend failures (e.g., codes -1, -2, -3).
- Added comprehensive docstrings detailing return codes and exception scenarios.
feat(core): overhaul generate and eval for hybrid model support(Qwen3-next、Qwen3.5 etc.)
- Integrated HybridCheckpointCache into the generation loop to support state rollback for recurrent/hybrid architectures.
- Implemented Context Shift (sliding window) in eval to gracefully prevent OOM when exceeding n_ctx.
- Adapted eval to use the newly vectorized LlamaBatch.add_sequence API with dynamic logits_array configuration.
- Fixed the full prefix match bug by forcing a 1-token re-evaluation to refresh logits.
- Disabled speculative decoding for hybrid models to prevent irreversible state pollution.
- Wrapped the generation loop in a try...finally block to guarantee safe checkpoint saving.
refactor(LlamaBatch): replace set_batch with granular add_token + vectorized add_sequence
- Introduce high-performance add_token() for single-token append in generation loop
- Add flexible add_sequence() with per-token pos/seq_ids/logits arrays
- Remove old set_batch() that assumed single-seq + forced last logit
- Better support for multi-sequence and precise logit control

[0.3.28]

fix(HybridCheckpointCache): ValueError: bytes must be in range(0, 256)
feat: add HybridCheckpointCache detect support for recurrent/hybrid/SWA models
- Introduce ctx_checkpoints parameter (default 16)
- Detect recurrent / hybrid / n_swa > 0 models in init
- Automatically use HybridCheckpointCache when hybrid architecture is detected
- Properly close and clear HybridCheckpointCache in del
fix(cache): add safety guards to checkpoint restore and optimize API calls
- Replaced direct llama_cpp API calls with cached function pointers (self._get_size_ext, etc.) for better performance and consistency.
- Added sequence ID validation with verbose error logging to prevent cross-sequence contamination.
- Added strict state size validation before restoration to prevent buffer overflows and backend segmentation faults.
Remove redundant seq_id and add resource cleanup
- Removed seq_id from HybridCheckpointCache initialization to make it a stateless, global multi-sequence manager.
- Added close() and __del__() methods to safely release C++ context references and prevent memory leaks.
feat(cache): implement HybridCheckpointCache for hybrid/recurrent models Introduces a dedicated caching mechanism to support state rollback for models that cannot physically truncate their KV cache (e.g., Qwen3-Next, Qwen3.5, etc.).

Key additions and changes:
- Add HybridCheckpoint dataclass to store RNN state snapshots along with their binary data and metadata.
- Implement HybridCheckpointCache to manage sequence-specific states using the llama_state_seq_*_ext C++ APIs.
- Introduce _hash_prefix using SHA-256 to guarantee cryptographic certainty when matching prompt histories, preventing state corruption.
- Add save_checkpoint with a FIFO eviction policy to strictly bound memory usage based on max_checkpoints.
- Add restore_checkpoint to securely inject valid RNN states back into the C++ backend.
- Explicitly disable incompatible dictionary interfaces (__getitem__, __setitem__, __contains__) inherited from BaseLlamaCache.
- Refactor module imports (alphabetical sorting) and relocate LlamaDiskCache for better structural consistency.
Remove the hack code in llama_chat_format.py
LLama: Optimize KV cache management for multi-round conversations
- Implements prefix-matching logic to truncate stale "ghost" tokens in C++ KV cache
- Prevents attention misalignment and context poisoning during multi-turn interactions
- Reduces memory overhead by reusing matched prefixes efficiently

[0.3.27]

feat: add PaddleOCR-VL-1.5 multimodal chat handler PaddleOCRChatHandler
fix: resolve VRAM leak in multimodal models by explicitly closing mtmd context
- Remove the ExitStack closure in Llava15ChatHandler to break circular references preventing garbage collection of the vision context.
- Implement explicit close() and __del__() methods in the chat handler to safely free mtmd_ctx.
- Integrate chat_handler.close() into the main Llama.close() lifecycle and nullify related attributes for immediate memory reclamation.
fix: resolve reload memory leaks by breaking circular references in cleanup
- Remove closure-based ExitStack callbacks in LlamaModel, LlamaContext, and LlamaBatch to eliminate circular references.
- Move explicit C-level memory freeing (llama_*_free) directly into close() methods.
- Nullify additional attributes (tokenizer_, model_params, etc.) in Llama.close() to guarantee immediate garbage collection during unload.
- Add todo comment to LoRA process logic, the current LoRa loading logic is outdated and needs to be refactored.
feat: add memory breakdown and sampler performance timings APIs
feat: Search system paths for shared libraries(by @benniekiss)
Optimize longest_token_prefix to use zero-copy NumPy arrays and drop .tolist() overhead
Free _candidates and large numpy arrays during explicit close()
fix: resolve memory leaks in sampling context lifecycle
- Safely close temporary LlamaSamplingContext in sample() using a try-finally block.
- Explicitly release the previous _sampling_ctx in generate() before re-assignment to prevent orphaned pointers.
- Ensure _sampling_ctx is properly freed in Llama.close().
Fix custom sampler memory cleanup and improve lifecycle management
- Add explicit close() and __del__() to CustomSampler to safely free C resources and break Python reference cycles
- Ensure custom samplers are properly detached and freed in LlamaSampler.close()
- Add minor documentation comments for clarity
feat: Update llama.cpp to ggml-org/llama.cpp/commit/cacc371f99fb3b5b431d3fa89ac0c752bbd62a3b
feat: Sync llama.cpp llama/mtmd API Binding 20260223

More information see: https://github.com/JamePeng/llama-cpp-python/compare/3d0fd1b75ee564361a4babf21f88855225ba1fe0...1f8341ee74a2fea15a1008487cbecaccf918c755

[0.3.26]

perf(llama-cpp): optimize LlamaTokenDataArray memory operations
- Cache NumPy field views for 'id', 'logit', and 'p' to bypass expensive property lookups.
- Refactor copy_logits to use pre-generated ID sequences and cached views.
- Ensure logical consistency by resetting token IDs every sampling step to counter C++ reordering.
- Minimize redundant memory allocations during the inference loop.
feat: Add explicit memory cleanup for sampling contexts
- Implements close() and __del__ for LlamaTokenDataArray and expands LlamaSamplingContext cleanup.
- Ensures NumPy views and internal C-references are properly released to allow Python GC to reclaim memory.
optimize(memory): reduce scores buffer size and optimize state saving
- Update save_state and load_state API use.
- Refactored self.scores to allocate only a single row (1, n_vocab) when logits_all=False, significantly reducing memory usage for large vocabulary models.
- Optimized save_state to eliminate redundant memory allocations and copies by using ctypes.string_at.
- Updated load_state, eval, and sampler adapters to correctly handle the dynamic shape of self.scores.
Fix CMake install layout to avoid top-level bin directory in site-packages
ggml: Load ggml library from candidate path list
- Auto-select lib/ or bin/ directories
- Add backend loading functions
feat(loader): extend default library search paths on Linux and macOS
- load_shared_library to include a path list feature (allowing you to add custom paths in addition to the default ones). You can later add your own paths to the libggml_base_paths candidate list in _ggml.py, such as those not commonly used Python paths.
- fix: Enhance the handling logic for non-existent file paths.
- Improves reliability of shared library discovery for system-wide installations.
feat: Update llama.cpp to ggml-org/llama.cpp/commit/abb9f3c42b5e6acee9e8e37836ef691d1a41bdb8
feat: Sync llama.cpp llama/mtmd API Binding 20260219

More information see: https://github.com/JamePeng/llama-cpp-python/compare/b03224b2dde3c8cbdd8bf529794e3a41ac7f5751...6b38182a17effd0cedbf8736bee2464dd6c7b4da

[0.3.25]

feat: Refactor Llama class to use new LlamaSampler chain API from _internals This commit refactors the high-level Llama class to fully utilize the new C++ llama_sampler chain architecture via LlamaSamplingContext.
- Replaced manual sampling logic and obsolete _init_sampler with LlamaSamplingContext.
- Updated sample() and generate() to support the full suite of modern sampling strategies (DRY, XTC, Adaptive-P, Infill, etc.).
- Added new sampling parameters to all generation methods (create_completion, create_chat_completion, __call__):
- dynatemp_range, dynatemp_exponent (Dynamic Temperature)
- min_keep
- Refactored logits_processor handling to use CustomSampler adapter for better performance and C++ interop.
- Improved sampling state management (e.g., repetition penalties) by persisting _sampling_ctx during generation.
- Removed manual logit_bias processing in Python; now delegated to the underlying sampler chain.
feat: Separate the grammar sampler, improve the code stability of Sampler Chain processing, and fix some bugs.
Improve sampling and grammar lifecycle management, fix memory growth issues
- Validate grammar sampler initialization and inputs
- Replace unbounded prev token list with bounded deque by LlamaSamplingParams n_prev param
- Reuse logits NumPy view to avoid repeated allocations
- Reuse single-token buffers for grammar rejection sampling
- Minor cleanups and consistency improvements in sampling flow
feat: Fix sampling history alignment with llama.cpp
test: update integration tests for new sampler architecture
test: replace unstable grammar test with deterministic mechanism check
fix: Optimize .gitignore and add macOS system files
feat: Refactor the build-wheels-metal.yaml
feat: Update llama.cpp to ggml-org/llama.cpp/commit/079feab9e3efee1d6d4ca370eac50f156e2dc6e8
feat: Sync llama.cpp llama/mtmd API Binding 20260214

More information see: https://github.com/JamePeng/llama-cpp-python/compare/4ab182382b87bbbba4fb05ff184b557414740103...dc5f7e5564dd68af9d62f7d450cda45313f80b5d

[0.3.24]

feat: Refactor sampling infrastructure to use llama.cpp sampler chain API
- LlamaContext: Remove obsolete manual sampling methods.
- LlamaSampler: Wrap C++ sampler chain; add support for DRY, XTC, Adaptive-P, and lazy grammar.add_grammar merged the processing branches for regular grammar and lazy grammar.
- LlamaSamplingContext: Update to build and manage sampler chains instead of manual logic.
- CustomSampler: Rewrite for proper C-struct lifecycle management and ABI compatibility.
- Optimizations: Simplified redundant handling and variable usage.
feat: Optimize the method definition of class llama_sampler_i and CtypesPointer definition
feat: refactor LlamaSamplingParams class
- base on llama.cpp/common/common.h
- support more sampler params
feat: implement generative reranking with chat template support
- Chat Template Support: Support for rerank templates has been introduced (via llama_model_chat_template(model, b"rerank")), which can automatically populate the query and document into a specific format.
- Now support Qwen3-Reranker series model(Non-Vision)
- Update README.md for Qwen3-Reranker
feat: Update README.md for python 3.14 and HIP (ROCm) guide
feat: Update llama.cpp to ggml-org/llama.cpp/commit/b83111815e9a79949257e9d4b087206b320a3063
feat: Sync llama.cpp llama/mtmd API Binding 20260208

More information see: https://github.com/JamePeng/llama-cpp-python/compare/9b985b75105ea9f5e7f2f5e7988a1149f233af2f...acb7efa09293cffa5912242c372b715555e2a884

[0.3.23]

feat: Implement MiniCPMv45ChatHandler for MiniCPM-V 4.5 with multi-image tracking
feat: Add python 3.14 support and pin numpy version(1.21.4-2.3.2) for compatibility
feat: Increased the n_batch parameter in Llama model initialization from 512 to 2048. Slightly improves the speed of some multimodal image processing.
fix: catch TemplateSyntaxError when parsing metadata chat templates
- Some models (e.g., LLaVA 1.5) contain non-standard Jinja2 tags (like {% generation %}) in their metadata.
- This commit adds a try-except block to prevent initialization crashes, allowing the model to load even if the metadata template is invalid.
feat: enhance default system prompt with strong multimodal + same-language capabilities
feat: Better Llava1.5 Chat Format
ci: Customize wheel filename to improve version identification
- Parse generated wheel filenames in the build step.
- Append CUDA version (cuXXX) and AVX level (basic/avx2) to the version string.
- New format: package-ver+cuXXX.avxver-pyver-abiver-platform.whl (e.g., llama_cpp_python-0.3.23+cu130.basic-cp310-cp310-win_amd64.whl).
feat: Update llama.cpp to ggml-org/llama.cpp/commit/b33df266d0a53f800c47513386920cff1019d70e
feat: Sync llama.cpp llama/mtmd API Binding 20260129

[0.3.22]

perf (TFFT): Optimize longest_token_prefix with Numpy SIMD and fast-fail probe
- Vectorization: Replaced standard Python zip loop with Numpy SIMD comparison for high-performance context matching.
- Fast Exit: Added an O(1) probe check (a[0] != b[0]) to eliminate Numpy conversion overhead on mismatches. Making the "Time To First Token" virtually instantaneous for cached sessions.
- Memory Optimization: Only the intersection of the two sequences ([:min_len]) is converted to Numpy arrays, minimizing memory allocation.
- Result: Achieved ~5x speedup (129ms -> 25ms) in KV cache reuse scenarios and ~2.5x speedup (554.23ms -> 201.62ms) in load time while maintaining stability on long contexts.
- The comparative test results are here: JamePeng#47 (comment)
- This change significantly reduces latency in RAG and chat applications on long contexts.
feat: Add support for adaptive_p and infill samplers and optimize the sampler logic.
feat: Add initialization checks to the Encoder-Decoder architecture.
feat: Update llama.cpp to ggml-org/llama.cpp/commit/10c98cbdf623d982f7491e8de5711e916a913192
feat: Sync llama.cpp llama/mtmd API Binding 20260116

[0.3.21]

perf: optimize tokenization and detokenization logic
- Refactor tokenize, token_to_piece, and detokenize methods in _internals.py to significantly reduce Python loop overhead and improve the batch-processing performance and stability of load/prompt-eval.
- Key changes:
  - Replace token_to_piece O(N) Python loops in detokenize with llama.cpp native batch C-API (llama_detokenize).
  - Implement dynamic buffer allocation to safely handle arbitrary token lengths (removing the hardcoded 32-byte limit).
  - Add automatic buffer resizing for tokenize to prevent overflow errors.
- Performance observations (based on simple benchmarks):
  - Small Batch Processing (127 tokens): Latency reduced from ~117ms to ~37ms (approx. 3.1x speedup in processing loop).
  - Large Batch Processing (2420 tokens): Throughput improved from ~6905 t/s to ~8258 t/s.
  - General Latency: Total execution time for standard chat scenarios reduced by ~1.1s (from 8.4s to 7.3s).
  - The comparative test results are here: JamePeng#47 (comment)
feat: Add Granite-Docling multimodel support with GraniteDoclingChatHandler
feat: Update llama.cpp to ggml-org/llama.cpp/commit/b1377188784f9aea26b8abde56d4aee8c733eec7
feat: Sync llama.cpp llama/mtmd API Binding 20260110

[0.3.20]

feat: Update llama.cpp to ggml-org/llama.cpp/commit/cef1d23c5a33156c44a206c1f4bc146f4db729f9
feat: Update llama_context_params and fixed some embeddings typo
docker(cuda_simple): upgrade to CUDA 12.8.1 and switch to source install
update docker/README.md for cuda simple
Small fix for Llava15ChatHandler class

[0.3.19]

feat: Update llama.cpp to ggml-org/llama.cpp/commit/9b8329de7a7200385aaac16ab4a2ab79ae14d829
feat: Sync llama.cpp llama/mtmd API Binding 20251230
refactor: Extract embedding logic to LlamaEmbedding class, Rerank support and fix parallel batching
- Decoupled embedding and rerank logic into llama_embedding.py.
- Implemented streaming batching for constant memory usage.
- Fixed parallel batching errors by enabling kv_unified. such as "multiple embeddings in a single call"
- Added native rank() support for Reranker models.
- Added advanced normalization support (Euclidean, Taxicab, MaxInt16).
- Added array,json+ output format for raw vector access.
- The legacy embedding implementation in llama.py is now superseded by this optimized approach.
update README.md here: https://github.com/JamePeng/llama-cpp-python?tab=readme-ov-file#embeddings--reranking-gguf
refactor(LlamaBatch): enhance safety checks and fix indexing logic
refactor(Llama): enhance error handling and cleanup in eval method
Fixed a small bug in the Qwen3-VL chat template (by @alcoftTAO)
fix(Llama): implement fallback to full cache clear in eval
- If memory_seq_rm fails to clear from current_pos, fallback to clearing the entire sequence to prevent invalid input batch errors.

More information see: https://github.com/JamePeng/llama-cpp-python/compare/2efaa346bc0aa0d6648938a0dcdf8d12240a8bed...aa653ea5c6a90505a7491e855cc16988293cedd5

[0.3.18]

feat: Update llama.cpp to ggml-org/llama.cpp/commit/ce734a8a2f9fb6eb4f0383ab1370a1b0014ab787
feat: Sync llama.cpp llama/mtmd API Binding 20251215
feat: implement GLM46VChatHandler for GLM-4.6V Series Multimodel
feat: implement LFM2VLChatHandler for LFM2-VL series Multimodel
feat: implement GLM41VChatHandler for GLM-4.1V-9B-Thinking Multimodel
workflow: Added workflows for compiling with CUDA 13.0.2 on Windows and Linux.
feat: Added the scan path for CUDA 13.0+ dynamic link libraries under Windows system ($env:CUDA_PATH\bin\x64)
Optimization: Improved batch token processing logic in Llava15ChatHandler.
perf: optimize LlamaModel.metadata reading performance
- Increase initial buffer size to 16KB to eliminate re-allocations for large chat templates.
- Cache ctypes function references to reduce loop overhead.
- Repeated model loading can result in a cumulative speed improvement of 1-3%.
build: Improve CMakeLists target logic
refactor: optimize LlamaGrammar class code

More information see: https://github.com/JamePeng/llama-cpp-python/compare/67421d546ddcaa07678ac7921a9f124e7e3de10e...d5131e2ff41e05f83fd847052b06938c7a551a6a

[0.3.17]

feat: Update llama.cpp to ggml-org/llama.cpp/commit/054a45c3d313387a4becd5eae982285932852b35
feat: Sync llama.cpp llama/mtmd API Binding 20251121
feat: Support clip flash-attn
feat: 0day support Qwen3VLChatHandler into llama_chat_format.py
Update README.md for Qwen3VL example(Thinking/No Thinking)
feat: Better Qwen3VL chat template. (by @alcoftTAO)
feat: Implement LlamaTrieCache into llama_cache.py: Optimize LlamaCache lookup from O(N) to O(K) using a Trie, improves retrieval speed at least 40x compared to the original linear scan method of finding the longest prefix , thereby enhancing service responsiveness.
feat: Update Llava15ChatHandler to accept use_gpu, image_min_tokens, and image_max_tokens.Now can pass theimage_min_tokensparameter in Qwen3VLChatHandler to support bbox grounding tasks.
feat: Add Pillow process code in _load_image for VLM: that can reliably handle common formats, Supports 20+ image formats (PNG, JPEG, WebP, AVIF, BMP, ICO, TIFF, etc.). Images with alpha channel (PNG, WebP, etc.) → automatically composites on white/black background(white for dark content, black for bright content). Support image format see here: image-file-formats
feat: Optimize CUDA Wheel Build Workflow, now workflow action support python3.10-3.13 cu124-cu126-cu128 Basic(Non AVX)-AVX2 win-linux

More information see: https://github.com/JamePeng/llama-cpp-python/compare/e5392b52036bd2770ece5269352f5600a8db5639...fbb0ed2f089c663a5eb75aadcad08f768041ed72

[0.3.16]

feat: Update llama.cpp to ggml-org/llama.cpp/commit/5e6229a8409ac786e62cb133d09f1679a9aec13e
fix: Improve the Qwen25VLChatHandler's ability to handle multiple image inputs to avoid the illusion of multiple image inputs

[0.3.15]

feat: Update llama.cpp to ggml-org/llama.cpp@cd6983d56d2cce94ecb86bb114ae8379a609073c
feat: Add strftime_now function into Jinja2ChatFormatter Class for gpt-oss and gemma3 chat_template need

[0.3.14]

feat: Update llama.cpp to ggml-org/llama.cpp@79e0b68c178656bb0632cb8602d2940b755077f8

[0.3.13]

feat: Update llama.cpp to ggml-org/llama.cpp@21c021745d781edf9c44b4972ef6cbbf53b0ecff
fix: Better chat format for Qwen2.5-VL by @alcoftTAO in #2040

[0.3.12]

feat: Update llama.cpp to ggml-org/llama.cpp@b8eeb8741d4483daf576498cf90537b4de71633c

[0.3.11]

fix: Update reference to llama_kv_cache_clear in Llama.embed. Closes #2037 by @abetlen in 9e5a4eaa84156084ed7bbb91e6efcc91dc6217bc

[0.3.10]

feat: Update llama.cpp to ggerganov/llama.cpp@28657a8229b5adc6028cf1c4ed62191792d2fdb0
feat: Add support for llama.cpp multimodal, add Qwen2.5-VL chat handler by @abetlen in cd548bd0f14210627798237d5c2ea78acfb88ccb

[0.3.9]

feat: Update llama.cpp to ggerganov/llama.cpp@8733e0cf6eefc7c7752297cc22d0836706f4222c
Make building mtmd optional by setting CMAKE_ARGS="-DMTMD_BUILD=OFF" and using MTMD_CPP_LIB to specify alternative path to shared library

[0.3.8]

feat: Update llama.cpp to ggerganov/llama.cpp@7841fc723e059d1fd9640e5c0ef19050fcc7c698

[0.3.7]

feat: Update llama.cpp to ggerganov/llama.cpp@794fe23f29fb40104975c91fe19f23798f7c726e
fix(ci): Fix the CUDA workflow by @oobabooga in #1894
fix: error showing time spent in llama perf context print, adds no_perf flag to Llama class by @shakalaca in #1898

[0.3.6]

feat: Update llama.cpp to ggerganov/llama.cpp@f7cd13301c2a88f97073fd119072b4cc92c08df1
fix(server): streaming resource lock by @gjpower in #1879

[0.3.5]

feat: Update llama.cpp to ggerganov/llama.cpp@26a8406ba9198eb6fdd8329fa717555b4f77f05f
fix(ci): Fix release by updating macos runner image to non-deprecated version by @abetlen in afedfc888462f9a6e809dc9455eb3b663764cc3f
fix(server): add missing await statements for async exit_stack handling by @gjpower in #1858

[0.3.4]

fix(ci): Build wheels for macos 13-15, cuda 12.1-12.4 by @abetlen in ca808028bd16b8327bd84128d48015a4b1304690

[0.3.3]

feat: Update llama.cpp to ggerganov/llama.cpp@ce8784bdb153ff7794dde5a50b0ebfa51baa6171
fix: chat API logprobs format by @domdomegg in #1788
feat: Add support for CUDA 12.6, fix CUDA 12.5 by @Smartappli in #1775
fix: Make content not required in ChatCompletionRequestAssistantMessage by @feloy in #1807
fix: Fix pickling of Llama class by setting seed from _seed member by @abetlen in 2523472c3eccb9ab9277117cc4ff705212b6888a
fix: Fix logit-bias type hint by @ddh0 in #1802
fix(server): Avoid thread starvation on many concurrent requests by making use of asyncio to lock llama_proxy context by @gjpower in #1798
fix(server): Added missing exit_stack.close() to /v1/chat/completions by @Ian321 in #1796
fix(examples): Refactor Batching notebook to use new sampler chain API by @lukestanley in #1793
fix(docs): Update development instructions by @Florents-Tselai in #1833
fix(docs): Remove ref to llama_eval in llama_cpp.py docs by @richdougherty in #1819

[0.3.2]

feat: Update llama.cpp to ggerganov/llama.cpp@74d73dc85cc2057446bf63cc37ff649ae7cebd80

[0.3.1]

feat: Update llama.cpp to ggerganov/llama.cpp@c919d5db39c8a7fcb64737f008e4b105ee0acd20
feat: Expose libggml in internal APIs by @abetlen in #1761
fix: Fix speculative decoding by @abetlen in 9992c5084a3df2f533e265d10f81d4269b97a1e6 and e975dabf74b3ad85689c9a07719cbb181313139b
misc: Rename all_text to remaining_text by @xu-song in #1658

[0.3.0]

feat: Update llama.cpp to ggerganov/llama.cpp@ea9c32be71b91b42ecc538bd902e93cbb5fb36cb
feat: Enable detokenizing special tokens with special=True by @benniekiss in #1596
feat(ci): Speed up CI workflows using uv, add support for CUDA 12.5 wheels by @Smartappli in e529940f45d42ed8aa31334123b8d66bc67b0e78
feat: Add loading sharded GGUF files from HuggingFace with Llama.from_pretrained(additional_files=[...]) by @Gnurro in 84c092063e8f222758dd3d60bdb2d1d342ac292e
feat: Add option to configure n_ubatch by @abetlen in 6c44a3f36b089239cb6396bb408116aad262c702
feat: Update sampling API for llama.cpp. Sampling now uses sampler chain by @abetlen in f8fcb3ea3424bcfba3a5437626a994771a02324b
fix: Don't store scores internally unless logits_all=True. Reduces memory requirements for large context by @abetlen in 29afcfdff5e75d7df4c13bad0122c98661d251ab
fix: Fix memory allocation of ndarray in by @xu-song in #1704
fix: Use system message in og qwen format by @abetlen in 98eb092d3c6e7c142c4ba2faaca6c091718abbb3

[0.2.90]

feat: Update llama.cpp to ggerganov/llama.cpp@1d1ccce67613674c75c9c7e3fa4c1e24e428ba48
feat: Add support for MiniCPMv26ChatHandler and minicpm-v-26 in server by @abetlen in f70df824985d875226793b94dacc0c302a4256b2

[0.2.89]

feat: Update llama.cpp to ggerganov/llama.cpp@cfac111e2b3953cdb6b0126e67a2487687646971
fix: Llama.close didn't free lora adapter by @jkawamoto in #1679
fix: missing dependencies for test by @jkawamoto in #1680

[0.2.88]

feat: Update llama.cpp to ggerganov/llama.cpp@fc4ca27b25464a11b3b86c9dbb5b6ed6065965c2
fix: only print 'cache saved' in verbose mode by @lsorber in #1668
fix: Added back from_file method to LlamaGrammar by @ExtReMLapin in #1673
fix: grammar prints on each call by @abetlen in 0998ea0deea076a547d54bd598d6b413b588ee2b
feat: Enable recursive search of HFFS.ls when using from_pretrained by @benHeidabetlen in #1656
feat: Add more detailed log for prefix-match by @xu-song in #1659

[0.2.87]

feat: Update llama.cpp to ggerganov/llama.cpp@be55695eff44784a141a863f273661a6bce63dfc
fix: Include all llama.cpp source files and subdirectories by @abetlen in 9cad5714ae6e7c250af8d0bbb179f631368c928b
feat(ci): Re-build wheel index automatically when releases are created by @abetlen in 198f47dc1bd202fd2b71b29e041a9f33fe40bfad

[0.2.86]

feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
feat: Ported back new grammar changes from C++ to Python implementation by @ExtReMLapin in (#1637)
fix: llama_grammar_accept_token arg order by @tc-wolf in (#1649)

[0.2.85]

feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
fix: Missing LoRA adapter after API change by @shamitv in #1630
fix(docker): Update Dockerfile BLAS options by @olivierdebauche in #1632
fix(docker): Fix GGML_CUDA param by @olivierdebauche in #1633
fix(docker): Update Dockerfile build options from LLAMA_ to GGML_ by @olivierdebauche in #1634
feat: FreeBSD compatibility by @yurivict in #1635

[0.2.84]

feat: Update llama.cpp to ggerganov/llama.cpp@4730faca618ff9cee0780580145e3cbe86f24876
fix: fix: Correcting run.sh filepath in Simple Docker implementation by @mashuk999 in #1626

[0.2.83]

feat: Update llama.cpp to ggerganov/llama.cpp@081fe431aa8fb6307145c4feb3eed4f48cab19f8
feat: Add 'required' literal to ChatCompletionToolChoiceOption by @mjschock in #1597
fix: Change repeat_penalty to 1.0 to match llama.cpp defaults by @ddh0 in #1590
fix(docs): Update README.md typo by @ericcurtin in #1589
fix(server): Use split_mode from model settings by @grider-withourai in #1594
feat(ci): Dockerfile update base images and post-install cleanup by @Smartappli in #1530

[0.2.82]

feat: Update llama.cpp to ggerganov/llama.cpp@7fdb6f73e35605c8dbc39e9f19cd9ed84dbc87f2

[0.2.81]

feat: Update llama.cpp to ggerganov/llama.cpp@968967376dc2c018d29f897c4883d335bbf384fb
fix(ci): Fix CUDA wheels, use LLAMA_CUDA instead of removed LLAMA_CUBLAS by @abetlen in 4fb6fc12a02a68884c25dd9f6a421cacec7604c6
fix(ci): Fix MacOS release, use macos-12 image instead of removed macos-11 by @abetlen in 3a551eb5263fdbd24b36d7770856374c04e92788

[0.2.80]

feat: Update llama.cpp to ggerganov/llama.cpp@023b8807e10bc3ade24a255f01c1ad2a01bb4228
fix(server): Fix bug in FastAPI streaming response where dependency was released before request completes causing SEGFAULT by @abetlen in 296304b60bb83689659883c9cc24f4c074dd88ff
fix(server): Update default config value for embeddings to False to fix error in text generation where logits were not allocated by llama.cpp by @abetlen in bf5e0bb4b151f4ca2f5a21af68eb832a96a79d75
fix(ci): Fix the CUDA workflow by @oobabooga in #1551
docs: Update readme examples to use newer Qwen2 model by @jncraton in #1544

[0.2.79]

feat: Update llama.cpp to ggerganov/llama.cpp@9c77ec1d74874ee22bdef8f110e8e8d41389abf2
feat(ci): Update workflows and pre-built wheels by @Smartappli in #1416
feat: Add .close() method to Llama class to explicitly free model from memory by @jkawamoto in #1513
feat: Support SPM infill by @CISC in #1492

[0.2.78]

feat: Update llama.cpp to ggerganov/llama.cpp@fd5ea0f897ecb3659d6c269ef6f3d833e865ead7
fix: Avoid duplicate special tokens in chat formats by @CISC in #1439
fix: fix logprobs when BOS is not present by @ghorbani in #1471
feat: adding rpc_servers parameter to Llama class by @chraac in #1477

[0.2.77]

feat: Update llama.cpp to ggerganov/llama.cpp@bde7cd3cd949c1a85d3a199498ac98e78039d46f
fix: string value kv_overrides by @abetlen in df45a4b3fe46e72664bda87301b318210c6d4782
fix: Fix typo in Llama3VisionAlphaChatHandler by @abetlen in 165b4dc6c188f8fda2fc616154e111f710484eba
fix: Use numpy recarray for candidates data, fixes bug with temp < 0 by @abetlen in af3ed503e9ce60fe6b5365031abad4176a3536b3 fix: Disable Windows+CUDA workaround when compiling for HIPBLAS by Engininja2 in #1493

[0.2.76]

feat: Update llama.cpp to ggerganov/llama.cpp@0df0aa8e43c3378975269a51f9b876c8692e70da
feat: Improve Llama.eval performance by avoiding list conversion by @thoughtp0lice in #1476
example: LLM inference with Ray Serve by @rgerganov in #1465

[0.2.75]

feat: Update llama.cpp to ggerganov/llama.cpp@13ad16af1231ab2d245d35df3295bcfa23de1305
fix: segfault for models without eos / bos tokens by @abetlen in d99a6ba607a4885fb00e63e967964aa41bdbbbcb
feat: add MinTokensLogitProcessor and min_tokens argument to server by @twaka in #1333
misc: Remove unnecessary metadata lookups by @CISC in #1448

[0.2.74]

feat: Update llama.cpp to ggerganov/llama.cpp@b228aba91ac2cd9eb90e9d423ba1d0d20e0117e2
fix: Enable CUDA backend for llava by @abetlen in 7f59856fa6f3e23f07e12fc15aeb9359dc6c3bb4
docs: Fix typo in README.md by @yupbank in #1444

[0.2.73]

feat: Update llama.cpp to ggerganov/llama.cpp@25c6e82e7a1ad25a42b0894e87d9b5c557409516
fix: Clear kv cache at beginning of image chat formats to avoid bug when image is evaluated first by @abetlen in ac55d0a175115d1e719672ce1cb1bec776c738b1

[0.2.72]

fix(security): Remote Code Execution by Server-Side Template Injection in Model Metadata by @retr0reg in b454f40a9a1787b2b5659cd2cb00819d983185df
fix(security): Update remaining jinja chat templates to use immutable sandbox by @CISC in #1441

[0.2.71]

feat: Update llama.cpp to ggerganov/llama.cpp@911b3900dded9a1cfe0f0e41b82c7a29baf3a217
fix: Make leading bos_token optional for image chat formats, fix nanollava system message by @abetlen in 77122638b4153e31d9f277b3d905c2900b536632
fix: free last image embed in llava chat handler by @abetlen in 3757328b703b2cd32dcbd5853271e3a8c8599fe7

[0.2.70]

feat: Update llama.cpp to ggerganov/llama.cpp@c0e6fbf8c380718102bd25fcb8d2e55f8f9480d1
feat: fill-in-middle support by @CISC in #1386
fix: adding missing args in create_completion for functionary chat handler by @skalade in #1430
docs: update README.md @eltociear in #1432
fix: chat_format log where auto-detected format prints None by @balvisio in #1434
feat(server): Add support for setting root_path by @abetlen in 0318702cdc860999ee70f277425edbbfe0e60419
feat(ci): Add docker checks and check deps more frequently by @Smartappli in #1426
fix: detokenization case where first token does not start with a leading space by @noamgat in #1375
feat: Implement streaming for Functionary v2 + Bug fixes by @jeffrey-fong in #1419
fix: Use memmove to copy str_value kv_override by @abetlen in 9f7a85571ae80d3b6ddbd3e1bae407b9f1e3448a
feat(server): Remove temperature bounds checks for server by @abetlen in 0a454bebe67d12a446981eb16028c168ca5faa81
fix(server): Propagate flash_attn to model load by @dthuerck in #1424

[0.2.69]

feat: Update llama.cpp to ggerganov/llama.cpp@6ecf3189e00a1e8e737a78b6d10e1d7006e050a2
feat: Add llama-3-vision-alpha chat format by @abetlen in 31b1d95a6c19f5b615a3286069f181a415f872e8
fix: Change default verbose value of verbose in image chat format handlers to True to match Llama by @abetlen in 4f01c452b6c738dc56eacac3758119b12c57ea94
fix: Suppress all logs when verbose=False, use hardcoded fileno's to work in colab notebooks by @abetlen in f116175a5a7c84569c88cad231855c1e6e59ff6e
fix: UTF-8 handling with grammars by @jsoma in #1415

[0.2.68]

feat: Update llama.cpp to ggerganov/llama.cpp@77e15bec6217a39be59b9cc83d6b9afb6b0d8167
feat: Add option to enable flash_attn to Lllama params and ModelSettings by @abetlen in 22d77eefd2edaf0148f53374d0cac74d0e25d06e
fix(ci): Fix build-and-release.yaml by @Smartappli in #1413

[0.2.67]

fix: Ensure image renders before text in chat formats regardless of message content order by @abetlen in 3489ef09d3775f4a87fb7114f619e8ba9cb6b656
fix(ci): Fix bug in use of upload-artifact failing to merge multiple artifacts into a single release by @abetlen in d03f15bb73a1d520970357b702a9e7d4cc2a7a62

[0.2.66]

feat: Update llama.cpp to ggerganov/llama.cpp@8843a98c2ba97a25e93319a104f9ddfaf83ce4c4
feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) by @abetlen in #1147
ci(fix): Workflow actions updates and fix arm64 wheels not included in release by @Smartappli in #1392
ci: Add support for pre-built cuda 12.4.1 wheels by @Smartappli in #1388
feat: Add support for str type kv_overrides by @abetlen in a411612b385cef100d76145da1fbd02a7b7cc894
fix: Functionary bug fixes by @jeffrey-fong in #1385
examples: fix quantize example by @iyubondyrev in #1387
ci: Update dependabot.yml by @Smartappli in #1391

[0.2.65]

feat: Update llama.cpp to ggerganov/llama.cpp@46e12c4692a37bdd31a0432fc5153d7d22bc7f72
feat: Allow for possibly non-pooled embeddings by @iamlemec in #1380

[0.2.64]

feat: Update llama.cpp to ggerganov/llama.cpp@4e96a812b3ce7322a29a3008db2ed73d9087b176
feat: Add llama-3 chat format by @andreabak in #1371
feat: Use new llama_token_is_eog in create_completions by @abetlen in d40a250ef3cfaa8224d12c83776a2f1de96ae3d1
feat(server): Provide ability to dynamically allocate all threads if desired using -1 by @sean-bailey in #1364
ci: Build arm64 wheels by @gaby in 611781f5319719a3d05fefccbbf0cc321742a026
fix: Update scikit-build-core build dependency avoid bug in 0.9.1 by @evelkey in #1370

[0.2.63]

feat: Update llama.cpp to ggerganov/llama.cpp@0e4802b2ecbaab04b4f829fde4a3096ca19c84b5
feat: Add stopping_criteria to ChatFormatter, allow stopping on arbitrary token ids, fixes llama3 instruct by @abetlen in cc81afebf04d26ca1ac3cf72f23f18da6ab58588

[0.2.62]

feat: Update llama.cpp to ggerganov/llama.cpp@3b8f1ec4b18770531d0b1d792f3edf08254e4f0c
feat: update grammar schema converter to match llama.cpp by @themrzmaster in #1353
feat: add disable_ping_events flag by @khimaros in #1257
feat: Make saved state more compact on-disk by @tc-wolf in #1296
feat: Use all available CPUs for batch processing by @ddh0 in #1345

[0.2.61]

feat: Update llama.cpp to ggerganov/llama.cpp@ba5e134e073ec6837078c874aba44a702944a676
fix: pass correct type to chat handlers for chat completion logprobs by @abetlen in bb65b4d76411112c6fb0bf759efd746f99ef3c6b
feat: Add support for yaml based server configs by @abetlen in 060bfa64d529ade2af9b1f4e207a3937bbc4138f
feat: Add typechecking for ctypes structure attributes by @abetlen in 1347e1d050fc5a9a32ffe0bb3e22858da28003bd

[0.2.60]

feat: Update llama.cpp to ggerganov/llama.cpp@75cd4c77292034ecec587ecb401366f57338f7c0
fix: Always embed metal library by @abetlen in b3bfea6dbfb6ed9ce18f9a2723e0a9e4bd1da7ad
fix: missing logprobs in response, incorrect response type for functionary by @abetlen in 1ae3abbcc3af7f4a25a3ffc40b246f18039565e8
fix(docs): incorrect tool_choice example by @CISC in #1330

[0.2.59]

feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c
feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal by @abetlen, @jllllll, and @oobabooga in #1247
fix: segfault when logits_all=False by @abetlen in 8649d7671bd1a7c0d9cc6a5ad91c6ca286512ab3
fix: last tokens passing to sample_repetition_penalties function by @ymikhailov in #1295

[0.2.58]

feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c
feat: add support for KV cache quantization options by @Limour-dev in #1307
feat: Add logprobs support to chat completions by @windspirit95 in #1311
fix: set LLAMA_METAL_EMBED_LIBRARY=on on MacOS arm64 by @bretello in #1289
feat: Add tools/functions variables to Jinja2ChatFormatter, add function response formatting for all simple chat formats by @CISC in #1273
fix: Changed local API doc references to hosted by by @lawfordp2017 in #1317

[0.2.57]

feat: Update llama.cpp to ggerganov/llama.cpp@ac9ee6a4ad740bc1ee484ede43e9f92b5af244c1
fix: set default embedding pooling type to unspecified by @abetlen in 4084aabe867b8ec2aba1b22659e59c9318b0d1f3
fix: Fix and optimize functionary chat handler by @jeffrey-fong in #1282
fix: json mode for basic chat formats by @abetlen in 20e6815252d0efd9f015f7adbf108faaf36e3f3c

[0.2.56]

feat: Update llama.cpp to ggerganov/llama.cpp@c2101a2e909ac7c08976d414e64e96c90ee5fa9e
feat(server): Add endpoints for tokenize, detokenize and count tokens by @felipelo in #1136
feat: Switch embed to llama_get_embeddings_seq by @iamlemec in #1263
fix: Fixed json strings grammar by blacklisting character control set by @ExtReMLapin in d02a9cf16ff88ad011e2eb1ce29f4d9400f13cd1
fix: Check for existence of clip model path by @kejcao in #1264

[0.2.55]

feat: Update llama.cpp to ggerganov/llama.cpp@9731134296af3a6839cd682e51d9c2109a871de5
docs: fix small typo in README: 'model know how' -> 'model knows how' by @boegel in #1244

[0.2.54]

feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
docs: fix typo in README.md embeddings example by @iamlemec in #1232

[0.2.53]

feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter by @CISC in #1230

[0.2.52]

feat: Update llama.cpp to ggerganov/llama.cpp@a33e6a0d2a66104ea9a906bdbf8a94d050189d91
fix: Llava15ChatHandler (this function takes at least 4 arguments) by @abetlen in 8383a9e5620f5df5a88f62da16813eac200dd706

[0.2.51]

feat: Update llama.cpp to ggerganov/llama.cpp@c39373398803c669056304090050fe3f44b41bf9
fix: Restore type hints for low-level api by @abetlen in 19234aa0dbd0c3c87656e65dd2b064665371925b

[0.2.50]

docs: Update Functionary OpenAI Server Readme by @jeffrey-fong in #1193
fix: LlamaHFTokenizer now receives pre_tokens by @abetlen in 47bad30dd716443652275099fa3851811168ff4a

[0.2.49]

fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8' in Llama.save_state by @abetlen in db776a885cd4c20811f22f8bd1a27ecc71dba927
feat: Auto detect Mixtral's slightly different format by @lukestanley in #1214

[0.2.48]

feat: Update llama.cpp to ggerganov/llama.cpp@15499eb94227401bdc8875da6eb85c15d37068f7
feat: Add Google's Gemma formatting via chat_format="gemma" by @alvarobartt in #1210
feat: support minItems/maxItems in JSON grammar converter by @nopperl in 3921e10770996d95a9eb22c8248bacef39f69365
fix: Update from_pretrained defaults to match hf_hub_download and pull to local cache folder by @abetlen in e6d6260a91b7831733f7d1f73c7af46a3e8185ed
fix: Raise exceptions when llama model or context fails to load by @abetlen in dd22010e85265ae840c76ec835d67a29ed852722
docs: Update README.md to fix pip install llama cpp server by @audip in #1187

[0.2.47]

feat: Update llama.cpp to ggerganov/llama.cpp@973053d8b0d04809836b3339a50f68d9c842de90

[0.2.46]

feat: Update llama.cpp to ggerganov/llama.cpp@ba2135ccae7462470b3865c6e41d2e1d734eac05
feat: Pull models directly from huggingface by @abetlen in #1206
feat(low-level-api): Improve API static type-safety and performance. Low level api functions are positional args only now. by @abetlen in #1205

[0.2.45]

feat: Update llama.cpp to ggerganov/llama.cpp@89febfed9322c8849520dc63c93ee4f5fd72556e

[0.2.44]

feat: Update llama.cpp to ggerganov/llama.cpp@4524290e87b8e107cc2b56e1251751546f4b9051
fix: create_embedding broken response for input type str by @abetlen in 0ce66bc080fe537590b05b24bf442480bf2dd045
fix: Use '\n' seperator for EventSourceResponse by @khimaros in #1188
fix: Incorporate embedding pooling layer fixes by @iamlemec in #1194

[0.2.43]

feat: Update llama.cpp to ggerganov/llama.cpp@8084d554406b767d36b3250b3b787462d5dd626f
feat: Support batch embeddings by @iamlemec in #1186
fix: submodule kompute is not included in sdist by @abetlen in 7dbbfdecadebe7750be650d9409959640ff9a460
fix: fix: Update openbuddy prompt format by @abetlen in 07a783779a62a4aac0b11161c7e0eb983ff215f8

[0.2.42]

feat: Update llama.cpp to ggerganov/llama.cpp@ea9c8e11436ad50719987fa23a289c74b7b40d40
fix: sample idx off-by-one error for logit_processors by @lapp0 in #1179
fix: chat formatting bugs in chatml-function-calling by @abetlen in 4b0e3320bd8c2c209e29978d0b21e2e471cc9ee3 and 68fb71b6a26a1e57331868f959b47ab4b87851e1

[0.2.41]

feat: Update llama.cpp to ggerganov/llama.cpp@895407f31b358e3d9335e847d13f033491ec8a5b
fix: Don't change order of json schema object properties in generated grammar unless prop_order is passed by @abetlen in d1822fed6b706f38bd1ff0de4dec5baaa3cf84fa

[0.2.40]

feat: Update llama.cpp to ggerganov/llama.cpp@3bdc4cd0f595a6096cca4a64aa75ffa8a3503465
feat: Generic chatml Function Calling using chat_format="chatml-function-calling"` by @abetlen in #957
fix: Circular dependancy preventing early Llama object free by @notwa in #1176
docs: Set the correct command for compiling with syscl support by @akarshanbiswas in #1172
feat: use gpu backend for clip if available by @iamlemec in #1175

[0.2.39]

feat: Update llama.cpp to ggerganov/llama.cpp@b08f22c882a1443e6b97081f3ce718a4d1a741f8
fix: Fix destructor logging bugs by using llama_log_callback to avoid suppress_stdout_stderr by @abetlen in 59760c85eddc72dfcc1839f43760ef72c23d6874

[0.2.38]

feat: Update llama.cpp to ggerganov/llama.cpp@1cfb5372cf5707c8ec6dde7c874f4a44a6c4c915
feat: Add speculative decoding by @abetlen in #1120
fix: Pass raise_exception and add_generation_prompt to jinja2 chat template by @abetlen in 078cca0361bf5a94d2cf52ed04980d20e32d6f95

[0.2.37]

feat: Update llama.cpp to ggerganov/llama.cpp@fea4fd4ba7f6b754ac795387b275e1a014a77bde
feat: Automatically set chat format from gguf by @abetlen in #1110

[0.2.36]

feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f
feat: Add mistral instruct chat format as "mistral-instruct" by @Rafaelblsilva in #799

[0.2.35]

feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00

[0.2.34]

feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855
feat: Add json schema mode by @abetlen in #1122

[0.2.33]

feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4
feat(server): include llama-cpp-python version in openapi spec by @abetlen in cde7514c3d28e6d52f272614e9957208c344dde5
fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. by @abetlen in 5b982d0f8c6f35242c8862ffdce00e17cea0b44f
fix: GGUF metadata KV overrides, re #1011 by @phiharri in #1116
fix: llama_log_set should be able to accept null pointer by @abetlen in c970d41a85381fd55235136f123422df0bf0c7e7

[0.2.32]

feat: Update llama.cpp to ggerganov/llama.cpp@504dc37be8446fb09b1ede70300250ad41be32a2
fix: from_json_schema oneof/anyof bug by @jndiogo in d3f5528ca8bcb9d69d4f27e21631e911f1fb9bfe
fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats by @abetlen in 24f39454e91cf5dddbc4b6041aead4accc7c7a2d
feat: Add add_generation_prompt option for jinja2chatformatter by @abetlen in 7f3209b1eb4ad3260ba063801fab80a8c25a2f4c
feat: Add Jinja2ChatFormatter by @abetlen in be09318c26add8674ce494ae7cc480cce72a4146
feat: Expose gguf model metadata in metadata property by @abetlen in 5a34c57e5479e50c99aba9b38218cc48e6560b81

[0.2.31]

feat: Update llama.cpp to ggerganov/llama.cpp@a5cacb22b2114fd9adf61c00cbb237384d86bced
fix: Mirostat sampling now passes correct type to ctypes and tracks state during generation by @abetlen in 3babe3512cb95743108f2b595210c38ed6f1b904
fix: Python3.8 support in server by @abetlen in 141293a75b564a8699e0acba1da24d9aa1cf0ab1

[0.2.30]

feat: Update llama.cpp to ggerganov/llama.cpp@57e2a7a52a819883f40dada8a2edc24ecf48186b
feat(server): Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files by @abetlen in b8fc1c7d83ad4a9207c707ba1d954fe580286a01
feat: Integration of Jinja2 Templating for chat formats by @teleprint-me in #875
fix: Offload KQV by default by @abetlen in 48c3b77e6f558a9899de0e1155c7dc0c7958d8e8
fix: Support Accept text/event-stream in chat and completion endpoints, resolves #1083 by @aniljava in #1088
fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 by @K-Mistele in #1093

[0.2.29]

feat: Update llama.cpp to ggerganov/llama.cpp@4483396751c79dea540808b9cb9238245d06da2b
feat: Add split_mode option by @abetlen in 84615adbc6855c8384807c42f0130f9a1763f99d
feat: Implement GGUF metadata KV overrides by @phiharri in #1011
fix: Avoid "LookupError: unknown encoding: ascii" when open() called in a destructor by @yieldthought in #1012
fix: Fix low_level_api_chat_cpp example to match current API by @aniljava in #1086
fix: Fix Pydantic model parsing by @DeNeutoy in #1087

[0.2.28]

feat: Update llama.cpp to ggerganov/llama.cpp@6efb8eb30e7025b168f3fda3ff83b9b386428ad6
feat: Add ability to pass in penalize_nl param by @shankinson in #1068
fix: print_grammar to stderr by @turian in #1052

[0.2.27]

feat: Update llama.cpp to ggerganov/llama.cpp@b3a7c20b5c035250257d2b62851c379b159c899a
feat: Add saiga chat format by @femoiseev in #1050
feat: Added chatglm3 chat format by @xaviviro in #1059
fix: Correct typo in README.md by @qeleb in (#1058)

[0.2.26]

feat: Update llama.cpp to ggerganov/llama.cpp@f6793491b5af6da75edad34d6f503ef86d31b09f

[0.2.25]

feat(server): Multi model support by @D4ve-R in #931
feat(server): Support none defaulting to infinity for completions by @swg in #111
feat(server): Implement openai api compatible authentication by @docmeth2 in #1010
fix: text_offset of multi-token characters by @twaka in #1037
fix: ctypes bindings for kv override by @phiharri in #1011
fix: ctypes definitions of llama_kv_cache_view_update and llama_kv_cache_view_free. by @e-c-d in #1028

[0.2.24]

feat: Update llama.cpp to ggerganov/llama.cpp@0e18b2e7d0b5c0a509ea40098def234b8d4a938a
feat: Add offload_kqv option to llama and server by @abetlen in 095c65000642a3cf73055d7428232fb18b73c6f3
feat: n_ctx=0 now uses the n_ctx_train of the model by @DanieleMorotti in #1015
feat: logits_to_logprobs supports both 2-D and 3-D logits arrays by @kddubey in #1002
fix: Remove f16_kv, add offload_kqv fields in low level and llama apis by @brandonrobertz in #1019
perf: Don't convert logprobs arrays to lists by @kddubey in #1021
docs: Fix README.md functionary demo typo by @evelynmitchell in #996
examples: Update low_level_api_llama_cpp.py to match current API by @jsoma in #1023

[0.2.23]

Update llama.cpp to ggerganov/llama.cpp@948ff137ec37f1ec74c02905917fa0afc9b97514
Add qwen chat format by @yhfgyyf in #1005
Add support for running the server with SSL by @rgerganov in #994
Replace logits_to_logprobs implementation with numpy equivalent to llama.cpp by @player1537 in #991
Fix UnsupportedOperation: fileno in suppress_stdout_stderr by @zocainViken in #961
Add Pygmalion chat format by @chiensen in #986
README.md multimodal params fix by @zocainViken in #967
Fix minor typo in README by @aniketmaurya in #958

[0.2.22]

Update llama.cpp to ggerganov/llama.cpp@8a7b2fa528f130631a5f43648481596ab320ed5a
Fix conflict with transformers library by kddubey in #952

[0.2.21]

Update llama.cpp to ggerganov/llama.cpp@64e64aa2557d97490b2fe1262b313e2f4a1607e3
Make building llava optional by setting CMAKE_ARGS="-DLLAVA_BUILD=OFF" and using LLAVA_CPP_LIB to specify alternative path to shared library by @abetlen in e3941d9c674dbd9891dc3ceda390daeb21f05fd1

[0.2.20]

Update llama.cpp to ggerganov/llama.cpp@b38a16dfcff88d547f78f52d1bea31b84a05aff7
Add zephyr chat format by @fakerybakery in #938
Add baichuan chat format by @caiyesd in #938
Add baichuan-2 chat format by @caiyesd in #936
Improve documentation for server chat formats by @jooray in #934
Fix typo in README by @antonvice in 940
Fix typo in the Open Orca chat format by @gardner in #947

[0.2.19]

Update llama.cpp to ggerganov/llama.cpp@0b871f1a04ef60e114bbe43004fd9c21114e802d
Fix #569: stop parameter in chat completion api should accept str by @abetlen in 128dc4731fa846ead7e684a137ca57d8931b8899
Document server host and port parameters by @jamesbraza in #768
Do not set grammar to None when initializing LlamaGrammar by @mthuurne in #834
Add mistrallite, intel, and openchat formats by @fakerybakery in #927
Add support for min_p parameter by @tk-master in #921
Fix #929: tokenizer adding leading space when generating from empty prompt by @abetlen in a34d48014192771d2e308a76c22f33bc0318d983
Fix low level api example by @zocainViken in #925
Fix missing package in openblas docker image by @ZisisTsatsas in #920

[0.2.18]

Update llama.cpp to ggerganov/llama.cpp@6bb4908a17150b49373b5f977685b2e180a04f6f

[0.2.17]

Update llama.cpp to ggerganov/llama.cpp@df9d1293defe783f42bc83af732d3c670552c541
Hotfix: Set CUDA_ARCHITECTURES=OFF for llava_shared target on Windows by @abetlen in 4388f3341413110217b98c4f097ac5c590bdf40b

[0.2.16]

Update llama.cpp to ggerganov/llama.cp@a75fa576abba9d37f463580c379e4bbf1e1ad03c
Add set_seed to Llama class by @abetlen in fd41ed3a908761d286102a019a34c2938a15118d
Fix server doc arguments by @kjunggithub in #892
Fix response_format handler in llava chat handler by @abetlen in b62c44983921197ed10a7d29dc4ba920e9979380
Fix default max_tokens, chat completion is now unlimited (to context length) and completion is 16 tokens to match OpenAI defaults by @abetlen in e7962d2c733cbbeec5a37392c81f64185a9a39e8
Fix json_schema_to_gbnf helper so that it takes a json schema string as input instead by @abetlen in faeae181b1e868643c0dc28fcf039f077baf0829
Add support for $ref and $def in json_schema_to_gbnf to handle more complex function schemas by @abetlen in 770df344369c0630df1be14be9f9e301e7c56d24
Update functionary chat handler for new OpenAI api by abetlen in 1b376c62b775b401653facf25a519d116aafe99a
Fix add default stop sequence to chatml chat format by @abetlen in b84d76a844149216d511cfd8cdb9827148a1853c
Fix sampling bug when logits_all=False by @abetlen in 6f0b0b1b840af846938ed74d0e8170a91c40e617

[0.2.15]

Update llama.cpp to ggerganov/llama.cpp@0a7c980b6f94a049cb804573df2d8092a34df8e4
Add support for Llava1.5 multimodal models by @damian0815 and @abetlen in #821
Update OpenAI API compatibility to match dev day update by @abetlen in #821
Add seed parameter to completion and chat_completion functions of Llama class by @abetlen in 86aeb9f3a14808575d2bb0076e6acb4a30907e6a
Add JSON mode support to constrain chat completion to JSON objects by @abetlen in b30b9c338bf9af316d497ea501d39f5c246900db

[0.2.14]

Update llama.cpp to ggerganov/llama.cpp@f0b30ef7dc1360922ccbea0a8cd3918ecf15eaa7
Add support for Huggingface Autotokenizer Chat Formats by @bioshazard and @abetlen in #790 and bbffdaebaa7bb04b543dbf683a07276087251f86
Fix llama-2 chat format by @earonesty in #869
Add support for functionary chat format by @abetlen in #784
Migrate inference from deprecated llama_evalAPI to llama_batch and llama_decode by @abetlen in #795

[0.2.13]

Update llama.cpp to ggerganov/llama.cpp@51b2fc11f7f605fff49725a4540e9a6ef7b51b70
Fix name 'open' is not defined exception when deleting model by @abetlen in 011b95d7f34cbfc528af75a892757bd9a20838ab
Fix tokenization of special characters by @antoine-lizee in #850

[0.2.12]

Update llama.cpp to ggerganov/llama.cpp@50337961a678fce4081554b24e56e86b67660163
Fix missing n_seq_id in llama_batch by @NickAlgra in #842
Fix for shared libraries on Windows that start with lib prefix by @sujeendran in #848
Fix exception raised in __del__ when freeing models by @cebtenzzre in #846
Performance improvement for logit bias by @zolastro in #851
Fix suffix check arbitrary code execution bug by @mtasic85 in #854
Fix typo in function_call parameter in llama_types.py by @akatora28 in #849
Fix streaming not returning finish_reason by @gmcgoldr in #798
Fix n_gpu_layers check to allow values less than 1 for server by @hxy9243 in #826
Supppress stdout and stderr when freeing model by @paschembri in #803
Fix llama2 chat format by @delock in #808
Add validation for tensor_split size by @eric1932 #820
Print stack trace on server error by @abetlen in d6a130a052db3a50975a719088a9226abfebb266
Update docs for gguf by @johnccshen in #783
Add chatml chat format by @abetlen in 305482bd4156c70802fc054044119054806f4126

[0.2.11]

Fix bug in llama_model_params object has no attribute logits_all by @abetlen in d696251fbe40015e8616ea7a7d7ad5257fd1b896

[0.2.10]

Fix bug 'llama_model_params' object has no attribute 'embedding' by @abetlen in 42bb721d64d744242f9f980f2b89d5a6e335b5e4

[0.2.9]

Fix critical bug in pip installation of v0.2.8 due to .git directory in ac853e01e1a217a578080a4e1b851d2d08450adf

[0.2.8]

Update llama.cpp to ggerganov/llama.cpp@40e07a60f9ce06e79f3ccd4c903eba300fb31b5e
Add configurable chat formats by @abetlen in #711
Fix rope scaling bug by @Josh-XT in #767
Fix missing numa parameter in server by @abetlen in d9bce17794d0dd6f7962d10aad768fedecf3ab89

[0.2.7]

Update llama.cpp to ggerganov/llama.cpp@a98b1633d5a94d0aa84c7c16e1f8df5ac21fc850
Install required runtime dlls to package directory on windows by @abetlen in 8d75016549e2ff62a511b1119d966ffc0df5c77b
Add openai-processing-ms to server response header by @Tradunsky in #748
Bump minimum version of scikit-build-core to 0.5.1 to fix msvc cmake issue by @abetlen in 1ed0f3ebe16993a0f961155aa4b2c85f1c68f668
Update llama_types.py to better match the openai api, old names are aliased to new ones by @abetlen in dbca136feaaf7f8b1182c4c3c90c32918b1d0bb3

[0.2.6]

Update llama.cpp to 80291a1d02a07f7f66666fb576c5b1e75aa48b46

[0.2.5]

Fix docker images missing starlette-context dependency by @abetlen in 22917989003c5e67623d54ab45affa1e0e475410
Fix loading dll in Windows Isolation Containers by @abetlen in 847466562573191efa655753d9252f308c4fbdb0
Fix build issue on m1 macs by @abetlen in dbd3a6d1ed8416a8fd800127251e730153afa305
Update docs to gguf and add hw acceleration docs for server by @jasonacox in #688

[0.2.4]

Add NUMA support. NOTE low level api users must call llama_backend_init at the start of their programs by abetlen in f4090a0bb2a2a25acfe28d31c82cc1aa273bedee
Fix tensor_split server cli argument by @abetlen in c4c440ba2dc86d9de728a751311fdd1c8e3756fa
Made all Llama init parameters into keyword-only parameters by @abetlen in c8f9b8a734b5b040379bbd93995ba177affab1fe
Added server params for low_vram, main_gpu, lora_base, and lora_path by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b
Removed server params for rms_norm_eps and n_gqa by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b
Fix boolean cli options by @abetlen in c999325e8e4507f6c6249dd2fb8de7f8bf57f71e and 0449d29b9f940e437231a07b9d56550226558bac
Silence Pydantic Settings warnings about model_alias setting by @earonesty in #705

[0.2.3]

Update llama.cpp to ggerganov/llama.cpp@71ca2fad7d6c0ef95ef9944fb3a1a843e481f314
Add X-Request-ID request header for mirroring custom IDs by @devrimcavusoglu in #703
Add pyproject extra for scikit-build-core to ensure compatible pathspec version by @abetlen in 6cfc54284b99ef1bff8193e2d5e483dbd89ada02
Fix issue with Literal and Optional cli arguments not working by @abetlen in #702

[0.2.2]

Fix bug in pip install of v0.2.1 due to scikit-build-core removing all .metal files in the source distribution (see #701)

[0.2.1]

Fix bug in pip install of v0.2.0 due to .git folder being included in the source distribution (see #701)

[0.2.0]

Migrated to scikit-build-core build system by @abetlen in #499
Use numpy views for LogitsProcessor and StoppingCriteria instead of python lists by @abetlen in #499
Drop support for end-of-life Python3.7 by @abetlen in #499
Convert low level llama.cpp constants to use basic python types instead of ctypes types by @abetlen in #499

[0.1.85]

Add llama_cpp.__version__ attribute by @janvdp in #684
Fix low level api examples by @jbochi in #680

[0.1.84]

Update llama.cpp

[0.1.83]

Update llama.cpp

[0.1.82]

Update llama.cpp

[0.1.81]

Update llama.cpp

[0.1.80]

Update llama.cpp

[0.1.79]

GGUF Support (breaking change requiring new model format)

[0.1.78]

Grammar based sampling via LlamaGrammar which can be passed to completions
Make n_gpu_layers == -1 offload all layers

[0.1.77]

(llama.cpp) Update llama.cpp add support for LLaMa 2 70B
(server) Add temporary n_gqa and rms_norm_eps parameters required for LLaMa 2 70B

[0.1.76]

(llama.cpp) Update llama.cpp add support for LLaMa 2 70B

[0.1.75]

Update llama.cpp

[0.1.74]

(server) OpenAI style error responses

[0.1.73]

(server) Add rope parameters to server settings

[0.1.72]

(llama.cpp) Update llama.cpp added custom_rope for extended context lengths

[0.1.71]

(llama.cpp) Update llama.cpp
(server) Fix several pydantic v2 migration bugs

[0.1.70]

(Llama.create_completion) Revert change so that max_tokens is not truncated to context_size in create_completion
(server) Fixed changed settings field names from pydantic v2 migration

[0.1.69]

(server) Streaming requests can are now interrupted pre-maturely when a concurrent request is made. Can be controlled with the interrupt_requests setting.
(server) Moved to fastapi v0.100.0 and pydantic v2
(docker) Added a new "simple" image that builds llama.cpp from source when started.
(server) performance improvements by avoiding unnecessary memory allocations during sampling

[0.1.68]

(llama.cpp) Update llama.cpp

[0.1.67]

Fix performance bug in Llama model by pre-allocating memory tokens and logits.
Fix bug in Llama model where the model was not free'd after use.

[0.1.66]

(llama.cpp) New model API
Performance issue during eval caused by looped np.concatenate call
State pickling issue when saving cache to disk

[0.1.65]

(llama.cpp) Fix struct misalignment bug

[0.1.64]

(llama.cpp) Update llama.cpp
Fix docs for seed. Set -1 for random.

[0.1.63]

(llama.cpp) Add full gpu utilisation in CUDA
(llama.cpp) Add get_vocab
(llama.cpp) Add low_vram parameter
(server) Add logit_bias parameter

[0.1.62]

Metal support working
Cache re-enabled

[0.1.61]

Fix broken pip installation

[0.1.60]

NOTE: This release was deleted due to a bug with the packaging system that caused pip installations to fail.

Truncate max_tokens in create_completion so requested tokens doesn't exceed context size.
Temporarily disable cache for completion requests

[v0.1.59]

(llama.cpp) k-quants support
(server) mirostat sampling parameters to server
Support both .so and .dylib for libllama on MacOS

[v0.1.58]

(llama.cpp) Metal Silicon support

[v0.1.57]

(llama.cpp) OpenLlama 3B support

[v0.1.56]

(misc) Added first version of the changelog
(server) Use async routes
(python-api) Use numpy for internal buffers to reduce memory usage and improve performance.
(python-api) Performance bug in stop sequence check slowing down streaming.

FilesExpand file tree

CHANGELOG.md

Latest commit

History