Skip to content

Latest commit

 

History

History
1507 lines (1055 loc) · 92.6 KB

File metadata and controls

1507 lines (1055 loc) · 92.6 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

[0.3.35] Gemma 4 series & LFM 2.5-VL Support, OpenAI OpenAPI Alignment and Logging Architecture Migration

  • fix: expand stop sequences for Gemma4ChatHandler

    • Add GEMMA4_EOS_TOKEN and GEMMA4_STR_TOKEN to the generation stop criteria.
    • Align the stopping logic with the model's generation_config.json definitions.
    • Prevent potential over-generation by ensuring the model halts correctly at standard EOS or when initiating a tool response.
  • feat(types): align with latest OpenAI OpenAPI spec (audio, structured outputs)

    • Update llama_types.py OpenAI OpenAPI Link
    • Add developer role.
    • Replace Anyscale-specific JSON schema with official OpenAI json_schema response format for Structured Outputs.
    • Add input_audio and file types to request message content parts.
    • Add audio, refusal, and annotations (e.g., URL citations) fields to response messages.
    • Add content_filter to finish reasons and strictly define global ChatCompletionRole.
  • docs: clarify enable_thinking compatibility for Gemma 4 models

    • Update Gemma4ChatHandler class docstring and __init__ args documentation.
    • Specify that the enable_thinking toggle is exclusively supported by Gemma4 31B and 26BA4B variants.
    • Explicitly note that E2B and E4B models do not currently support this feature to prevent configuration errors.
  • feat(chat_format): Implemented Gemma4ChatHandler, add Gemma 4 chat handler with multimodal and tool support

    • Implement Gemma4ChatHandler with Gemma 4 specific tokens (<|turn>, <|channel>, etc.).
    • Add complex Jinja2 template for advanced nested tool/function schema formatting.
    • Support multimodal content injection for image_url, audio_url, and input_audio (including base64 reconstruction).
    • Integrate reasoning/thinking controls via enable_thinking toggle and <|channel>thought formatting.
    • Configure <turn|> as the primary stop sequence for generation boundaries.
  • feat(chat_format) Implemented LFM25VLChatHandler for LFM2.5-VL (by @alcoftTAO)

  • fix Qwen3.5 chat template typos(reported by @abdullah-cod9)

  • refactor(logger): migrate from llama_log_callback to ggml_log_callback

    • Remove the deprecated llama_log_callback typedef from llama_cpp.py.
    • Update _logger.py to use ggml_log_callback from _ggml, aligning with the upstream GGML logging architecture.
    • Rename the callback references across the codebase, including the MTMD context initialization in llama_chat_format.py.
  • feat(ggml): add support for ggml-base library and new function bindings

    • Load the new ggml-base shared library alongside ggml.
    • Add ctypes bindings for ggml_log_get, ggml_log_set, and ggml_set_zero using the ggml_base_function decorator.
  • Update README.md

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/58190cc84d846d8575ba26e8486bc29d9fd8ad55

  • feat: Sync llama.cpp llama/mtmd API Binding 20260402

More information see: https://github.com/JamePeng/llama-cpp-python/compare/a184583e908cc138fd15794986b3581521fb9b0c...232092e32b3563159a86aacb168da06c4937192b

[0.3.34] Dynamic LoRA Routing, Control Vectors, and Assistant Prefill

  • feat(chat_format): added assistant_prefill to seamlessly continue responses

    • This commit introduces the assistant_prefill parameter to the chat completion API, satisfying the highly requested need to continue interrupted or partially generated assistant messages.
    • Resolves #97 (Chat completion from unfinished response)
    • Usage:
      • Simply set assistant_prefill=True in create_chat_completion when the final item in your messages list is a partial assistant response. The engine will use it as a prompt base and continue generating seamlessly.
    • docs(readme): add documentation for Assistant Prefill features
      • Also slightly updated the huggingface_hub installation instructions for accuracy.
  • feat(internals): implement dynamic LoRA routing and Control Vector support

    • This commit overhauls the adapter management architecture in _internals.py to support dynamic, per-request LoRA routing and Control Vector (CVec) injection with strict C++ memory safety.

    • Key changes:

      • Secure Memory Management: Introduced the LlamaLoraAdapter wrapper class to securely handle the lifecycle of llama_adapter_lora_p pointers, preventing VRAM leaks. Also added support for extracting ALoRA invocation tokens.
      • Model-Level Registry: Added _lora_registry to LlamaModel with robust methods (load_lora, unload_lora, unload_all_loras) to preload adapters into VRAM. Integrated cleanup into the model's ExitStack and close() methods for deterministic memory release.
      • Context-Level Dynamic Routing: Implemented apply_loras and clear_loras in LlamaContext to dynamically swap compute graph weights using contiguous C arrays, enabling zero-delay multi-tenant LoRA switching.
      • Control Vector Integration: Added apply_cvec and clear_cvec to LlamaContext for representation engineering. Includes strict C++ memory layout validation (enforcing buffer zero-padding up to n_embd * il_end) to prevent silent write failures in the GGML backend.
      • Observability & Docs: Added verbose logging for adapter/CVec application and expanded docstrings for context utility methods (e.g., threading, causal attention, warmup).
      • Update README.md for Dynamic LoRA Routing & Control Vectors
  • fix(types): correct llama_adapter_get_alora_invocation_tokens ctypes signature and use pointer for llama_token

  • fix(types): correct llama_set_adapters_lora LoRA adapter ctypes signature and use pointer for scales

    • change scale: float to float* (POINTER(c_float))
    • make adapters and scales optional arrays to match C API
  • refactor: remove legacy static LoRA initialization

    • Removed lora_base, lora_path, and lora_scale from Llama init parameters and state.
    • Dropped outdated llama_adapter_lora_init and llama_set_adapters_lora bindings in the constructor.
    • Restored default use_mmap behavior (no longer forced to False when LoRA is present).
    • This removes the global context pollution and paves the way for the new dynamic, per-request LoRA routing architecture.
  • chore: enhance hybrid cache logging and document M-RoPE token usage

    • Added explanatory comments detailing why n_tokens is used instead of chunk_n_pos for M-RoPE models (to prevent the system from skipping evaluation).
    • Added verbose logging for hybrid cache clearance scenarios (when checkpoints are missing, restore fails, or max_checkpoints is 0).
  • feat(core): add verbose debug logging to longest_token_prefix fast paths

    • Added an optional verbose parameter to Llama.longest_token_prefix to explicitly log early-exit conditions. This provides crucial visibility into cache-miss behaviors during debugging by outputting the specific reason for a fast exit (e.g., empty sequence vs. mismatched first token) along with the offending sequence lengths or token values.
  • Update MIT license copyright to collective authorship (2023-2026)

    • Change single-author copyright to The llama-cpp-python authors and apply standard multi-line formatting for better readability.
    • Every contributor who participates and makes an effort makes the project more reliable, efficient, and user-friendly, and they all deserve to be remembered.
    • Welcome to join us in promoting the project and enriching the open-source community.
  • Update CMakeLists.txt

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f

  • feat: Sync llama.cpp llama/mtmd API Binding 20260325

More information see: https://github.com/JamePeng/llama-cpp-python/compare/6bbc8d2306319c67c9f7d0d2d0576496f3587a3c...a8cec004466493db57d3cbc043cdc897b2b37f9b

[0.3.33] Fixing Multimodal Image Freezes, Stabilizing Logits, and Optimized Legacy Cache Logic

  • perf(mtmd): optimize media_id masking with bitwise AND

    • Replaced the modulo operation (% (2**24)) with a bitwise AND mask (& 0xFFFFFF) when calculating the deterministic media_id from the CRC32 hash. This is a micro-optimization that leverages faster native CPU bitwise instructions instead of division, resulting in more idiomatic and performant low-level bit masking.
    • When processing 1-2 images, the difference between % and & is only a few nanoseconds, imperceptible to the user. In future video processing, you might need to frantically calculate IDs within a for loop for 100 or even 300 frames of data. In this case, the extremely low CPU overhead of the bitwise operation & 0xFFFFFF ensures that the main thread will not experience any computational blockage at the Python layer when building a virtual ledger of tens of thousands of characters.
  • fix(mtmd): prevent multimodal image freeze by injecting deterministic media IDs

    • Fixed a critical "image freeze" bug (report by @KLL535) where the model would continuously reuse the first cached image regardless of new inputs.
    • The issue occurred because the C++ self._mtmd_cpp.mtmd_input_chunk_get_id(chunk) parser returns an empty ID (b'') for placeholder tokens like <__media__>, causing all media to fallback to the same magic number (-314159). This resulted in false-positive KV cache prefix matches.
    • Replaced the C++ chunk ID extraction with a Python-side media_item_cursor that generates a deterministic 32-bit negative ID using zlib.crc32(real_media_url).
    • This ensures longest_token_prefix correctly identifies and reuses identical images while instantly breaking the cache match when the image content changes.
    • Chat Structure: [System Prompt] + [Image] + [Question]
    • The computational cost should be significantly reduced for the same image but different questions.
  • fix(Llama.generate): add explicit fallback context reset and expand Llama.generate docstrings

    • Added a fallback if reset: block in Llama.generate to ensure the KV cache and hybrid cache manager are explicitly cleared when reset=True is passed and no prefix match is found. This prevents potential context poisoning from previous runs.
    • Added comprehensive docstrings to the generate method for all newly integrated sampler parameters (e.g., XTC, Mirostat, DRY penalties etc.).
    • Added explicit verbose logging for cache resetting, rollback events, and speculative decoding behaviors to improve debuggability.
  • docs(README.md): add sampling parameter guide and strategic project tips to README

    • Sampling Documentation: Added a comprehensive guide for LlamaSamplingParams. It covers core, advanced (XTC, Dynatemp, Adaptive-P), entropy (Mirostat), and DRY repetition penalty configurations with a clean Python usage example.
    • Project Tips: Added a new "Quick tips" section to explicitly communicate the semi-deprecated status of llama_cpp.server in favor of the upstream llama-server.
    • Backend Recommendations: Added practical advice for AMD and Intel GPU users, officially recommending the Vulkan backend for cross-platform stability and faster updates.
  • fix(sampling): prevent memory drift and hallucinations in logits view

    • Previously, the numpy view for logits_ptr in LlamaSamplingContext.sample was only initialized once. If the underlying C++ buffer was reallocated or shifted (e.g., due to dynamic batch sizes or KV cache shifts), the numpy array would point to stale memory, leading to severe model hallucinations or segfaults.
    • Added explicit _logits_ptr_addr tracking to monitor the physical C memory address.
    • The zero-copy numpy view (_logits_view) is now safely recreated on-the-fly whenever the backend memory address changes.
    • Added proper initialization and garbage collection for the new tracker in __init__ and close.
  • feat(_ggml): extend ctypes bindings with more ggml constants, enums, and structs

  • fix(chat_handler): fix tools and function calling in MTMDChatHandler.(Issue reported by @alcoftTAO)

  • perf(cache): optimize LlamaDiskCache I/O and fix LRU behavior

    • Delegated LRU and size limits to native diskcache SQLite engine, removing the slow manual eviction loop.
    • Added an O(1) early exit in _find_longest_prefix_key to prevent unnecessary full-table disk scans.
    • Fixed a destructive read bug by replacing .pop() with standard access to properly update LRU timestamps.
    • Added fast-path empty checks to bypass disk queries entirely when the cache is empty.
  • perf(cache): upgrade LlamaRAMCache to O(1) eviction and set LlamaTrieCache as default Addressed severe performance bottlenecks in legacy RAM caching components:

    • Refactored LlamaRAMCache to use an O(1) _current_size tracker instead of an O(N) dynamic sum. This eliminates massive CPU spikes and O(N^2) complexity during LRU eviction cycles.
    • Added strict OOM safeguards to LlamaRAMCache: The current size is explicitly clamped to 0 during evictions, and hard-reset to 0 if the cache empties, preventing catastrophic capacity drift.
    • Introduced early-exit O(1) short-circuits in __getitem__ and __contains__ to bypass expensive prefix searches when the cache is empty.
    • Updated the LlamaCache backward-compatibility alias to point to the highly optimized LlamaTrieCache instead of the legacy LlamaRAMCache.
  • fix(core): disable swa_full for non-SWA models (sync llama.cpp upstream #20291)

    • Fallback context_params.swa_full to False if _n_swa == 0 and emit a warning.
    • Updated is_hybrid validation to use the resolved self.context_params.swa_full state.
  • fix(chat_format): fix namespace and variable shadowing of llama modules

    • Changed imports to use llama_cpp_lib and llama_core to avoid namespace collisions.
    • Fixed severe variable shadowing where the llama module was being overshadowed by the llama parameter in function signatures.
    • Updated associated type hints and C-API bindings to use the new isolated aliases.
    • Corrected LlamaGrammar type definitions to point to the llama_grammar module.
  • fix(cache): fix namespace shadowing to prevent AttributeError (Issue reported by @kantan-kanto)

    • Renamed llama_cpp.llama import to llama_core and llama_cpp.llama_cpp to llama_cpp_lib to prevent namespace collision.
    • Fixed AttributeError thrown when accessing llama_cpp.llama.Llama.longest_token_prefix.
    • Updated all associated type hints and C-API bindings in cache classes to use the new isolated aliases.
  • feat(MTMDChatHandler): support audio inputs and fix interleaved media ordering

    • Refactored CHAT_FORMAT to use a single loop for message.content, preserving the exact chronological order of interleaved text, images, and audio.
    • Added template routing for audio_url.
    • Added template routing for OpenAI's input_audio format, properly formatting it as a Data URI.
  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/d23355afc319f598d0e588a2d16a4da82e14ff41

  • feat: Sync llama.cpp llama/mtmd API Binding 20260313

More information see: https://github.com/JamePeng/llama-cpp-python/compare/e7e1d48065ba53846f290cfe563c8c839a062ebe...b0f00a96d3803863dd7d479f0cb1305f76f741b3

[0.3.32] Hybrid/Multimodal Model Single-Turn Optimizations & Fix Sampling Seed

  • perf(hybrid): optimize multimodal single-turn and fix KV clear bug

    • Added a 100% match "FAST PATH" in Llama.generate to bypass N-1 truncation for hybrid models when caching is disabled.
    • Fixed a bug where failed rollbacks on disabled caches would wipe the KV cache, causing multimodal pseudo-token crashes.
    • Updated MTMDChatHandler to suppress cache-related logs and anchoring logic when max_checkpoints <= 0.
  • perf(hybrid): prevent expensive array slicing when cache is disabled

    • Added a max_checkpoints > 0 check to the finally block of the generation loop.
    • Previously, even though the underlying C++ state extraction was bypassed, the Python layer was still executing self._input_ids[:self.n_tokens].tolist(). For long contexts, slicing and converting this massive array to a Python list caused unnecessary CPU overhead and garbage collection (GC) pressure. This intercept acts as a double-layer isolation, ensuring absolute zero memory allocation and zero overhead for hybrid models running in single-turn mode.
  • perf(hybrid): bypass N-1 evaluation split if max_checkpoints is 0

    • Prevent fragmenting the prompt evaluation into len(tokens)-1 and 1 when hybrid caching is disabled.
    • Allows the underlying C++ engine to process the entire prompt in a single, efficient batch for single-turn workflows.
  • perf(hybrid): eliminate PCIe I/O latency for single-turn workflows

    • This commit introduces critical performance optimizations and log tracing improvements for HybridCheckpointCache in single-turn workflows (e.g., ComfyUI or single-turn conversation mode):
      • Now support 0 HybridCheckpointCache for single-turn conversation.(set the ctx_checkpoints=0 when llama init )
      • Added early-exit intercepts for max_checkpoints <= 0 in save_checkpoint and find_best_checkpoint. This prevents massive (e.g., 150MB+) synchronous VRAM-to-RAM state extractions over the PCIe bus when rollback capabilities are disabled, eliminating a ~3-second blocking delay at the end of generation.
      • Added a non-empty check in clear() to prevent log spam when the cache is already empty or disabled.
      • Standardized logging prefixes (e.g., HybridCheckpointCache(save_checkpoint)) for better observability.
      • Fixed a potential UnicodeEncodeError hazard in warning logs by replacing a non-standard arrow character with standard ASCII (->).
  • fix(sampling): pass seed to sampling context and remove global mutation

    • Add seed parameter to generate and sample method signatures.
    • Pass the resolved seed directly to LlamaSamplingParams to ensure the underlying C++ sampler uses it.
    • Remove thread-unsafe self.set_seed() calls in _create_completion to prevent global state pollution during concurrent requests.
  • docs(issue-template): modernize bug report for efficiency

    • Completely revamped the legacy bug report template to streamline troubleshooting. Added an anti-AI-spam policy, a detailed OS/Hardware matrix, forced verbose=True logging requirements with code examples, and new sections for model parameters and AI-assisted brainstorming.
  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/b283f6d5b3d2d079019ae5ed3cbbdb4b3be03b25

[0.3.31] Omni-Modal Media Pipeline, Hybrid 1-Token Rollback and Enhanced Logging

  • refactor(mtmd): introduce omni-modal media pipeline with experimental audio support This commit significantly overhauls the media parsing and loading pipeline in MTMDChatHandler to gracefully handle both vision and audio inputs, establishing a true omni-modal architecture.

    Key structural changes:

    • Hardware Capability Sniffing: _init_mtmd_context now actively probes the C++ backend for ctx_v (vision) and ctx_a (audio) encoders, enabling proactive fail-fast validation before media processing.

    • Unified Media Extraction: Replaced get_image_urls and split_text_on_image_urls with a robust _get_media_items method. This safely parses image_url, input_audio, and audio_url while strictly maintaining the chronological order of user prompts and enforcing OpenAI format specs.

    • Media Dispatcher & Magic Bytes: Introduced a unified load_media dispatcher. Added a new _load_audio method and a rigorous detect_audio_format static method that accurately mimics llama.cpp's C++ magic bytes sniffing (RIFF/WAVE, ID3/MPEG, fLaC) to prevent fatal backend crashes.

    • Concurrent Omni-Decoding: The ThreadPoolExecutor in _process_mtmd_prompt has been upgraded to concurrently fetch and decode both image and audio payloads into unified mtmd_bitmap structures.

    • Note: Audio processing capabilities in the underlying llama.cpp engine are currently in an experimental stage.

  • fix(hybrid): implement N-1 checkpointing to support 1-token rollbacks

    • Forces an N-1 state snapshot during prompt prefilling for hybrid models. This ensures the engine can safely perform a 1-token rollback to refresh logits upon 100% cache matches (e.g., changing seeds on identical prompts), preventing RNN state desyncs and empty outputs.

    • Note: For the Comfyui plugin developer, I recommend performing a reset operation before inputting the prompt word. This way, the seed will be included as one of the factors in the initial complete recalculation.

  • fix(mtmd): remove OS-level log suppression to expose critical C++ errors

    • Removed the suppress_stdout_stderr context manager around critical C++ backend calls (_init_mtmd_context, _create_bitmap_from_bytes, and close).

    • Previously, when verbose=False, this OS-level file descriptor redirection was swallowing fatal C++ backend errors (e.g., stb_image decoding failures, corrupted .mmproj model weights, or CUDA Out-Of-Memory aborts), resulting in silent crashes that were impossible to debug. The framework now correctly relies on the native C-API llama_log_callback to route logs to Python gracefully, ensuring that critical decoding and hardware exceptions remain visible to the developer.

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/f5ddcd1696eca5069dc7915f4d4c03c9a709afea

[0.3.30] Milestone Release

I will update the release notes for version 0.3.30 in the discussion.

  • refactor(mtmd): redesign multimodal pipeline for concurrent I/O and hybrid state management This commit fundamentally restructures the MTMDChatHandler pipeline, decoupling the prefill and evaluation stages to resolve previous I/O bottlenecks and state-sync issues. The new architecture fully supports hybrid/recurrent multimodal models (e.g., Qwen3.5s, LFM2-VL) with robust context management.

    Key structural advantages and changes:

    • Concurrent Media Decoding: Implemented ThreadPoolExecutor in _process_mtmd_prompt with pre-allocated arrays, allowing thread-safe parallel image/audio decoding while strictly preserving the chronological order of user inputs, and can be used in the future to process large numbers of video frames.
    • O(1) Prefix Matching ("Negative Reverse Vocabulary"): Replaced slow dictionary lookups with a deterministic hash-to-negative-integer mapping for media IDs. This isolates media tokens from the LLM's positive vocabulary space, enabling native, ultra-fast longest_token_prefix array comparisons in Python.
    • Hybrid Model State Management: Replaced aggressive mid-turn saving with highly efficient "End-of-Turn" checkpointing. This ensures multi-image prompts consume only a single LRU slot while allowing precise rollback to the nearest valid state upon cache misses.
    • Robust Context Shift (OOM Defense): The __call__ loop now preemptively calculates token boundaries for upcoming multimodal chunks, safely discarding the oldest unpinned tokens from both the physical KV cache and the Python virtual ledger to prevent backend crashes.
    • Qwen3.5 Support CONFRIMED, waiting Qwen35ChatHandler PR merge
  • merge: Implemented Qwen35ChatHandler for Qwen3.5(by @alcoftTAO)

  • fix: Correct the mtmd vision check condition bug

  • refactor(chat_handler): extract MTMDChatHandler base class and Simplify the complexity of subsequent multimodal adaptation

    • Extracted the core multimodal processing pipeline from Llava15ChatHandler into a generic MTMDChatHandler base class, separating pipeline logic from model-specific prompt formats.
    • Updated all multimodal subclass handlers (e.g., Gemma3, Granite-Docling, PaddleOCR, Qwen2.5vl, Qwen3-vl, MiniCPM, GLM4.xV, LFM2-VL) to inherit from the new base class MTMDChatHandler.
    • Implemented strict **kwargs validation in the baseconstructor to gracefully intercept and report unsupported parameters, significantly improving Developer Experience (DX).
    • Introduced dynamic self.log_prefix (self.__class__.__name__) for accurate and consistent logging across all subclasses.
    • Cleaned up redundant state-clearing, image-count logic and hardcoded print statements across subclass __call__ implementations.
    • To avoid exceptions occurring when the close method is called due to initialization failure and the call to exit_stack.
  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/2afcdb9777b1bac79fa4bfe284b9bf23085b0469

  • feat: Sync llama.cpp llama/mtmd API Binding 20260301

Many thanks to @yamikumo-DSD and @roj234 for providing detailed testing and valuable suggestions.

More information see: https://github.com/JamePeng/llama-cpp-python/compare/e4861df5fd44bb83ec2b9063ca3375759416aead...3f8f0f89a2b72ec2f9494fa5f14206591a5cde49

[0.3.29]

  • perf(eval): implement adaptive checkpoint intervals for hybrid models

    • Dynamically scale checkpoint frequency during large prompt pre-filling (max 3 triggers per eval) to minimize I/O bottlenecks and stuttering.
    • Add success validation to save_checkpoint, ensuring the last_ckpt_pos tracker is only updated when the state is successfully saved to disk/memory.
    • Enhance verbose logging to track dynamic interval calculations and save failures.
  • fix(eval): make context shift mathematically robust and architecture-safe

    • Added a memory_can_shift() pre-flight check to proactively intercept and abort gracefully on architectures that physically forbid shifting (e.g., multimodal mmproj where n_pos_per_embd > 1 or incompatible M-RoPE), preventing fatal GGML_ASSERT C++ crashes.
    • Implemented dynamic mathematical bounds for n_keep and n_discard to guarantee that enough space is always freed, completely eliminating the edge-case where n_discard evaluates to 0 (causing a dead-loop when n_ctx is extremely small).
    • Wrapped underlying C++ memory shift operations in a try-except block for defense-in-depth against unexpected backend failures.
    • Expanded in-code documentation to clarify the arithmetic constraints and architectural limitations of the KV shift mechanism.
  • Add the memory_can_shift API to class LlamaContext

  • feat(eval): enable native context shift for hybrid/recurrent models

    • Removed the RuntimeError that previously blocked context shifting for hybrid and SWA architectures.
    • Delegated the shift logic to the underlying C++ backend, which automatically handles Attention KV removal and RNN pos shifting.
    • Added dynamic verbose logging to clearly identify the model type (Transformer vs. Hybrid/Recurrent/SWA) during a context shift event.
  • fix(eval): prevent batch size from halving below 1 during KV slot exhaustion

    • Added an explicit guard to break the dynamic batch downgrade loop when current_batch_size is exactly 1 and a Code 1 (No KV slot) is returned.
    • Prevents the engine from executing an invalid 1 // 2 operation and generating the confusing "Halving batch size from 1 to 0" verbose log.
    • Ensures the evaluation process fails fast and aborts gracefully when physical VRAM is completely depleted and no further fallback is mathematically possible.
  • feat(hybrid): add periodic checkpointing and adaptive batch handling

    • Increase default ctx_checkpoints from 16 to 32
    • Add new parameter checkpoint_interval (default: 4096) for hybrid model state snapshots
    • Implement robust dynamic batch downgrade on KV cache exhaustion (status=1)
    • Introduce periodic checkpoint saves during eval in hybrid mode
    • Improve error handling and logging around context shifts and decoding failures
  • Optimization (decode): treat KV slot exhaustion (code 1) as a recoverable return value

    • Updated the decode wrapper to explicitly return 1 instead of raising a RuntimeError when llama_decode indicates no KV slots are available.
    • Aligned Python API behavior with the underlying C++ contract, treating code 1 as a recoverable signal rather than a fatal crash.
    • Enabled upper-level caller loops (like eval) to gracefully handle VRAM fragmentation via dynamic batch halving without relying on clumsy try-except block string parsing.
    • Retained strict RuntimeError exceptions for truly fatal backend failures (e.g., codes -1, -2, -3).
    • Added comprehensive docstrings detailing return codes and exception scenarios.
  • feat(core): overhaul generate and eval for hybrid model support(Qwen3-next、Qwen3.5 etc.)

    • Integrated HybridCheckpointCache into the generation loop to support state rollback for recurrent/hybrid architectures.
    • Implemented Context Shift (sliding window) in eval to gracefully prevent OOM when exceeding n_ctx.
    • Adapted eval to use the newly vectorized LlamaBatch.add_sequence API with dynamic logits_array configuration.
    • Fixed the full prefix match bug by forcing a 1-token re-evaluation to refresh logits.
    • Disabled speculative decoding for hybrid models to prevent irreversible state pollution.
    • Wrapped the generation loop in a try...finally block to guarantee safe checkpoint saving.
  • refactor(LlamaBatch): replace set_batch with granular add_token + vectorized add_sequence

    • Introduce high-performance add_token() for single-token append in generation loop
    • Add flexible add_sequence() with per-token pos/seq_ids/logits arrays
    • Remove old set_batch() that assumed single-seq + forced last logit
    • Better support for multi-sequence and precise logit control

[0.3.28]

  • fix(HybridCheckpointCache): ValueError: bytes must be in range(0, 256)

  • feat: add HybridCheckpointCache detect support for recurrent/hybrid/SWA models

    • Introduce ctx_checkpoints parameter (default 16)
    • Detect recurrent / hybrid / n_swa > 0 models in init
    • Automatically use HybridCheckpointCache when hybrid architecture is detected
    • Properly close and clear HybridCheckpointCache in del
  • fix(cache): add safety guards to checkpoint restore and optimize API calls

    • Replaced direct llama_cpp API calls with cached function pointers (self._get_size_ext, etc.) for better performance and consistency.
    • Added sequence ID validation with verbose error logging to prevent cross-sequence contamination.
    • Added strict state size validation before restoration to prevent buffer overflows and backend segmentation faults.
  • Remove redundant seq_id and add resource cleanup

    • Removed seq_id from HybridCheckpointCache initialization to make it a stateless, global multi-sequence manager.
    • Added close() and __del__() methods to safely release C++ context references and prevent memory leaks.
  • feat(cache): implement HybridCheckpointCache for hybrid/recurrent models Introduces a dedicated caching mechanism to support state rollback for models that cannot physically truncate their KV cache (e.g., Qwen3-Next, Qwen3.5, etc.).

    Key additions and changes:

    • Add HybridCheckpoint dataclass to store RNN state snapshots along with their binary data and metadata.
    • Implement HybridCheckpointCache to manage sequence-specific states using the llama_state_seq_*_ext C++ APIs.
    • Introduce _hash_prefix using SHA-256 to guarantee cryptographic certainty when matching prompt histories, preventing state corruption.
    • Add save_checkpoint with a FIFO eviction policy to strictly bound memory usage based on max_checkpoints.
    • Add restore_checkpoint to securely inject valid RNN states back into the C++ backend.
    • Explicitly disable incompatible dictionary interfaces (__getitem__, __setitem__, __contains__) inherited from BaseLlamaCache.
    • Refactor module imports (alphabetical sorting) and relocate LlamaDiskCache for better structural consistency.
  • Remove the hack code in llama_chat_format.py

  • LLama: Optimize KV cache management for multi-round conversations

    • Implements prefix-matching logic to truncate stale "ghost" tokens in C++ KV cache
    • Prevents attention misalignment and context poisoning during multi-turn interactions
    • Reduces memory overhead by reusing matched prefixes efficiently

[0.3.27]

  • feat: add PaddleOCR-VL-1.5 multimodal chat handler PaddleOCRChatHandler

  • fix: resolve VRAM leak in multimodal models by explicitly closing mtmd context

    • Remove the ExitStack closure in Llava15ChatHandler to break circular references preventing garbage collection of the vision context.
    • Implement explicit close() and __del__() methods in the chat handler to safely free mtmd_ctx.
    • Integrate chat_handler.close() into the main Llama.close() lifecycle and nullify related attributes for immediate memory reclamation.
  • fix: resolve reload memory leaks by breaking circular references in cleanup

    • Remove closure-based ExitStack callbacks in LlamaModel, LlamaContext, and LlamaBatch to eliminate circular references.
    • Move explicit C-level memory freeing (llama_*_free) directly into close() methods.
    • Nullify additional attributes (tokenizer_, model_params, etc.) in Llama.close() to guarantee immediate garbage collection during unload.
    • Add todo comment to LoRA process logic, the current LoRa loading logic is outdated and needs to be refactored.
  • feat: add memory breakdown and sampler performance timings APIs

  • feat: Search system paths for shared libraries(by @benniekiss)

  • Optimize longest_token_prefix to use zero-copy NumPy arrays and drop .tolist() overhead

  • Free _candidates and large numpy arrays during explicit close()

  • fix: resolve memory leaks in sampling context lifecycle

    • Safely close temporary LlamaSamplingContext in sample() using a try-finally block.
    • Explicitly release the previous _sampling_ctx in generate() before re-assignment to prevent orphaned pointers.
    • Ensure _sampling_ctx is properly freed in Llama.close().
  • Fix custom sampler memory cleanup and improve lifecycle management

    • Add explicit close() and __del__() to CustomSampler to safely free C resources and break Python reference cycles
    • Ensure custom samplers are properly detached and freed in LlamaSampler.close()
    • Add minor documentation comments for clarity
  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/cacc371f99fb3b5b431d3fa89ac0c752bbd62a3b

  • feat: Sync llama.cpp llama/mtmd API Binding 20260223

More information see: https://github.com/JamePeng/llama-cpp-python/compare/3d0fd1b75ee564361a4babf21f88855225ba1fe0...1f8341ee74a2fea15a1008487cbecaccf918c755

[0.3.26]

  • perf(llama-cpp): optimize LlamaTokenDataArray memory operations

    • Cache NumPy field views for 'id', 'logit', and 'p' to bypass expensive property lookups.
    • Refactor copy_logits to use pre-generated ID sequences and cached views.
    • Ensure logical consistency by resetting token IDs every sampling step to counter C++ reordering.
    • Minimize redundant memory allocations during the inference loop.
  • feat: Add explicit memory cleanup for sampling contexts

    • Implements close() and __del__ for LlamaTokenDataArray and expands LlamaSamplingContext cleanup.
    • Ensures NumPy views and internal C-references are properly released to allow Python GC to reclaim memory.
  • optimize(memory): reduce scores buffer size and optimize state saving

    • Update save_state and load_state API use.
    • Refactored self.scores to allocate only a single row (1, n_vocab) when logits_all=False, significantly reducing memory usage for large vocabulary models.
    • Optimized save_state to eliminate redundant memory allocations and copies by using ctypes.string_at.
    • Updated load_state, eval, and sampler adapters to correctly handle the dynamic shape of self.scores.
  • Fix CMake install layout to avoid top-level bin directory in site-packages

  • ggml: Load ggml library from candidate path list

    • Auto-select lib/ or bin/ directories
    • Add backend loading functions
  • feat(loader): extend default library search paths on Linux and macOS

    • load_shared_library to include a path list feature (allowing you to add custom paths in addition to the default ones). You can later add your own paths to the libggml_base_paths candidate list in _ggml.py, such as those not commonly used Python paths.
    • fix: Enhance the handling logic for non-existent file paths.
    • Improves reliability of shared library discovery for system-wide installations.
  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/abb9f3c42b5e6acee9e8e37836ef691d1a41bdb8

  • feat: Sync llama.cpp llama/mtmd API Binding 20260219

More information see: https://github.com/JamePeng/llama-cpp-python/compare/b03224b2dde3c8cbdd8bf529794e3a41ac7f5751...6b38182a17effd0cedbf8736bee2464dd6c7b4da

[0.3.25]

  • feat: Refactor Llama class to use new LlamaSampler chain API from _internals This commit refactors the high-level Llama class to fully utilize the new C++ llama_sampler chain architecture via LlamaSamplingContext.

    • Replaced manual sampling logic and obsolete _init_sampler with LlamaSamplingContext.
    • Updated sample() and generate() to support the full suite of modern sampling strategies (DRY, XTC, Adaptive-P, Infill, etc.).
    • Added new sampling parameters to all generation methods (create_completion, create_chat_completion, __call__):
    • dynatemp_range, dynatemp_exponent (Dynamic Temperature)
    • min_keep
    • Refactored logits_processor handling to use CustomSampler adapter for better performance and C++ interop.
    • Improved sampling state management (e.g., repetition penalties) by persisting _sampling_ctx during generation.
    • Removed manual logit_bias processing in Python; now delegated to the underlying sampler chain.
  • feat: Separate the grammar sampler, improve the code stability of Sampler Chain processing, and fix some bugs.

  • Improve sampling and grammar lifecycle management, fix memory growth issues

    • Validate grammar sampler initialization and inputs
    • Replace unbounded prev token list with bounded deque by LlamaSamplingParams n_prev param
    • Reuse logits NumPy view to avoid repeated allocations
    • Reuse single-token buffers for grammar rejection sampling
    • Minor cleanups and consistency improvements in sampling flow
  • feat: Fix sampling history alignment with llama.cpp

  • test: update integration tests for new sampler architecture

  • test: replace unstable grammar test with deterministic mechanism check

  • fix: Optimize .gitignore and add macOS system files

  • feat: Refactor the build-wheels-metal.yaml

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/079feab9e3efee1d6d4ca370eac50f156e2dc6e8

  • feat: Sync llama.cpp llama/mtmd API Binding 20260214

More information see: https://github.com/JamePeng/llama-cpp-python/compare/4ab182382b87bbbba4fb05ff184b557414740103...dc5f7e5564dd68af9d62f7d450cda45313f80b5d

[0.3.24]

  • feat: Refactor sampling infrastructure to use llama.cpp sampler chain API

    • LlamaContext: Remove obsolete manual sampling methods.

    • LlamaSampler: Wrap C++ sampler chain; add support for DRY, XTC, Adaptive-P, and lazy grammar.add_grammar merged the processing branches for regular grammar and lazy grammar.

    • LlamaSamplingContext: Update to build and manage sampler chains instead of manual logic.

    • CustomSampler: Rewrite for proper C-struct lifecycle management and ABI compatibility.

    • Optimizations: Simplified redundant handling and variable usage.

  • feat: Optimize the method definition of class llama_sampler_i and CtypesPointer definition

  • feat: refactor LlamaSamplingParams class

    • base on llama.cpp/common/common.h
    • support more sampler params
  • feat: implement generative reranking with chat template support

    • Chat Template Support: Support for rerank templates has been introduced (via llama_model_chat_template(model, b"rerank")), which can automatically populate the query and document into a specific format.
    • Now support Qwen3-Reranker series model(Non-Vision)
    • Update README.md for Qwen3-Reranker
  • feat: Update README.md for python 3.14 and HIP (ROCm) guide

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/b83111815e9a79949257e9d4b087206b320a3063

  • feat: Sync llama.cpp llama/mtmd API Binding 20260208

More information see: https://github.com/JamePeng/llama-cpp-python/compare/9b985b75105ea9f5e7f2f5e7988a1149f233af2f...acb7efa09293cffa5912242c372b715555e2a884

[0.3.23]

  • feat: Implement MiniCPMv45ChatHandler for MiniCPM-V 4.5 with multi-image tracking

  • feat: Add python 3.14 support and pin numpy version(1.21.4-2.3.2) for compatibility

  • feat: Increased the n_batch parameter in Llama model initialization from 512 to 2048. Slightly improves the speed of some multimodal image processing.

  • fix: catch TemplateSyntaxError when parsing metadata chat templates

    • Some models (e.g., LLaVA 1.5) contain non-standard Jinja2 tags (like {% generation %}) in their metadata.

    • This commit adds a try-except block to prevent initialization crashes, allowing the model to load even if the metadata template is invalid.

  • feat: enhance default system prompt with strong multimodal + same-language capabilities

  • feat: Better Llava1.5 Chat Format

  • ci: Customize wheel filename to improve version identification

    • Parse generated wheel filenames in the build step.
    • Append CUDA version (cuXXX) and AVX level (basic/avx2) to the version string.
    • New format: package-ver+cuXXX.avxver-pyver-abiver-platform.whl (e.g., llama_cpp_python-0.3.23+cu130.basic-cp310-cp310-win_amd64.whl).
  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/b33df266d0a53f800c47513386920cff1019d70e

  • feat: Sync llama.cpp llama/mtmd API Binding 20260129

[0.3.22]

  • perf (TFFT): Optimize longest_token_prefix with Numpy SIMD and fast-fail probe

    • Vectorization: Replaced standard Python zip loop with Numpy SIMD comparison for high-performance context matching.

    • Fast Exit: Added an O(1) probe check (a[0] != b[0]) to eliminate Numpy conversion overhead on mismatches. Making the "Time To First Token" virtually instantaneous for cached sessions.

    • Memory Optimization: Only the intersection of the two sequences ([:min_len]) is converted to Numpy arrays, minimizing memory allocation.

    • Result: Achieved ~5x speedup (129ms -> 25ms) in KV cache reuse scenarios and ~2.5x speedup (554.23ms -> 201.62ms) in load time while maintaining stability on long contexts.

    • The comparative test results are here: JamePeng#47 (comment)

    • This change significantly reduces latency in RAG and chat applications on long contexts.

  • feat: Add support for adaptive_p and infill samplers and optimize the sampler logic.

  • feat: Add initialization checks to the Encoder-Decoder architecture.

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/10c98cbdf623d982f7491e8de5711e916a913192

  • feat: Sync llama.cpp llama/mtmd API Binding 20260116

[0.3.21]

  • perf: optimize tokenization and detokenization logic

    • Refactor tokenize, token_to_piece, and detokenize methods in _internals.py to significantly reduce Python loop overhead and improve the batch-processing performance and stability of load/prompt-eval.

    • Key changes:

      • Replace token_to_piece O(N) Python loops in detokenize with llama.cpp native batch C-API (llama_detokenize).
      • Implement dynamic buffer allocation to safely handle arbitrary token lengths (removing the hardcoded 32-byte limit).
      • Add automatic buffer resizing for tokenize to prevent overflow errors.
    • Performance observations (based on simple benchmarks):

      • Small Batch Processing (127 tokens): Latency reduced from ~117ms to ~37ms (approx. 3.1x speedup in processing loop).
      • Large Batch Processing (2420 tokens): Throughput improved from ~6905 t/s to ~8258 t/s.
      • General Latency: Total execution time for standard chat scenarios reduced by ~1.1s (from 8.4s to 7.3s).
      • The comparative test results are here: JamePeng#47 (comment)
  • feat: Add Granite-Docling multimodel support with GraniteDoclingChatHandler

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/b1377188784f9aea26b8abde56d4aee8c733eec7

  • feat: Sync llama.cpp llama/mtmd API Binding 20260110

[0.3.20]

[0.3.19]

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/9b8329de7a7200385aaac16ab4a2ab79ae14d829
  • feat: Sync llama.cpp llama/mtmd API Binding 20251230
  • refactor: Extract embedding logic to LlamaEmbedding class, Rerank support and fix parallel batching
    • Decoupled embedding and rerank logic into llama_embedding.py.
    • Implemented streaming batching for constant memory usage.
    • Fixed parallel batching errors by enabling kv_unified. such as "multiple embeddings in a single call"
    • Added native rank() support for Reranker models.
    • Added advanced normalization support (Euclidean, Taxicab, MaxInt16).
    • Added array,json+ output format for raw vector access.
    • The legacy embedding implementation in llama.py is now superseded by this optimized approach.
  • update README.md here: https://github.com/JamePeng/llama-cpp-python?tab=readme-ov-file#embeddings--reranking-gguf
  • refactor(LlamaBatch): enhance safety checks and fix indexing logic
  • refactor(Llama): enhance error handling and cleanup in eval method
  • Fixed a small bug in the Qwen3-VL chat template (by @alcoftTAO)
  • fix(Llama): implement fallback to full cache clear in eval
    • If memory_seq_rm fails to clear from current_pos, fallback to clearing the entire sequence to prevent invalid input batch errors.

More information see: https://github.com/JamePeng/llama-cpp-python/compare/2efaa346bc0aa0d6648938a0dcdf8d12240a8bed...aa653ea5c6a90505a7491e855cc16988293cedd5

[0.3.18]

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/ce734a8a2f9fb6eb4f0383ab1370a1b0014ab787
  • feat: Sync llama.cpp llama/mtmd API Binding 20251215
  • feat: implement GLM46VChatHandler for GLM-4.6V Series Multimodel
  • feat: implement LFM2VLChatHandler for LFM2-VL series Multimodel
  • feat: implement GLM41VChatHandler for GLM-4.1V-9B-Thinking Multimodel
  • workflow: Added workflows for compiling with CUDA 13.0.2 on Windows and Linux.
  • feat: Added the scan path for CUDA 13.0+ dynamic link libraries under Windows system ($env:CUDA_PATH\bin\x64)
  • Optimization: Improved batch token processing logic in Llava15ChatHandler.
  • perf: optimize LlamaModel.metadata reading performance
    • Increase initial buffer size to 16KB to eliminate re-allocations for large chat templates.
    • Cache ctypes function references to reduce loop overhead.
    • Repeated model loading can result in a cumulative speed improvement of 1-3%.
  • build: Improve CMakeLists target logic
  • refactor: optimize LlamaGrammar class code

More information see: https://github.com/JamePeng/llama-cpp-python/compare/67421d546ddcaa07678ac7921a9f124e7e3de10e...d5131e2ff41e05f83fd847052b06938c7a551a6a

[0.3.17]

  • feat: Update llama.cpp to ggml-org/llama.cpp/commit/054a45c3d313387a4becd5eae982285932852b35
  • feat: Sync llama.cpp llama/mtmd API Binding 20251121
  • feat: Support clip flash-attn
  • feat: 0day support Qwen3VLChatHandler into llama_chat_format.py
  • Update README.md for Qwen3VL example(Thinking/No Thinking)
  • feat: Better Qwen3VL chat template. (by @alcoftTAO)
  • feat: Implement LlamaTrieCache into llama_cache.py: Optimize LlamaCache lookup from O(N) to O(K) using a Trie, improves retrieval speed at least 40x compared to the original linear scan method of finding the longest prefix , thereby enhancing service responsiveness.
  • feat: Update Llava15ChatHandler to accept use_gpu, image_min_tokens, and image_max_tokens.Now can pass theimage_min_tokensparameter in Qwen3VLChatHandler to support bbox grounding tasks.
  • feat: Add Pillow process code in _load_image for VLM: that can reliably handle common formats, Supports 20+ image formats (PNG, JPEG, WebP, AVIF, BMP, ICO, TIFF, etc.). Images with alpha channel (PNG, WebP, etc.) → automatically composites on white/black background(white for dark content, black for bright content). Support image format see here: image-file-formats
  • feat: Optimize CUDA Wheel Build Workflow, now workflow action support python3.10-3.13 cu124-cu126-cu128 Basic(Non AVX)-AVX2 win-linux

More information see: https://github.com/JamePeng/llama-cpp-python/compare/e5392b52036bd2770ece5269352f5600a8db5639...fbb0ed2f089c663a5eb75aadcad08f768041ed72

[0.3.16]

[0.3.15]

[0.3.14]

[0.3.13]

[0.3.12]

[0.3.11]

  • fix: Update reference to llama_kv_cache_clear in Llama.embed. Closes #2037 by @abetlen in 9e5a4eaa84156084ed7bbb91e6efcc91dc6217bc

[0.3.10]

  • feat: Update llama.cpp to ggerganov/llama.cpp@28657a8229b5adc6028cf1c4ed62191792d2fdb0
  • feat: Add support for llama.cpp multimodal, add Qwen2.5-VL chat handler by @abetlen in cd548bd0f14210627798237d5c2ea78acfb88ccb

[0.3.9]

  • feat: Update llama.cpp to ggerganov/llama.cpp@8733e0cf6eefc7c7752297cc22d0836706f4222c
  • Make building mtmd optional by setting CMAKE_ARGS="-DMTMD_BUILD=OFF" and using MTMD_CPP_LIB to specify alternative path to shared library

[0.3.8]

  • feat: Update llama.cpp to ggerganov/llama.cpp@7841fc723e059d1fd9640e5c0ef19050fcc7c698

[0.3.7]

  • feat: Update llama.cpp to ggerganov/llama.cpp@794fe23f29fb40104975c91fe19f23798f7c726e
  • fix(ci): Fix the CUDA workflow by @oobabooga in #1894
  • fix: error showing time spent in llama perf context print, adds no_perf flag to Llama class by @shakalaca in #1898

[0.3.6]

  • feat: Update llama.cpp to ggerganov/llama.cpp@f7cd13301c2a88f97073fd119072b4cc92c08df1
  • fix(server): streaming resource lock by @gjpower in #1879

[0.3.5]

  • feat: Update llama.cpp to ggerganov/llama.cpp@26a8406ba9198eb6fdd8329fa717555b4f77f05f
  • fix(ci): Fix release by updating macos runner image to non-deprecated version by @abetlen in afedfc888462f9a6e809dc9455eb3b663764cc3f
  • fix(server): add missing await statements for async exit_stack handling by @gjpower in #1858

[0.3.4]

  • fix(ci): Build wheels for macos 13-15, cuda 12.1-12.4 by @abetlen in ca808028bd16b8327bd84128d48015a4b1304690

[0.3.3]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ce8784bdb153ff7794dde5a50b0ebfa51baa6171
  • fix: chat API logprobs format by @domdomegg in #1788
  • feat: Add support for CUDA 12.6, fix CUDA 12.5 by @Smartappli in #1775
  • fix: Make content not required in ChatCompletionRequestAssistantMessage by @feloy in #1807
  • fix: Fix pickling of Llama class by setting seed from _seed member by @abetlen in 2523472c3eccb9ab9277117cc4ff705212b6888a
  • fix: Fix logit-bias type hint by @ddh0 in #1802
  • fix(server): Avoid thread starvation on many concurrent requests by making use of asyncio to lock llama_proxy context by @gjpower in #1798
  • fix(server): Added missing exit_stack.close() to /v1/chat/completions by @Ian321 in #1796
  • fix(examples): Refactor Batching notebook to use new sampler chain API by @lukestanley in #1793
  • fix(docs): Update development instructions by @Florents-Tselai in #1833
  • fix(docs): Remove ref to llama_eval in llama_cpp.py docs by @richdougherty in #1819

[0.3.2]

  • feat: Update llama.cpp to ggerganov/llama.cpp@74d73dc85cc2057446bf63cc37ff649ae7cebd80

[0.3.1]

  • feat: Update llama.cpp to ggerganov/llama.cpp@c919d5db39c8a7fcb64737f008e4b105ee0acd20
  • feat: Expose libggml in internal APIs by @abetlen in #1761
  • fix: Fix speculative decoding by @abetlen in 9992c5084a3df2f533e265d10f81d4269b97a1e6 and e975dabf74b3ad85689c9a07719cbb181313139b
  • misc: Rename all_text to remaining_text by @xu-song in #1658

[0.3.0]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ea9c32be71b91b42ecc538bd902e93cbb5fb36cb
  • feat: Enable detokenizing special tokens with special=True by @benniekiss in #1596
  • feat(ci): Speed up CI workflows using uv, add support for CUDA 12.5 wheels by @Smartappli in e529940f45d42ed8aa31334123b8d66bc67b0e78
  • feat: Add loading sharded GGUF files from HuggingFace with Llama.from_pretrained(additional_files=[...]) by @Gnurro in 84c092063e8f222758dd3d60bdb2d1d342ac292e
  • feat: Add option to configure n_ubatch by @abetlen in 6c44a3f36b089239cb6396bb408116aad262c702
  • feat: Update sampling API for llama.cpp. Sampling now uses sampler chain by @abetlen in f8fcb3ea3424bcfba3a5437626a994771a02324b
  • fix: Don't store scores internally unless logits_all=True. Reduces memory requirements for large context by @abetlen in 29afcfdff5e75d7df4c13bad0122c98661d251ab
  • fix: Fix memory allocation of ndarray in by @xu-song in #1704
  • fix: Use system message in og qwen format by @abetlen in 98eb092d3c6e7c142c4ba2faaca6c091718abbb3

[0.2.90]

  • feat: Update llama.cpp to ggerganov/llama.cpp@1d1ccce67613674c75c9c7e3fa4c1e24e428ba48
  • feat: Add support for MiniCPMv26ChatHandler and minicpm-v-26 in server by @abetlen in f70df824985d875226793b94dacc0c302a4256b2

[0.2.89]

  • feat: Update llama.cpp to ggerganov/llama.cpp@cfac111e2b3953cdb6b0126e67a2487687646971
  • fix: Llama.close didn't free lora adapter by @jkawamoto in #1679
  • fix: missing dependencies for test by @jkawamoto in #1680

[0.2.88]

  • feat: Update llama.cpp to ggerganov/llama.cpp@fc4ca27b25464a11b3b86c9dbb5b6ed6065965c2
  • fix: only print 'cache saved' in verbose mode by @lsorber in #1668
  • fix: Added back from_file method to LlamaGrammar by @ExtReMLapin in #1673
  • fix: grammar prints on each call by @abetlen in 0998ea0deea076a547d54bd598d6b413b588ee2b
  • feat: Enable recursive search of HFFS.ls when using from_pretrained by @benHeidabetlen in #1656
  • feat: Add more detailed log for prefix-match by @xu-song in #1659

[0.2.87]

  • feat: Update llama.cpp to ggerganov/llama.cpp@be55695eff44784a141a863f273661a6bce63dfc
  • fix: Include all llama.cpp source files and subdirectories by @abetlen in 9cad5714ae6e7c250af8d0bbb179f631368c928b
  • feat(ci): Re-build wheel index automatically when releases are created by @abetlen in 198f47dc1bd202fd2b71b29e041a9f33fe40bfad

[0.2.86]

  • feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
  • feat: Ported back new grammar changes from C++ to Python implementation by @ExtReMLapin in (#1637)
  • fix: llama_grammar_accept_token arg order by @tc-wolf in (#1649)

[0.2.85]

  • feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
  • fix: Missing LoRA adapter after API change by @shamitv in #1630
  • fix(docker): Update Dockerfile BLAS options by @olivierdebauche in #1632
  • fix(docker): Fix GGML_CUDA param by @olivierdebauche in #1633
  • fix(docker): Update Dockerfile build options from LLAMA_ to GGML_ by @olivierdebauche in #1634
  • feat: FreeBSD compatibility by @yurivict in #1635

[0.2.84]

  • feat: Update llama.cpp to ggerganov/llama.cpp@4730faca618ff9cee0780580145e3cbe86f24876
  • fix: fix: Correcting run.sh filepath in Simple Docker implementation by @mashuk999 in #1626

[0.2.83]

  • feat: Update llama.cpp to ggerganov/llama.cpp@081fe431aa8fb6307145c4feb3eed4f48cab19f8
  • feat: Add 'required' literal to ChatCompletionToolChoiceOption by @mjschock in #1597
  • fix: Change repeat_penalty to 1.0 to match llama.cpp defaults by @ddh0 in #1590
  • fix(docs): Update README.md typo by @ericcurtin in #1589
  • fix(server): Use split_mode from model settings by @grider-withourai in #1594
  • feat(ci): Dockerfile update base images and post-install cleanup by @Smartappli in #1530

[0.2.82]

  • feat: Update llama.cpp to ggerganov/llama.cpp@7fdb6f73e35605c8dbc39e9f19cd9ed84dbc87f2

[0.2.81]

  • feat: Update llama.cpp to ggerganov/llama.cpp@968967376dc2c018d29f897c4883d335bbf384fb
  • fix(ci): Fix CUDA wheels, use LLAMA_CUDA instead of removed LLAMA_CUBLAS by @abetlen in 4fb6fc12a02a68884c25dd9f6a421cacec7604c6
  • fix(ci): Fix MacOS release, use macos-12 image instead of removed macos-11 by @abetlen in 3a551eb5263fdbd24b36d7770856374c04e92788

[0.2.80]

  • feat: Update llama.cpp to ggerganov/llama.cpp@023b8807e10bc3ade24a255f01c1ad2a01bb4228
  • fix(server): Fix bug in FastAPI streaming response where dependency was released before request completes causing SEGFAULT by @abetlen in 296304b60bb83689659883c9cc24f4c074dd88ff
  • fix(server): Update default config value for embeddings to False to fix error in text generation where logits were not allocated by llama.cpp by @abetlen in bf5e0bb4b151f4ca2f5a21af68eb832a96a79d75
  • fix(ci): Fix the CUDA workflow by @oobabooga in #1551
  • docs: Update readme examples to use newer Qwen2 model by @jncraton in #1544

[0.2.79]

  • feat: Update llama.cpp to ggerganov/llama.cpp@9c77ec1d74874ee22bdef8f110e8e8d41389abf2
  • feat(ci): Update workflows and pre-built wheels by @Smartappli in #1416
  • feat: Add .close() method to Llama class to explicitly free model from memory by @jkawamoto in #1513
  • feat: Support SPM infill by @CISC in #1492

[0.2.78]

  • feat: Update llama.cpp to ggerganov/llama.cpp@fd5ea0f897ecb3659d6c269ef6f3d833e865ead7
  • fix: Avoid duplicate special tokens in chat formats by @CISC in #1439
  • fix: fix logprobs when BOS is not present by @ghorbani in #1471
  • feat: adding rpc_servers parameter to Llama class by @chraac in #1477

[0.2.77]

  • feat: Update llama.cpp to ggerganov/llama.cpp@bde7cd3cd949c1a85d3a199498ac98e78039d46f
  • fix: string value kv_overrides by @abetlen in df45a4b3fe46e72664bda87301b318210c6d4782
  • fix: Fix typo in Llama3VisionAlphaChatHandler by @abetlen in 165b4dc6c188f8fda2fc616154e111f710484eba
  • fix: Use numpy recarray for candidates data, fixes bug with temp < 0 by @abetlen in af3ed503e9ce60fe6b5365031abad4176a3536b3 fix: Disable Windows+CUDA workaround when compiling for HIPBLAS by Engininja2 in #1493

[0.2.76]

  • feat: Update llama.cpp to ggerganov/llama.cpp@0df0aa8e43c3378975269a51f9b876c8692e70da
  • feat: Improve Llama.eval performance by avoiding list conversion by @thoughtp0lice in #1476
  • example: LLM inference with Ray Serve by @rgerganov in #1465

[0.2.75]

  • feat: Update llama.cpp to ggerganov/llama.cpp@13ad16af1231ab2d245d35df3295bcfa23de1305
  • fix: segfault for models without eos / bos tokens by @abetlen in d99a6ba607a4885fb00e63e967964aa41bdbbbcb
  • feat: add MinTokensLogitProcessor and min_tokens argument to server by @twaka in #1333
  • misc: Remove unnecessary metadata lookups by @CISC in #1448

[0.2.74]

  • feat: Update llama.cpp to ggerganov/llama.cpp@b228aba91ac2cd9eb90e9d423ba1d0d20e0117e2
  • fix: Enable CUDA backend for llava by @abetlen in 7f59856fa6f3e23f07e12fc15aeb9359dc6c3bb4
  • docs: Fix typo in README.md by @yupbank in #1444

[0.2.73]

  • feat: Update llama.cpp to ggerganov/llama.cpp@25c6e82e7a1ad25a42b0894e87d9b5c557409516
  • fix: Clear kv cache at beginning of image chat formats to avoid bug when image is evaluated first by @abetlen in ac55d0a175115d1e719672ce1cb1bec776c738b1

[0.2.72]

  • fix(security): Remote Code Execution by Server-Side Template Injection in Model Metadata by @retr0reg in b454f40a9a1787b2b5659cd2cb00819d983185df
  • fix(security): Update remaining jinja chat templates to use immutable sandbox by @CISC in #1441

[0.2.71]

  • feat: Update llama.cpp to ggerganov/llama.cpp@911b3900dded9a1cfe0f0e41b82c7a29baf3a217
  • fix: Make leading bos_token optional for image chat formats, fix nanollava system message by @abetlen in 77122638b4153e31d9f277b3d905c2900b536632
  • fix: free last image embed in llava chat handler by @abetlen in 3757328b703b2cd32dcbd5853271e3a8c8599fe7

[0.2.70]

  • feat: Update llama.cpp to ggerganov/llama.cpp@c0e6fbf8c380718102bd25fcb8d2e55f8f9480d1
  • feat: fill-in-middle support by @CISC in #1386
  • fix: adding missing args in create_completion for functionary chat handler by @skalade in #1430
  • docs: update README.md @eltociear in #1432
  • fix: chat_format log where auto-detected format prints None by @balvisio in #1434
  • feat(server): Add support for setting root_path by @abetlen in 0318702cdc860999ee70f277425edbbfe0e60419
  • feat(ci): Add docker checks and check deps more frequently by @Smartappli in #1426
  • fix: detokenization case where first token does not start with a leading space by @noamgat in #1375
  • feat: Implement streaming for Functionary v2 + Bug fixes by @jeffrey-fong in #1419
  • fix: Use memmove to copy str_value kv_override by @abetlen in 9f7a85571ae80d3b6ddbd3e1bae407b9f1e3448a
  • feat(server): Remove temperature bounds checks for server by @abetlen in 0a454bebe67d12a446981eb16028c168ca5faa81
  • fix(server): Propagate flash_attn to model load by @dthuerck in #1424

[0.2.69]

  • feat: Update llama.cpp to ggerganov/llama.cpp@6ecf3189e00a1e8e737a78b6d10e1d7006e050a2
  • feat: Add llama-3-vision-alpha chat format by @abetlen in 31b1d95a6c19f5b615a3286069f181a415f872e8
  • fix: Change default verbose value of verbose in image chat format handlers to True to match Llama by @abetlen in 4f01c452b6c738dc56eacac3758119b12c57ea94
  • fix: Suppress all logs when verbose=False, use hardcoded fileno's to work in colab notebooks by @abetlen in f116175a5a7c84569c88cad231855c1e6e59ff6e
  • fix: UTF-8 handling with grammars by @jsoma in #1415

[0.2.68]

  • feat: Update llama.cpp to ggerganov/llama.cpp@77e15bec6217a39be59b9cc83d6b9afb6b0d8167
  • feat: Add option to enable flash_attn to Lllama params and ModelSettings by @abetlen in 22d77eefd2edaf0148f53374d0cac74d0e25d06e
  • fix(ci): Fix build-and-release.yaml by @Smartappli in #1413

[0.2.67]

  • fix: Ensure image renders before text in chat formats regardless of message content order by @abetlen in 3489ef09d3775f4a87fb7114f619e8ba9cb6b656
  • fix(ci): Fix bug in use of upload-artifact failing to merge multiple artifacts into a single release by @abetlen in d03f15bb73a1d520970357b702a9e7d4cc2a7a62

[0.2.66]

  • feat: Update llama.cpp to ggerganov/llama.cpp@8843a98c2ba97a25e93319a104f9ddfaf83ce4c4
  • feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) by @abetlen in #1147
  • ci(fix): Workflow actions updates and fix arm64 wheels not included in release by @Smartappli in #1392
  • ci: Add support for pre-built cuda 12.4.1 wheels by @Smartappli in #1388
  • feat: Add support for str type kv_overrides by @abetlen in a411612b385cef100d76145da1fbd02a7b7cc894
  • fix: Functionary bug fixes by @jeffrey-fong in #1385
  • examples: fix quantize example by @iyubondyrev in #1387
  • ci: Update dependabot.yml by @Smartappli in #1391

[0.2.65]

  • feat: Update llama.cpp to ggerganov/llama.cpp@46e12c4692a37bdd31a0432fc5153d7d22bc7f72
  • feat: Allow for possibly non-pooled embeddings by @iamlemec in #1380

[0.2.64]

  • feat: Update llama.cpp to ggerganov/llama.cpp@4e96a812b3ce7322a29a3008db2ed73d9087b176
  • feat: Add llama-3 chat format by @andreabak in #1371
  • feat: Use new llama_token_is_eog in create_completions by @abetlen in d40a250ef3cfaa8224d12c83776a2f1de96ae3d1
  • feat(server): Provide ability to dynamically allocate all threads if desired using -1 by @sean-bailey in #1364
  • ci: Build arm64 wheels by @gaby in 611781f5319719a3d05fefccbbf0cc321742a026
  • fix: Update scikit-build-core build dependency avoid bug in 0.9.1 by @evelkey in #1370

[0.2.63]

  • feat: Update llama.cpp to ggerganov/llama.cpp@0e4802b2ecbaab04b4f829fde4a3096ca19c84b5
  • feat: Add stopping_criteria to ChatFormatter, allow stopping on arbitrary token ids, fixes llama3 instruct by @abetlen in cc81afebf04d26ca1ac3cf72f23f18da6ab58588

[0.2.62]

  • feat: Update llama.cpp to ggerganov/llama.cpp@3b8f1ec4b18770531d0b1d792f3edf08254e4f0c
  • feat: update grammar schema converter to match llama.cpp by @themrzmaster in #1353
  • feat: add disable_ping_events flag by @khimaros in #1257
  • feat: Make saved state more compact on-disk by @tc-wolf in #1296
  • feat: Use all available CPUs for batch processing by @ddh0 in #1345

[0.2.61]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ba5e134e073ec6837078c874aba44a702944a676
  • fix: pass correct type to chat handlers for chat completion logprobs by @abetlen in bb65b4d76411112c6fb0bf759efd746f99ef3c6b
  • feat: Add support for yaml based server configs by @abetlen in 060bfa64d529ade2af9b1f4e207a3937bbc4138f
  • feat: Add typechecking for ctypes structure attributes by @abetlen in 1347e1d050fc5a9a32ffe0bb3e22858da28003bd

[0.2.60]

  • feat: Update llama.cpp to ggerganov/llama.cpp@75cd4c77292034ecec587ecb401366f57338f7c0
  • fix: Always embed metal library by @abetlen in b3bfea6dbfb6ed9ce18f9a2723e0a9e4bd1da7ad
  • fix: missing logprobs in response, incorrect response type for functionary by @abetlen in 1ae3abbcc3af7f4a25a3ffc40b246f18039565e8
  • fix(docs): incorrect tool_choice example by @CISC in #1330

[0.2.59]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c
  • feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal by @abetlen, @jllllll, and @oobabooga in #1247
  • fix: segfault when logits_all=False by @abetlen in 8649d7671bd1a7c0d9cc6a5ad91c6ca286512ab3
  • fix: last tokens passing to sample_repetition_penalties function by @ymikhailov in #1295

[0.2.58]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c
  • feat: add support for KV cache quantization options by @Limour-dev in #1307
  • feat: Add logprobs support to chat completions by @windspirit95 in #1311
  • fix: set LLAMA_METAL_EMBED_LIBRARY=on on MacOS arm64 by @bretello in #1289
  • feat: Add tools/functions variables to Jinja2ChatFormatter, add function response formatting for all simple chat formats by @CISC in #1273
  • fix: Changed local API doc references to hosted by by @lawfordp2017 in #1317

[0.2.57]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ac9ee6a4ad740bc1ee484ede43e9f92b5af244c1
  • fix: set default embedding pooling type to unspecified by @abetlen in 4084aabe867b8ec2aba1b22659e59c9318b0d1f3
  • fix: Fix and optimize functionary chat handler by @jeffrey-fong in #1282
  • fix: json mode for basic chat formats by @abetlen in 20e6815252d0efd9f015f7adbf108faaf36e3f3c

[0.2.56]

  • feat: Update llama.cpp to ggerganov/llama.cpp@c2101a2e909ac7c08976d414e64e96c90ee5fa9e
  • feat(server): Add endpoints for tokenize, detokenize and count tokens by @felipelo in #1136
  • feat: Switch embed to llama_get_embeddings_seq by @iamlemec in #1263
  • fix: Fixed json strings grammar by blacklisting character control set by @ExtReMLapin in d02a9cf16ff88ad011e2eb1ce29f4d9400f13cd1
  • fix: Check for existence of clip model path by @kejcao in #1264

[0.2.55]

  • feat: Update llama.cpp to ggerganov/llama.cpp@9731134296af3a6839cd682e51d9c2109a871de5
  • docs: fix small typo in README: 'model know how' -> 'model knows how' by @boegel in #1244

[0.2.54]

  • feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
  • docs: fix typo in README.md embeddings example by @iamlemec in #1232

[0.2.53]

  • feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
  • fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter by @CISC in #1230

[0.2.52]

  • feat: Update llama.cpp to ggerganov/llama.cpp@a33e6a0d2a66104ea9a906bdbf8a94d050189d91
  • fix: Llava15ChatHandler (this function takes at least 4 arguments) by @abetlen in 8383a9e5620f5df5a88f62da16813eac200dd706

[0.2.51]

  • feat: Update llama.cpp to ggerganov/llama.cpp@c39373398803c669056304090050fe3f44b41bf9
  • fix: Restore type hints for low-level api by @abetlen in 19234aa0dbd0c3c87656e65dd2b064665371925b

[0.2.50]

  • docs: Update Functionary OpenAI Server Readme by @jeffrey-fong in #1193
  • fix: LlamaHFTokenizer now receives pre_tokens by @abetlen in 47bad30dd716443652275099fa3851811168ff4a

[0.2.49]

  • fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8' in Llama.save_state by @abetlen in db776a885cd4c20811f22f8bd1a27ecc71dba927
  • feat: Auto detect Mixtral's slightly different format by @lukestanley in #1214

[0.2.48]

  • feat: Update llama.cpp to ggerganov/llama.cpp@15499eb94227401bdc8875da6eb85c15d37068f7
  • feat: Add Google's Gemma formatting via chat_format="gemma" by @alvarobartt in #1210
  • feat: support minItems/maxItems in JSON grammar converter by @nopperl in 3921e10770996d95a9eb22c8248bacef39f69365
  • fix: Update from_pretrained defaults to match hf_hub_download and pull to local cache folder by @abetlen in e6d6260a91b7831733f7d1f73c7af46a3e8185ed
  • fix: Raise exceptions when llama model or context fails to load by @abetlen in dd22010e85265ae840c76ec835d67a29ed852722
  • docs: Update README.md to fix pip install llama cpp server by @audip in #1187

[0.2.47]

  • feat: Update llama.cpp to ggerganov/llama.cpp@973053d8b0d04809836b3339a50f68d9c842de90

[0.2.46]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ba2135ccae7462470b3865c6e41d2e1d734eac05
  • feat: Pull models directly from huggingface by @abetlen in #1206
  • feat(low-level-api): Improve API static type-safety and performance. Low level api functions are positional args only now. by @abetlen in #1205

[0.2.45]

  • feat: Update llama.cpp to ggerganov/llama.cpp@89febfed9322c8849520dc63c93ee4f5fd72556e

[0.2.44]

  • feat: Update llama.cpp to ggerganov/llama.cpp@4524290e87b8e107cc2b56e1251751546f4b9051
  • fix: create_embedding broken response for input type str by @abetlen in 0ce66bc080fe537590b05b24bf442480bf2dd045
  • fix: Use '\n' seperator for EventSourceResponse by @khimaros in #1188
  • fix: Incorporate embedding pooling layer fixes by @iamlemec in #1194

[0.2.43]

  • feat: Update llama.cpp to ggerganov/llama.cpp@8084d554406b767d36b3250b3b787462d5dd626f
  • feat: Support batch embeddings by @iamlemec in #1186
  • fix: submodule kompute is not included in sdist by @abetlen in 7dbbfdecadebe7750be650d9409959640ff9a460
  • fix: fix: Update openbuddy prompt format by @abetlen in 07a783779a62a4aac0b11161c7e0eb983ff215f8

[0.2.42]

  • feat: Update llama.cpp to ggerganov/llama.cpp@ea9c8e11436ad50719987fa23a289c74b7b40d40
  • fix: sample idx off-by-one error for logit_processors by @lapp0 in #1179
  • fix: chat formatting bugs in chatml-function-calling by @abetlen in 4b0e3320bd8c2c209e29978d0b21e2e471cc9ee3 and 68fb71b6a26a1e57331868f959b47ab4b87851e1

[0.2.41]

  • feat: Update llama.cpp to ggerganov/llama.cpp@895407f31b358e3d9335e847d13f033491ec8a5b
  • fix: Don't change order of json schema object properties in generated grammar unless prop_order is passed by @abetlen in d1822fed6b706f38bd1ff0de4dec5baaa3cf84fa

[0.2.40]

  • feat: Update llama.cpp to ggerganov/llama.cpp@3bdc4cd0f595a6096cca4a64aa75ffa8a3503465
  • feat: Generic chatml Function Calling using chat_format="chatml-function-calling"` by @abetlen in #957
  • fix: Circular dependancy preventing early Llama object free by @notwa in #1176
  • docs: Set the correct command for compiling with syscl support by @akarshanbiswas in #1172
  • feat: use gpu backend for clip if available by @iamlemec in #1175

[0.2.39]

  • feat: Update llama.cpp to ggerganov/llama.cpp@b08f22c882a1443e6b97081f3ce718a4d1a741f8
  • fix: Fix destructor logging bugs by using llama_log_callback to avoid suppress_stdout_stderr by @abetlen in 59760c85eddc72dfcc1839f43760ef72c23d6874

[0.2.38]

  • feat: Update llama.cpp to ggerganov/llama.cpp@1cfb5372cf5707c8ec6dde7c874f4a44a6c4c915
  • feat: Add speculative decoding by @abetlen in #1120
  • fix: Pass raise_exception and add_generation_prompt to jinja2 chat template by @abetlen in 078cca0361bf5a94d2cf52ed04980d20e32d6f95

[0.2.37]

  • feat: Update llama.cpp to ggerganov/llama.cpp@fea4fd4ba7f6b754ac795387b275e1a014a77bde
  • feat: Automatically set chat format from gguf by @abetlen in #1110

[0.2.36]

  • feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f
  • feat: Add mistral instruct chat format as "mistral-instruct" by @Rafaelblsilva in #799

[0.2.35]

  • feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00

[0.2.34]

  • feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855
  • feat: Add json schema mode by @abetlen in #1122

[0.2.33]

  • feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4
  • feat(server): include llama-cpp-python version in openapi spec by @abetlen in cde7514c3d28e6d52f272614e9957208c344dde5
  • fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. by @abetlen in 5b982d0f8c6f35242c8862ffdce00e17cea0b44f
  • fix: GGUF metadata KV overrides, re #1011 by @phiharri in #1116
  • fix: llama_log_set should be able to accept null pointer by @abetlen in c970d41a85381fd55235136f123422df0bf0c7e7

[0.2.32]

  • feat: Update llama.cpp to ggerganov/llama.cpp@504dc37be8446fb09b1ede70300250ad41be32a2
  • fix: from_json_schema oneof/anyof bug by @jndiogo in d3f5528ca8bcb9d69d4f27e21631e911f1fb9bfe
  • fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats by @abetlen in 24f39454e91cf5dddbc4b6041aead4accc7c7a2d
  • feat: Add add_generation_prompt option for jinja2chatformatter by @abetlen in 7f3209b1eb4ad3260ba063801fab80a8c25a2f4c
  • feat: Add Jinja2ChatFormatter by @abetlen in be09318c26add8674ce494ae7cc480cce72a4146
  • feat: Expose gguf model metadata in metadata property by @abetlen in 5a34c57e5479e50c99aba9b38218cc48e6560b81

[0.2.31]

  • feat: Update llama.cpp to ggerganov/llama.cpp@a5cacb22b2114fd9adf61c00cbb237384d86bced
  • fix: Mirostat sampling now passes correct type to ctypes and tracks state during generation by @abetlen in 3babe3512cb95743108f2b595210c38ed6f1b904
  • fix: Python3.8 support in server by @abetlen in 141293a75b564a8699e0acba1da24d9aa1cf0ab1

[0.2.30]

  • feat: Update llama.cpp to ggerganov/llama.cpp@57e2a7a52a819883f40dada8a2edc24ecf48186b
  • feat(server): Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files by @abetlen in b8fc1c7d83ad4a9207c707ba1d954fe580286a01
  • feat: Integration of Jinja2 Templating for chat formats by @teleprint-me in #875
  • fix: Offload KQV by default by @abetlen in 48c3b77e6f558a9899de0e1155c7dc0c7958d8e8
  • fix: Support Accept text/event-stream in chat and completion endpoints, resolves #1083 by @aniljava in #1088
  • fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 by @K-Mistele in #1093

[0.2.29]

  • feat: Update llama.cpp to ggerganov/llama.cpp@4483396751c79dea540808b9cb9238245d06da2b
  • feat: Add split_mode option by @abetlen in 84615adbc6855c8384807c42f0130f9a1763f99d
  • feat: Implement GGUF metadata KV overrides by @phiharri in #1011
  • fix: Avoid "LookupError: unknown encoding: ascii" when open() called in a destructor by @yieldthought in #1012
  • fix: Fix low_level_api_chat_cpp example to match current API by @aniljava in #1086
  • fix: Fix Pydantic model parsing by @DeNeutoy in #1087

[0.2.28]

  • feat: Update llama.cpp to ggerganov/llama.cpp@6efb8eb30e7025b168f3fda3ff83b9b386428ad6
  • feat: Add ability to pass in penalize_nl param by @shankinson in #1068
  • fix: print_grammar to stderr by @turian in #1052

[0.2.27]

  • feat: Update llama.cpp to ggerganov/llama.cpp@b3a7c20b5c035250257d2b62851c379b159c899a
  • feat: Add saiga chat format by @femoiseev in #1050
  • feat: Added chatglm3 chat format by @xaviviro in #1059
  • fix: Correct typo in README.md by @qeleb in (#1058)

[0.2.26]

  • feat: Update llama.cpp to ggerganov/llama.cpp@f6793491b5af6da75edad34d6f503ef86d31b09f

[0.2.25]

  • feat(server): Multi model support by @D4ve-R in #931
  • feat(server): Support none defaulting to infinity for completions by @swg in #111
  • feat(server): Implement openai api compatible authentication by @docmeth2 in #1010
  • fix: text_offset of multi-token characters by @twaka in #1037
  • fix: ctypes bindings for kv override by @phiharri in #1011
  • fix: ctypes definitions of llama_kv_cache_view_update and llama_kv_cache_view_free. by @e-c-d in #1028

[0.2.24]

  • feat: Update llama.cpp to ggerganov/llama.cpp@0e18b2e7d0b5c0a509ea40098def234b8d4a938a
  • feat: Add offload_kqv option to llama and server by @abetlen in 095c65000642a3cf73055d7428232fb18b73c6f3
  • feat: n_ctx=0 now uses the n_ctx_train of the model by @DanieleMorotti in #1015
  • feat: logits_to_logprobs supports both 2-D and 3-D logits arrays by @kddubey in #1002
  • fix: Remove f16_kv, add offload_kqv fields in low level and llama apis by @brandonrobertz in #1019
  • perf: Don't convert logprobs arrays to lists by @kddubey in #1021
  • docs: Fix README.md functionary demo typo by @evelynmitchell in #996
  • examples: Update low_level_api_llama_cpp.py to match current API by @jsoma in #1023

[0.2.23]

  • Update llama.cpp to ggerganov/llama.cpp@948ff137ec37f1ec74c02905917fa0afc9b97514
  • Add qwen chat format by @yhfgyyf in #1005
  • Add support for running the server with SSL by @rgerganov in #994
  • Replace logits_to_logprobs implementation with numpy equivalent to llama.cpp by @player1537 in #991
  • Fix UnsupportedOperation: fileno in suppress_stdout_stderr by @zocainViken in #961
  • Add Pygmalion chat format by @chiensen in #986
  • README.md multimodal params fix by @zocainViken in #967
  • Fix minor typo in README by @aniketmaurya in #958

[0.2.22]

  • Update llama.cpp to ggerganov/llama.cpp@8a7b2fa528f130631a5f43648481596ab320ed5a
  • Fix conflict with transformers library by kddubey in #952

[0.2.21]

  • Update llama.cpp to ggerganov/llama.cpp@64e64aa2557d97490b2fe1262b313e2f4a1607e3
  • Make building llava optional by setting CMAKE_ARGS="-DLLAVA_BUILD=OFF" and using LLAVA_CPP_LIB to specify alternative path to shared library by @abetlen in e3941d9c674dbd9891dc3ceda390daeb21f05fd1

[0.2.20]

  • Update llama.cpp to ggerganov/llama.cpp@b38a16dfcff88d547f78f52d1bea31b84a05aff7
  • Add zephyr chat format by @fakerybakery in #938
  • Add baichuan chat format by @caiyesd in #938
  • Add baichuan-2 chat format by @caiyesd in #936
  • Improve documentation for server chat formats by @jooray in #934
  • Fix typo in README by @antonvice in 940
  • Fix typo in the Open Orca chat format by @gardner in #947

[0.2.19]

  • Update llama.cpp to ggerganov/llama.cpp@0b871f1a04ef60e114bbe43004fd9c21114e802d
  • Fix #569: stop parameter in chat completion api should accept str by @abetlen in 128dc4731fa846ead7e684a137ca57d8931b8899
  • Document server host and port parameters by @jamesbraza in #768
  • Do not set grammar to None when initializing LlamaGrammar by @mthuurne in #834
  • Add mistrallite, intel, and openchat formats by @fakerybakery in #927
  • Add support for min_p parameter by @tk-master in #921
  • Fix #929: tokenizer adding leading space when generating from empty prompt by @abetlen in a34d48014192771d2e308a76c22f33bc0318d983
  • Fix low level api example by @zocainViken in #925
  • Fix missing package in openblas docker image by @ZisisTsatsas in #920

[0.2.18]

  • Update llama.cpp to ggerganov/llama.cpp@6bb4908a17150b49373b5f977685b2e180a04f6f

[0.2.17]

  • Update llama.cpp to ggerganov/llama.cpp@df9d1293defe783f42bc83af732d3c670552c541
  • Hotfix: Set CUDA_ARCHITECTURES=OFF for llava_shared target on Windows by @abetlen in 4388f3341413110217b98c4f097ac5c590bdf40b

[0.2.16]

  • Update llama.cpp to ggerganov/llama.cp@a75fa576abba9d37f463580c379e4bbf1e1ad03c
  • Add set_seed to Llama class by @abetlen in fd41ed3a908761d286102a019a34c2938a15118d
  • Fix server doc arguments by @kjunggithub in #892
  • Fix response_format handler in llava chat handler by @abetlen in b62c44983921197ed10a7d29dc4ba920e9979380
  • Fix default max_tokens, chat completion is now unlimited (to context length) and completion is 16 tokens to match OpenAI defaults by @abetlen in e7962d2c733cbbeec5a37392c81f64185a9a39e8
  • Fix json_schema_to_gbnf helper so that it takes a json schema string as input instead by @abetlen in faeae181b1e868643c0dc28fcf039f077baf0829
  • Add support for $ref and $def in json_schema_to_gbnf to handle more complex function schemas by @abetlen in 770df344369c0630df1be14be9f9e301e7c56d24
  • Update functionary chat handler for new OpenAI api by abetlen in 1b376c62b775b401653facf25a519d116aafe99a
  • Fix add default stop sequence to chatml chat format by @abetlen in b84d76a844149216d511cfd8cdb9827148a1853c
  • Fix sampling bug when logits_all=False by @abetlen in 6f0b0b1b840af846938ed74d0e8170a91c40e617

[0.2.15]

  • Update llama.cpp to ggerganov/llama.cpp@0a7c980b6f94a049cb804573df2d8092a34df8e4
  • Add support for Llava1.5 multimodal models by @damian0815 and @abetlen in #821
  • Update OpenAI API compatibility to match dev day update by @abetlen in #821
  • Add seed parameter to completion and chat_completion functions of Llama class by @abetlen in 86aeb9f3a14808575d2bb0076e6acb4a30907e6a
  • Add JSON mode support to constrain chat completion to JSON objects by @abetlen in b30b9c338bf9af316d497ea501d39f5c246900db

[0.2.14]

  • Update llama.cpp to ggerganov/llama.cpp@f0b30ef7dc1360922ccbea0a8cd3918ecf15eaa7
  • Add support for Huggingface Autotokenizer Chat Formats by @bioshazard and @abetlen in #790 and bbffdaebaa7bb04b543dbf683a07276087251f86
  • Fix llama-2 chat format by @earonesty in #869
  • Add support for functionary chat format by @abetlen in #784
  • Migrate inference from deprecated llama_evalAPI to llama_batch and llama_decode by @abetlen in #795

[0.2.13]

  • Update llama.cpp to ggerganov/llama.cpp@51b2fc11f7f605fff49725a4540e9a6ef7b51b70
  • Fix name 'open' is not defined exception when deleting model by @abetlen in 011b95d7f34cbfc528af75a892757bd9a20838ab
  • Fix tokenization of special characters by @antoine-lizee in #850

[0.2.12]

  • Update llama.cpp to ggerganov/llama.cpp@50337961a678fce4081554b24e56e86b67660163
  • Fix missing n_seq_id in llama_batch by @NickAlgra in #842
  • Fix for shared libraries on Windows that start with lib prefix by @sujeendran in #848
  • Fix exception raised in __del__ when freeing models by @cebtenzzre in #846
  • Performance improvement for logit bias by @zolastro in #851
  • Fix suffix check arbitrary code execution bug by @mtasic85 in #854
  • Fix typo in function_call parameter in llama_types.py by @akatora28 in #849
  • Fix streaming not returning finish_reason by @gmcgoldr in #798
  • Fix n_gpu_layers check to allow values less than 1 for server by @hxy9243 in #826
  • Supppress stdout and stderr when freeing model by @paschembri in #803
  • Fix llama2 chat format by @delock in #808
  • Add validation for tensor_split size by @eric1932 #820
  • Print stack trace on server error by @abetlen in d6a130a052db3a50975a719088a9226abfebb266
  • Update docs for gguf by @johnccshen in #783
  • Add chatml chat format by @abetlen in 305482bd4156c70802fc054044119054806f4126

[0.2.11]

  • Fix bug in llama_model_params object has no attribute logits_all by @abetlen in d696251fbe40015e8616ea7a7d7ad5257fd1b896

[0.2.10]

  • Fix bug 'llama_model_params' object has no attribute 'embedding' by @abetlen in 42bb721d64d744242f9f980f2b89d5a6e335b5e4

[0.2.9]

  • Fix critical bug in pip installation of v0.2.8 due to .git directory in ac853e01e1a217a578080a4e1b851d2d08450adf

[0.2.8]

  • Update llama.cpp to ggerganov/llama.cpp@40e07a60f9ce06e79f3ccd4c903eba300fb31b5e
  • Add configurable chat formats by @abetlen in #711
  • Fix rope scaling bug by @Josh-XT in #767
  • Fix missing numa parameter in server by @abetlen in d9bce17794d0dd6f7962d10aad768fedecf3ab89

[0.2.7]

  • Update llama.cpp to ggerganov/llama.cpp@a98b1633d5a94d0aa84c7c16e1f8df5ac21fc850
  • Install required runtime dlls to package directory on windows by @abetlen in 8d75016549e2ff62a511b1119d966ffc0df5c77b
  • Add openai-processing-ms to server response header by @Tradunsky in #748
  • Bump minimum version of scikit-build-core to 0.5.1 to fix msvc cmake issue by @abetlen in 1ed0f3ebe16993a0f961155aa4b2c85f1c68f668
  • Update llama_types.py to better match the openai api, old names are aliased to new ones by @abetlen in dbca136feaaf7f8b1182c4c3c90c32918b1d0bb3

[0.2.6]

  • Update llama.cpp to 80291a1d02a07f7f66666fb576c5b1e75aa48b46

[0.2.5]

  • Fix docker images missing starlette-context dependency by @abetlen in 22917989003c5e67623d54ab45affa1e0e475410
  • Fix loading dll in Windows Isolation Containers by @abetlen in 847466562573191efa655753d9252f308c4fbdb0
  • Fix build issue on m1 macs by @abetlen in dbd3a6d1ed8416a8fd800127251e730153afa305
  • Update docs to gguf and add hw acceleration docs for server by @jasonacox in #688

[0.2.4]

  • Add NUMA support. NOTE low level api users must call llama_backend_init at the start of their programs by abetlen in f4090a0bb2a2a25acfe28d31c82cc1aa273bedee
  • Fix tensor_split server cli argument by @abetlen in c4c440ba2dc86d9de728a751311fdd1c8e3756fa
  • Made all Llama init parameters into keyword-only parameters by @abetlen in c8f9b8a734b5b040379bbd93995ba177affab1fe
  • Added server params for low_vram, main_gpu, lora_base, and lora_path by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b
  • Removed server params for rms_norm_eps and n_gqa by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b
  • Fix boolean cli options by @abetlen in c999325e8e4507f6c6249dd2fb8de7f8bf57f71e and 0449d29b9f940e437231a07b9d56550226558bac
  • Silence Pydantic Settings warnings about model_alias setting by @earonesty in #705

[0.2.3]

  • Update llama.cpp to ggerganov/llama.cpp@71ca2fad7d6c0ef95ef9944fb3a1a843e481f314
  • Add X-Request-ID request header for mirroring custom IDs by @devrimcavusoglu in #703
  • Add pyproject extra for scikit-build-core to ensure compatible pathspec version by @abetlen in 6cfc54284b99ef1bff8193e2d5e483dbd89ada02
  • Fix issue with Literal and Optional cli arguments not working by @abetlen in #702

[0.2.2]

  • Fix bug in pip install of v0.2.1 due to scikit-build-core removing all .metal files in the source distribution (see #701)

[0.2.1]

  • Fix bug in pip install of v0.2.0 due to .git folder being included in the source distribution (see #701)

[0.2.0]

  • Migrated to scikit-build-core build system by @abetlen in #499
  • Use numpy views for LogitsProcessor and StoppingCriteria instead of python lists by @abetlen in #499
  • Drop support for end-of-life Python3.7 by @abetlen in #499
  • Convert low level llama.cpp constants to use basic python types instead of ctypes types by @abetlen in #499

[0.1.85]

  • Add llama_cpp.__version__ attribute by @janvdp in #684
  • Fix low level api examples by @jbochi in #680

[0.1.84]

  • Update llama.cpp

[0.1.83]

  • Update llama.cpp

[0.1.82]

  • Update llama.cpp

[0.1.81]

  • Update llama.cpp

[0.1.80]

  • Update llama.cpp

[0.1.79]

  • GGUF Support (breaking change requiring new model format)

[0.1.78]

  • Grammar based sampling via LlamaGrammar which can be passed to completions
  • Make n_gpu_layers == -1 offload all layers

[0.1.77]

  • (llama.cpp) Update llama.cpp add support for LLaMa 2 70B
  • (server) Add temporary n_gqa and rms_norm_eps parameters required for LLaMa 2 70B

[0.1.76]

  • (llama.cpp) Update llama.cpp add support for LLaMa 2 70B

[0.1.75]

  • Update llama.cpp

[0.1.74]

  • (server) OpenAI style error responses

[0.1.73]

  • (server) Add rope parameters to server settings

[0.1.72]

  • (llama.cpp) Update llama.cpp added custom_rope for extended context lengths

[0.1.71]

  • (llama.cpp) Update llama.cpp

  • (server) Fix several pydantic v2 migration bugs

[0.1.70]

  • (Llama.create_completion) Revert change so that max_tokens is not truncated to context_size in create_completion
  • (server) Fixed changed settings field names from pydantic v2 migration

[0.1.69]

  • (server) Streaming requests can are now interrupted pre-maturely when a concurrent request is made. Can be controlled with the interrupt_requests setting.
  • (server) Moved to fastapi v0.100.0 and pydantic v2
  • (docker) Added a new "simple" image that builds llama.cpp from source when started.
  • (server) performance improvements by avoiding unnecessary memory allocations during sampling

[0.1.68]

  • (llama.cpp) Update llama.cpp

[0.1.67]

  • Fix performance bug in Llama model by pre-allocating memory tokens and logits.
  • Fix bug in Llama model where the model was not free'd after use.

[0.1.66]

  • (llama.cpp) New model API

  • Performance issue during eval caused by looped np.concatenate call

  • State pickling issue when saving cache to disk

[0.1.65]

  • (llama.cpp) Fix struct misalignment bug

[0.1.64]

  • (llama.cpp) Update llama.cpp
  • Fix docs for seed. Set -1 for random.

[0.1.63]

  • (llama.cpp) Add full gpu utilisation in CUDA
  • (llama.cpp) Add get_vocab
  • (llama.cpp) Add low_vram parameter
  • (server) Add logit_bias parameter

[0.1.62]

  • Metal support working
  • Cache re-enabled

[0.1.61]

  • Fix broken pip installation

[0.1.60]

NOTE: This release was deleted due to a bug with the packaging system that caused pip installations to fail.

  • Truncate max_tokens in create_completion so requested tokens doesn't exceed context size.
  • Temporarily disable cache for completion requests

[v0.1.59]

  • (llama.cpp) k-quants support
  • (server) mirostat sampling parameters to server
  • Support both .so and .dylib for libllama on MacOS

[v0.1.58]

  • (llama.cpp) Metal Silicon support

[v0.1.57]

  • (llama.cpp) OpenLlama 3B support

[v0.1.56]

  • (misc) Added first version of the changelog
  • (server) Use async routes
  • (python-api) Use numpy for internal buffers to reduce memory usage and improve performance.
  • (python-api) Performance bug in stop sequence check slowing down streaming.