Skip to content

eugenehp/llama-cpp-rs

Repository files navigation

🦙 llama-cpp-rs

Crates.io docs.rs License

Safe Rust bindings to llama.cpp, tracking upstream closely.

Crate Description crates.io
llama-cpp-4 Safe high-level API
llama-cpp-sys-4 Raw bindgen bindings

llama.cpp version: c30e01225 (April 2026) — includes TurboQuant (PR #21038)


Examples

Package name Directory Description
simple examples/simple/ Single-turn text completion from CLI or Hugging Face
chat examples/chat/ Interactive multi-turn chat REPL
embeddings examples/embeddings/ Batch embedding with cosine similarity
split-model-example examples/split_model/ Load sharded / split GGUF files
openai-server examples/server/ OpenAI-compatible HTTP server with streaming and tool calling
mtmd examples/mtmd/ Multimodal (vision / audio) inference (requires --features mtmd)
quantize examples/quantize/ Quantize a GGUF model with full typed API
turbo-quant examples/turbo-quant/ TurboQuant demo — compare attn rotation on/off
incremental-chat examples/incremental-chat/ Chat with incremental prefill — processes tokens while you type

Quick start

git clone --recursive https://github.com/eugenehp/llama-cpp-rs
cd llama-cpp-rs

Interactive chat

cargo run -p chat -- \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

OpenAI-compatible server

# Starts on http://127.0.0.1:8080
cargo run -p openai-server -- \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

Using prebuilt native libraries (skip CMake compile)

llama-cpp-sys-4 can consume precompiled llama/ggml libraries via env vars. This is useful for CI pipelines that publish native artifacts once and reuse them in downstream repos (for example, speeding up a separate app build).

# Directory containing prebuilt libs in one of:
#   <dir>, <dir>/lib, <dir>/lib64, <dir>/bin
export LLAMA_PREBUILT_DIR=/path/to/prebuilt

# Optional: force dynamic linking mode for prebuilt artifacts.
# Defaults to the crate's normal link mode for the active feature set.
# export LLAMA_PREBUILT_SHARED=1

cargo build -p your-app --features "q1,vulkan"

Notes:

  • q1 compatibility is determined by the prebuilt artifact itself — publish separate artifacts per feature/backend tuple (q1+vulkan, q1+metal, ...).
  • build.rs still generates Rust bindings, but skips the expensive CMake compile when LLAMA_PREBUILT_DIR is set.

Backend feature coverage (practical targets):

  • metal → macOS (Apple Silicon and Intel Macs)
  • vulkan → Linux/Windows (cross-vendor desktop GPUs)
  • webgpu → Linux/Windows (experimental; requires Dawn/WebGPU-native stack)
  • cuda → Linux/Windows with NVIDIA CUDA toolkit (experimental in CI)
  • hip → Linux ROCm/HIP environments (experimental in CI)
  • opencl → Linux/Windows with OpenCL SDK/runtime (experimental in CI)
  • blas → CPU acceleration (Linux/macOS/Windows)
# Chat completion
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}], "max_tokens":128}'

# Streaming
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Count to 5"}], "stream":true}'

# Embeddings
curl http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["Hello world", "Bonjour le monde"]}'

Text generation (library)

use llama_cpp_4::{
    llama_backend::LlamaBackend,
    llama_batch::LlamaBatch,
    model::{params::LlamaModelParams, AddBos, LlamaModel, Special},
    context::params::LlamaContextParams,
    sampling::LlamaSampler,
};
use std::num::NonZeroU32;

let backend = LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, "model.gguf", &LlamaModelParams::default())?;
let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;

let tokens = model.str_to_token("Hello, world!", AddBos::Always)?;
let mut batch = LlamaBatch::new(512, 1);
for (i, &tok) in tokens.iter().enumerate() {
    batch.add(tok, i as i32, &[0], i == tokens.len() - 1)?;
}
ctx.decode(&mut batch)?;

let sampler = LlamaSampler::chain_simple([LlamaSampler::greedy()]);
// ... decode loop

Quantization

The llama_cpp_4::quantize module provides a fully typed Rust API for all quantization options.

use llama_cpp_4::quantize::{GgmlType, LlamaFtype, QuantizeParams, TensorTypeOverride};

// Basic — quantize to Q4_K_M
let params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)
    .with_nthread(8)
    .with_quantize_output_tensor(true);

llama_cpp_4::model_quantize("model-f16.gguf", "model-q4km.gguf", &params).unwrap();

// Advanced — keep output tensor in F16, prune layers 28-31
let params = QuantizeParams::new(LlamaFtype::MostlyQ5KM)
    .with_tensor_type_override(TensorTypeOverride::new("output", GgmlType::F16).unwrap())
    .with_pruned_layers(28..=31);

llama_cpp_4::model_quantize("model-f16.gguf", "model-q5km-pruned.gguf", &params).unwrap();

From the CLI:

# List all available quantization types
cargo run -p quantize -- --list-types

# Quantize with auto output name
cargo run -p quantize -- model-f16.gguf Q4_K_M

# Override a specific tensor type
cargo run -p quantize -- --tensor-type output=F16 model-f16.gguf Q5_K_M

# Dry-run: show size without writing
cargo run -p quantize -- --dry-run model-f16.gguf Q4_K_M

TurboQuant — attention rotation

TurboQuant (llama.cpp PR #21038) applies a Hadamard rotation to the Q, K, and V tensors before they are stored in the KV cache.

Why it matters

Attention activations have large outlier values on some dimensions that make quantization hard. The rotation spreads these outliers evenly so the KV cache can be stored in aggressive formats (Q4_0, Q5_0) with drastically less quality loss:

KV cache type Without TurboQuant With TurboQuant VRAM vs F16
F16 (baseline) 100%
Q8_0 +0.003 PPL +0.003 PPL 53%
Q5_1 +61.70 PPL +0.44 PPL 37%
Q5_0 +17.28 PPL +0.55 PPL 34%
Q4_1 +212.5 PPL +8.65 PPL 31%
Q4_0 +62.02 PPL +32.6 PPL 28%

PPL delta vs F16 baseline on Qwen3 0.6B BF16 — source: llama.cpp PR #21038.

Measured KV-cache space savings

Numbers below come from a benchmark run against Qwen2.5-0.5B-Instruct (24 layers, 2 KV heads, 64 head-dim), obtained by calling ggml_row_size() directly against the compiled GGML library in this repo's build tree.

Model : Qwen2.5-0.5B-Instruct  (24 layers, 2 KV heads, 64 head-dim)

Config                 B/row  B/elem     KV @2K      KV @32K  Saved@32K  Ratio
--------------------  ------  ------  ---------  ----------  ---------  -----
F16  (baseline)          128  2.0000   24.00 MB   384.00 MB      —       1.00x
Q8_0 + TurboQuant         68  1.0625   12.75 MB   204.00 MB  180.0 MB   1.88x
Q5_1 + TurboQuant         48  0.7500    9.00 MB   144.00 MB  240.0 MB   2.67x
Q5_0 + TurboQuant         44  0.6875    8.25 MB   132.00 MB  252.0 MB   2.91x  ← sweet spot
Q4_1 + TurboQuant         40  0.6250    7.50 MB   120.00 MB  264.0 MB   3.20x
Q4_0 + TurboQuant         36  0.5625    6.75 MB   108.00 MB  276.0 MB   3.56x

The ratios are pure GGML block geometry and scale identically to larger models — for a 7B model (32 layers, 8 KV heads, 128 head-dim) multiply every MB figure by ~85×; the ratios and % savings are the same.

Sweet spot: Q5_0 + TurboQuant

  • 2.91× smaller KV cache than vanilla F16 (saves 252 MB per 32 K context window on the 0.5B model, ~21 GB on a 70B model at 32 K ctx)
  • Only +0.55 PPL delta — essentially indistinguishable from F16 in practice
  • The same Q5_0 without TurboQuant gives +17.28 PPL (noticeably wrong output)
  • Q8_0 is the conservative zero-risk choice (1.88×, near-zero PPL cost)
  • Q4_0 gives maximum compression (3.56×) at the price of measurable but tolerable quality loss with rotation on

Key properties

  • Enabled automatically for any model whose head dimension is a power of two (covers essentially all modern transformers).
  • No GGUF changes required — it is a runtime transform of the KV cache only.
  • Reversible — the rotation is applied before storing and reversed before computing attention, so results are mathematically identical to F16.
  • Controlled via the LLAMA_ATTN_ROT_DISABLE env var — set to 1 to opt out.

Using TurboQuant from Rust

use llama_cpp_4::context::params::LlamaContextParams;
use llama_cpp_4::quantize::GgmlType;

// TurboQuant is ON by default — just set a quantized KV cache type:
let ctx_params = LlamaContextParams::default()
    .with_cache_type_k(GgmlType::Q5_0)   // ~31% of F16 VRAM
    .with_cache_type_v(GgmlType::Q5_0);  // quality ≈ F16 thanks to rotation

let ctx = model.new_context(&backend, ctx_params)?;
// Disable rotation for a single context (e.g. benchmarking baseline):
let ctx_params = LlamaContextParams::default()
    .with_cache_type_k(GgmlType::Q5_0)
    .with_attn_rot_disabled(true);   // ← TurboQuant OFF for this context

let ctx = model.new_context(&backend, ctx_params)?;
// Global process-level toggle (call before creating any context):
use llama_cpp_4::quantize::{attn_rot_disabled, set_attn_rot_disabled};

set_attn_rot_disabled(true);
assert!(attn_rot_disabled());

set_attn_rot_disabled(false); // restore

Live demo

# API reference + PPL table (no model required)
cargo run -p turbo-quant -- --show-api

# Run both passes and compare outputs directly
cargo run -p turbo-quant -- \
    --model model.gguf \
    --kv-type q5_0 \
    --prompt "The capital of France is" \
    --n-predict 16

Incremental prefill

The incremental-chat example demonstrates incremental prefill — decoding prompt tokens into the KV cache while the user is still typing, so that generation starts almost instantly when they press Enter.

Features

  • Incremental prefill — tokens decoded into the KV cache as you type
  • BPE-stable margin — withholds the last 2 tokens to avoid decode→invalidate churn (saves ~55% total compute)
  • Chat template — proper formatting via apply_chat_template
  • Cached history prefix — only the new user message is re-tokenized, not the entire conversation
  • Conversation history — KV cache persisted across turns with sliding-window eviction
  • System prompt — prefilled once at startup, never re-processed
  • Full cursor-based editor — arrow keys, Home/End, insert/delete at any position
  • Multi-line input — Alt+Enter for newlines
  • Line editing — Ctrl+W (word), Ctrl+U (clear), Ctrl+K (kill to end)
  • Editing prefilled text — mid-line edits invalidate only from the divergence point
  • Ctrl-C to cancel generation mid-stream (press twice while typing to quit)
  • Performance stats — TTFT and tok/s displayed after each response
  • Graceful overflow — messages exceeding context are truncated with a warning
  • Stale message draining — only the latest input change is processed
  • Terminal cleanup on panic via a custom panic hook
  • Comprehensive benchmark — 6 dimensions: latency, speed, load, precision, UX, DX

How it works

  1. The system prompt is prefilled once at startup and kept across turns.
  2. As the user types, the current input is periodically tokenized (debounced).
  3. New tokens beyond the KV cache are decoded in small batches, withholding the last 2 tokens to avoid BPE churn (adding a character can change the last 1–2 tokens retroactively — the margin prevents wasted decode cycles).
  4. If the user deletes or changes text, the KV cache is trimmed from the divergence point — only the invalidated suffix is re-processed.
  5. When the user presses Enter, the remaining tokens (including the withheld tail) are flushed and generation begins immediately.
  6. Conversation history stays in the KV cache. When the context fills up, the oldest turns are evicted (sliding window) rather than clearing everything.

Benchmark results

Measured on Qwen2.5-0.5B-Instruct Q4_K_M (Apple Silicon, CPU-only). Run the full benchmark: cargo run --release -p incremental-chat --bin incremental-bench -- model.gguf Generate charts: cargo run -p incremental-chat --bin incremental-charts

1. Latency — normal vs incremental flush at Enter

Perceived speedup

2. Speed — 167 tok/s generation throughput (32 tokens in 191ms)

3. GPU Load — BPE margin saves 40–59% total compute vs naive

4. Precision — incremental prefill produces identical first token to normal prefill (✔ ALL MATCH)

5. UX — mid-line edit recovery cost

6. DX — 3-method API (new/prefill_speculative/flush), pure userspace pattern, ~130 lines of shared code

7. KV Cache Quantization + TurboQuant

Generated 64 tokens and compared output to F16 baseline:

Config Diverges at Quality Output sample
F16 (baseline) "Rust and C++ are both popular programming languages..."
Q8_0 + TurboQuant char 195 near-identical Same as F16 for ~195 chars
Q5_0 + TurboQuant char 24 coherent "...both high-level programming languages..."
Q4_0 + TurboQuant char 24 coherent "...different approaches to memory management..."
Q5_0 no TurboQuant char 2 ⚠ degraded "The following of the following of the of the..."
Q4_0 no TurboQuant char 2 ⚠ degraded "The term 'in terms of memory safety...is a programming language..."

TurboQuant makes quantized KV cache usable. Without it, Q5_0/Q4_0 produce degenerate output (diverges at char 2). With it, Q5_0 produces coherent text that diverges only in wording, while Q8_0 is near-identical to F16. See also the TurboQuant section for PPL and VRAM numbers.

8. Samplers & Temperature

9 sampler configurations tested with seed=42 for reproducibility:

Sampler Gen (48t) tok/s Output style
greedy (t=0) 306 ms 157 Deterministic, factual
temp=0.1 top_k=40 306 ms 157 Nearly identical to greedy
temp=0.4 top_p=0.9 330 ms 145 Slight variation, still focused
temp=0.7 top_p=0.9 335 ms 143 Creative, writes actual haiku
temp=1.0 top_p=0.95 413 ms 116 More diverse, poetic
temp=1.5 top_k=50 307 ms 156 Wild — mixes English and Chinese
min_p=0.05 t=0.7 310 ms 155 Focused, similar to top_p
top_n_sigma=1.0 643 ms 75 Slower (large candidate set)
mirostat_v2 τ=5 565 ms 85 Adaptive, poetic output

Key findings:

  • Same seed produces identical output across runs (deterministic)
  • Greedy/low-temp are fastest (~157 tok/s), mirostat/sigma slowest (~75-85 tok/s)
  • temp=0.7 + top_p=0.9 is the sweet spot for creative tasks
  • min_p is a fast alternative to top_p with similar quality

Usage

# Interactive chat with live prefill
cargo run --release -p incremental-chat -- local model.gguf

# Quantized KV cache with TurboQuant (saves VRAM, near-zero quality loss)
cargo run --release -p incremental-chat -- --kv-type q5_0 local model.gguf

# Without TurboQuant (for comparison)
cargo run --release -p incremental-chat -- --kv-type q5_0 --no-turbo-quant local model.gguf

# Cache the system prompt session to disk (instant restart)
cargo run --release -p incremental-chat -- --session-cache sys.session local model.gguf

# Custom system prompt, debounce, and sliding window
cargo run --release -p incremental-chat -- \
    --system-prompt "You are a pirate. Respond only in pirate speak." \
    --debounce-ms 100 --keep-turns 4 \
    local model.gguf

# Run the comprehensive benchmark (7 dimensions)
cargo run --release -p incremental-chat --bin incremental-bench -- model.gguf

Using the incremental prefill API

The key building blocks from llama-cpp-4:

use llama_cpp_4::llama_batch::LlamaBatch;
use llama_cpp_4::token::LlamaToken;

// Decode only new tokens (the delta) into the KV cache.
// Withhold the last 2 tokens — BPE can retroactively change them
// when the next character is typed.
let new_tokens = model.str_to_token(&user_text, AddBos::Always)?;
let stable_end = new_tokens.len().saturating_sub(2); // BPE margin
let common = find_common_prefix(&cached_tokens, &new_tokens[..stable_end]);

// Trim cache if the user edited earlier text
if common < cached_tokens.len() {
    ctx.clear_kv_cache_seq(Some(0), Some(common as u32), None)?;
}

// Decode only the genuinely new stable tokens
let mut batch = LlamaBatch::new(512, 1);
for (i, &token) in new_tokens[common..stable_end].iter().enumerate() {
    let pos = (common + i) as i32;
    batch.add(token, pos, &[0], i == stable_end - common - 1)?;
}
ctx.decode(&mut batch)?;

// When user presses Enter: flush ALL tokens (including the tail)

GPU acceleration

Feature Hardware Flag
cuda NVIDIA (CUDA) --features cuda
metal Apple Silicon --features metal
vulkan AMD / Intel / cross-platform --features vulkan
native CPU with AVX2/NEON auto-detect --features native
openmp Multi-core CPU (default on) --features openmp
rpc Remote compute backend --features rpc
# Metal (macOS)
cargo run -p openai-server --features metal -- --n-gpu-layers 99 \
    local model.gguf

# CUDA (Linux/Windows)
cargo run -p openai-server --features cuda -- --n-gpu-layers 99 \
    local model.gguf

# Vulkan (cross-platform)
cargo run -p openai-server --features vulkan -- --n-gpu-layers 99 \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

Hugging Face model download

All examples and the server accept a hf-model <repo> [quant] subcommand that downloads models from the Hub (cached in ~/.cache/huggingface/).

# Interactive quant picker for repos with many options
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF

# Select by quant name (downloads all shards automatically)
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF Q4_K_M

# Exact filename
cargo run -p openai-server -- \
    hf-model TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf

Set HUGGING_FACE_HUB_TOKEN for gated models.


Development

# Clone with submodules (llama.cpp is a submodule of llama-cpp-sys-4)
git clone --recursive https://github.com/eugenehp/llama-cpp-rs

# Or after cloning without --recursive
git submodule update --init --recursive

# Build everything
cargo build

# Run all unit tests (no model required)
cargo test

# Run server unit tests specifically
cargo test -p openai-server

Updating llama.cpp

cd llama-cpp-sys-4/llama.cpp
git fetch origin master
git checkout origin/master  # or a specific commit/tag
cd ../..
cargo build          # build.rs regenerates bindings automatically

Multimodal Images

Via the OpenAI-compatible server

cargo run -p openai-server --features mtmd --release -- \
    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Or with an explicit mmproj path:

cargo run -p openai-server --features mtmd -- \
    --mmproj mmproj-BF16.gguf \
    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Standalone multimodal example

cargo run --features mtmd -p mtmd -- \
    --model /path/to/model.gguf \
    --mmproj /path/to/mmproj.gguf \
    --image /path/to/image.jpg \
    --prompt "Describe this image."

Credits

Originally derived from llama-cpp-2 — thanks to those contributors.
See also bitnet-cpp-rs for highly-quantized BitNet model support.

Citation

@software{hauptmann2025llamacpprs,
  author    = {Hauptmann, Eugene},
  title     = {{llama-cpp-4}: llama-cpp {Rust} wrapper},
  year      = {2025},
  version   = {0.2.18},
  url       = {https://github.com/eugenehp/llama-cpp-rs},
}

License

This project is licensed under the MIT License.

Copyright

© 2025-2026, Eugene Hauptmann

About

A wrapper around the llama-cpp library for rust, including new Sampler API from llama-cpp.

Topics

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages