Safe Rust bindings to llama.cpp, tracking upstream closely.
| Crate | Description | crates.io |
|---|---|---|
llama-cpp-4 |
Safe high-level API | |
llama-cpp-sys-4 |
Raw bindgen bindings |
llama.cpp version: c30e01225 (April 2026) — includes TurboQuant (PR #21038)
| Package name | Directory | Description |
|---|---|---|
simple |
examples/simple/ |
Single-turn text completion from CLI or Hugging Face |
chat |
examples/chat/ |
Interactive multi-turn chat REPL |
embeddings |
examples/embeddings/ |
Batch embedding with cosine similarity |
split-model-example |
examples/split_model/ |
Load sharded / split GGUF files |
openai-server |
examples/server/ |
OpenAI-compatible HTTP server with streaming and tool calling |
mtmd |
examples/mtmd/ |
Multimodal (vision / audio) inference (requires --features mtmd) |
quantize |
examples/quantize/ |
Quantize a GGUF model with full typed API |
turbo-quant |
examples/turbo-quant/ |
TurboQuant demo — compare attn rotation on/off |
incremental-chat |
examples/incremental-chat/ |
Chat with incremental prefill — processes tokens while you type |
git clone --recursive https://github.com/eugenehp/llama-cpp-rs
cd llama-cpp-rscargo run -p chat -- \
hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf# Starts on http://127.0.0.1:8080
cargo run -p openai-server -- \
hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.ggufllama-cpp-sys-4 can consume precompiled llama/ggml libraries via env vars.
This is useful for CI pipelines that publish native artifacts once and reuse
them in downstream repos (for example, speeding up a separate app build).
# Directory containing prebuilt libs in one of:
# <dir>, <dir>/lib, <dir>/lib64, <dir>/bin
export LLAMA_PREBUILT_DIR=/path/to/prebuilt
# Optional: force dynamic linking mode for prebuilt artifacts.
# Defaults to the crate's normal link mode for the active feature set.
# export LLAMA_PREBUILT_SHARED=1
cargo build -p your-app --features "q1,vulkan"Notes:
q1compatibility is determined by the prebuilt artifact itself — publish separate artifacts per feature/backend tuple (q1+vulkan,q1+metal, ...).build.rsstill generates Rust bindings, but skips the expensive CMake compile whenLLAMA_PREBUILT_DIRis set.
Backend feature coverage (practical targets):
metal→ macOS (Apple Silicon and Intel Macs)vulkan→ Linux/Windows (cross-vendor desktop GPUs)webgpu→ Linux/Windows (experimental; requires Dawn/WebGPU-native stack)cuda→ Linux/Windows with NVIDIA CUDA toolkit (experimental in CI)hip→ Linux ROCm/HIP environments (experimental in CI)opencl→ Linux/Windows with OpenCL SDK/runtime (experimental in CI)blas→ CPU acceleration (Linux/macOS/Windows)
# Chat completion
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello!"}], "max_tokens":128}'
# Streaming
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Count to 5"}], "stream":true}'
# Embeddings
curl http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": ["Hello world", "Bonjour le monde"]}'use llama_cpp_4::{
llama_backend::LlamaBackend,
llama_batch::LlamaBatch,
model::{params::LlamaModelParams, AddBos, LlamaModel, Special},
context::params::LlamaContextParams,
sampling::LlamaSampler,
};
use std::num::NonZeroU32;
let backend = LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, "model.gguf", &LlamaModelParams::default())?;
let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;
let tokens = model.str_to_token("Hello, world!", AddBos::Always)?;
let mut batch = LlamaBatch::new(512, 1);
for (i, &tok) in tokens.iter().enumerate() {
batch.add(tok, i as i32, &[0], i == tokens.len() - 1)?;
}
ctx.decode(&mut batch)?;
let sampler = LlamaSampler::chain_simple([LlamaSampler::greedy()]);
// ... decode loopThe llama_cpp_4::quantize module provides a fully typed Rust API for all
quantization options.
use llama_cpp_4::quantize::{GgmlType, LlamaFtype, QuantizeParams, TensorTypeOverride};
// Basic — quantize to Q4_K_M
let params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)
.with_nthread(8)
.with_quantize_output_tensor(true);
llama_cpp_4::model_quantize("model-f16.gguf", "model-q4km.gguf", ¶ms).unwrap();
// Advanced — keep output tensor in F16, prune layers 28-31
let params = QuantizeParams::new(LlamaFtype::MostlyQ5KM)
.with_tensor_type_override(TensorTypeOverride::new("output", GgmlType::F16).unwrap())
.with_pruned_layers(28..=31);
llama_cpp_4::model_quantize("model-f16.gguf", "model-q5km-pruned.gguf", ¶ms).unwrap();From the CLI:
# List all available quantization types
cargo run -p quantize -- --list-types
# Quantize with auto output name
cargo run -p quantize -- model-f16.gguf Q4_K_M
# Override a specific tensor type
cargo run -p quantize -- --tensor-type output=F16 model-f16.gguf Q5_K_M
# Dry-run: show size without writing
cargo run -p quantize -- --dry-run model-f16.gguf Q4_K_MTurboQuant (llama.cpp PR #21038) applies a Hadamard rotation to the Q, K, and V tensors before they are stored in the KV cache.
Attention activations have large outlier values on some dimensions that make quantization hard. The rotation spreads these outliers evenly so the KV cache can be stored in aggressive formats (Q4_0, Q5_0) with drastically less quality loss:
| KV cache type | Without TurboQuant | With TurboQuant | VRAM vs F16 |
|---|---|---|---|
| F16 (baseline) | — | — | 100% |
| Q8_0 | +0.003 PPL | +0.003 PPL | 53% |
| Q5_1 | +61.70 PPL | +0.44 PPL | 37% |
| Q5_0 | +17.28 PPL | +0.55 PPL | 34% |
| Q4_1 | +212.5 PPL | +8.65 PPL | 31% |
| Q4_0 | +62.02 PPL | +32.6 PPL | 28% |
PPL delta vs F16 baseline on Qwen3 0.6B BF16 — source: llama.cpp PR #21038.
Numbers below come from a benchmark run against Qwen2.5-0.5B-Instruct
(24 layers, 2 KV heads, 64 head-dim), obtained by calling ggml_row_size()
directly against the compiled GGML library in this repo's build tree.
Model : Qwen2.5-0.5B-Instruct (24 layers, 2 KV heads, 64 head-dim)
Config B/row B/elem KV @2K KV @32K Saved@32K Ratio
-------------------- ------ ------ --------- ---------- --------- -----
F16 (baseline) 128 2.0000 24.00 MB 384.00 MB — 1.00x
Q8_0 + TurboQuant 68 1.0625 12.75 MB 204.00 MB 180.0 MB 1.88x
Q5_1 + TurboQuant 48 0.7500 9.00 MB 144.00 MB 240.0 MB 2.67x
Q5_0 + TurboQuant 44 0.6875 8.25 MB 132.00 MB 252.0 MB 2.91x ← sweet spot
Q4_1 + TurboQuant 40 0.6250 7.50 MB 120.00 MB 264.0 MB 3.20x
Q4_0 + TurboQuant 36 0.5625 6.75 MB 108.00 MB 276.0 MB 3.56x
The ratios are pure GGML block geometry and scale identically to larger models — for a 7B model (32 layers, 8 KV heads, 128 head-dim) multiply every MB figure by ~85×; the ratios and % savings are the same.
- 2.91× smaller KV cache than vanilla F16 (saves 252 MB per 32 K context window on the 0.5B model, ~21 GB on a 70B model at 32 K ctx)
- Only +0.55 PPL delta — essentially indistinguishable from F16 in practice
- The same Q5_0 without TurboQuant gives +17.28 PPL (noticeably wrong output)
- Q8_0 is the conservative zero-risk choice (1.88×, near-zero PPL cost)
- Q4_0 gives maximum compression (3.56×) at the price of measurable but tolerable quality loss with rotation on
- Enabled automatically for any model whose head dimension is a power of two (covers essentially all modern transformers).
- No GGUF changes required — it is a runtime transform of the KV cache only.
- Reversible — the rotation is applied before storing and reversed before computing attention, so results are mathematically identical to F16.
- Controlled via the
LLAMA_ATTN_ROT_DISABLEenv var — set to1to opt out.
use llama_cpp_4::context::params::LlamaContextParams;
use llama_cpp_4::quantize::GgmlType;
// TurboQuant is ON by default — just set a quantized KV cache type:
let ctx_params = LlamaContextParams::default()
.with_cache_type_k(GgmlType::Q5_0) // ~31% of F16 VRAM
.with_cache_type_v(GgmlType::Q5_0); // quality ≈ F16 thanks to rotation
let ctx = model.new_context(&backend, ctx_params)?;// Disable rotation for a single context (e.g. benchmarking baseline):
let ctx_params = LlamaContextParams::default()
.with_cache_type_k(GgmlType::Q5_0)
.with_attn_rot_disabled(true); // ← TurboQuant OFF for this context
let ctx = model.new_context(&backend, ctx_params)?;// Global process-level toggle (call before creating any context):
use llama_cpp_4::quantize::{attn_rot_disabled, set_attn_rot_disabled};
set_attn_rot_disabled(true);
assert!(attn_rot_disabled());
set_attn_rot_disabled(false); // restore# API reference + PPL table (no model required)
cargo run -p turbo-quant -- --show-api
# Run both passes and compare outputs directly
cargo run -p turbo-quant -- \
--model model.gguf \
--kv-type q5_0 \
--prompt "The capital of France is" \
--n-predict 16The incremental-chat example demonstrates incremental prefill — decoding
prompt tokens into the KV cache while the user is still typing, so that
generation starts almost instantly when they press Enter.
- Incremental prefill — tokens decoded into the KV cache as you type
- BPE-stable margin — withholds the last 2 tokens to avoid decode→invalidate churn (saves ~55% total compute)
- Chat template — proper formatting via
apply_chat_template - Cached history prefix — only the new user message is re-tokenized, not the entire conversation
- Conversation history — KV cache persisted across turns with sliding-window eviction
- System prompt — prefilled once at startup, never re-processed
- Full cursor-based editor — arrow keys, Home/End, insert/delete at any position
- Multi-line input — Alt+Enter for newlines
- Line editing — Ctrl+W (word), Ctrl+U (clear), Ctrl+K (kill to end)
- Editing prefilled text — mid-line edits invalidate only from the divergence point
- Ctrl-C to cancel generation mid-stream (press twice while typing to quit)
- Performance stats — TTFT and tok/s displayed after each response
- Graceful overflow — messages exceeding context are truncated with a warning
- Stale message draining — only the latest input change is processed
- Terminal cleanup on panic via a custom panic hook
- Comprehensive benchmark — 6 dimensions: latency, speed, load, precision, UX, DX
- The system prompt is prefilled once at startup and kept across turns.
- As the user types, the current input is periodically tokenized (debounced).
- New tokens beyond the KV cache are decoded in small batches, withholding the last 2 tokens to avoid BPE churn (adding a character can change the last 1–2 tokens retroactively — the margin prevents wasted decode cycles).
- If the user deletes or changes text, the KV cache is trimmed from the divergence point — only the invalidated suffix is re-processed.
- When the user presses Enter, the remaining tokens (including the withheld tail) are flushed and generation begins immediately.
- Conversation history stays in the KV cache. When the context fills up, the oldest turns are evicted (sliding window) rather than clearing everything.
Measured on Qwen2.5-0.5B-Instruct Q4_K_M (Apple Silicon, CPU-only).
Run the full benchmark: cargo run --release -p incremental-chat --bin incremental-bench -- model.gguf
Generate charts: cargo run -p incremental-chat --bin incremental-charts
2. Speed — 167 tok/s generation throughput (32 tokens in 191ms)
4. Precision — incremental prefill produces identical first token to normal prefill (✔ ALL MATCH)
6. DX — 3-method API (new/prefill_speculative/flush), pure userspace pattern, ~130 lines of shared code
Generated 64 tokens and compared output to F16 baseline:
| Config | Diverges at | Quality | Output sample |
|---|---|---|---|
| F16 (baseline) | — | — | "Rust and C++ are both popular programming languages..." |
| Q8_0 + TurboQuant | char 195 | near-identical | Same as F16 for ~195 chars |
| Q5_0 + TurboQuant | char 24 | coherent | "...both high-level programming languages..." |
| Q4_0 + TurboQuant | char 24 | coherent | "...different approaches to memory management..." |
| Q5_0 no TurboQuant | char 2 | ⚠ degraded | "The following of the following of the of the..." |
| Q4_0 no TurboQuant | char 2 | ⚠ degraded | "The term 'in terms of memory safety...is a programming language..." |
TurboQuant makes quantized KV cache usable. Without it, Q5_0/Q4_0 produce degenerate output (diverges at char 2). With it, Q5_0 produces coherent text that diverges only in wording, while Q8_0 is near-identical to F16. See also the TurboQuant section for PPL and VRAM numbers.
9 sampler configurations tested with seed=42 for reproducibility:
| Sampler | Gen (48t) | tok/s | Output style |
|---|---|---|---|
| greedy (t=0) | 306 ms | 157 | Deterministic, factual |
| temp=0.1 top_k=40 | 306 ms | 157 | Nearly identical to greedy |
| temp=0.4 top_p=0.9 | 330 ms | 145 | Slight variation, still focused |
| temp=0.7 top_p=0.9 | 335 ms | 143 | Creative, writes actual haiku |
| temp=1.0 top_p=0.95 | 413 ms | 116 | More diverse, poetic |
| temp=1.5 top_k=50 | 307 ms | 156 | Wild — mixes English and Chinese |
| min_p=0.05 t=0.7 | 310 ms | 155 | Focused, similar to top_p |
| top_n_sigma=1.0 | 643 ms | 75 | Slower (large candidate set) |
| mirostat_v2 τ=5 | 565 ms | 85 | Adaptive, poetic output |
Key findings:
- Same seed produces identical output across runs (deterministic)
- Greedy/low-temp are fastest (~157 tok/s), mirostat/sigma slowest (~75-85 tok/s)
temp=0.7 + top_p=0.9is the sweet spot for creative tasksmin_pis a fast alternative totop_pwith similar quality
# Interactive chat with live prefill
cargo run --release -p incremental-chat -- local model.gguf
# Quantized KV cache with TurboQuant (saves VRAM, near-zero quality loss)
cargo run --release -p incremental-chat -- --kv-type q5_0 local model.gguf
# Without TurboQuant (for comparison)
cargo run --release -p incremental-chat -- --kv-type q5_0 --no-turbo-quant local model.gguf
# Cache the system prompt session to disk (instant restart)
cargo run --release -p incremental-chat -- --session-cache sys.session local model.gguf
# Custom system prompt, debounce, and sliding window
cargo run --release -p incremental-chat -- \
--system-prompt "You are a pirate. Respond only in pirate speak." \
--debounce-ms 100 --keep-turns 4 \
local model.gguf
# Run the comprehensive benchmark (7 dimensions)
cargo run --release -p incremental-chat --bin incremental-bench -- model.ggufThe key building blocks from llama-cpp-4:
use llama_cpp_4::llama_batch::LlamaBatch;
use llama_cpp_4::token::LlamaToken;
// Decode only new tokens (the delta) into the KV cache.
// Withhold the last 2 tokens — BPE can retroactively change them
// when the next character is typed.
let new_tokens = model.str_to_token(&user_text, AddBos::Always)?;
let stable_end = new_tokens.len().saturating_sub(2); // BPE margin
let common = find_common_prefix(&cached_tokens, &new_tokens[..stable_end]);
// Trim cache if the user edited earlier text
if common < cached_tokens.len() {
ctx.clear_kv_cache_seq(Some(0), Some(common as u32), None)?;
}
// Decode only the genuinely new stable tokens
let mut batch = LlamaBatch::new(512, 1);
for (i, &token) in new_tokens[common..stable_end].iter().enumerate() {
let pos = (common + i) as i32;
batch.add(token, pos, &[0], i == stable_end - common - 1)?;
}
ctx.decode(&mut batch)?;
// When user presses Enter: flush ALL tokens (including the tail)| Feature | Hardware | Flag |
|---|---|---|
cuda |
NVIDIA (CUDA) | --features cuda |
metal |
Apple Silicon | --features metal |
vulkan |
AMD / Intel / cross-platform | --features vulkan |
native |
CPU with AVX2/NEON auto-detect | --features native |
openmp |
Multi-core CPU (default on) | --features openmp |
rpc |
Remote compute backend | --features rpc |
# Metal (macOS)
cargo run -p openai-server --features metal -- --n-gpu-layers 99 \
local model.gguf
# CUDA (Linux/Windows)
cargo run -p openai-server --features cuda -- --n-gpu-layers 99 \
local model.gguf
# Vulkan (cross-platform)
cargo run -p openai-server --features vulkan -- --n-gpu-layers 99 \
hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.ggufAll examples and the server accept a hf-model <repo> [quant] subcommand
that downloads models from the Hub (cached in ~/.cache/huggingface/).
# Interactive quant picker for repos with many options
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF
# Select by quant name (downloads all shards automatically)
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF Q4_K_M
# Exact filename
cargo run -p openai-server -- \
hf-model TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.ggufSet HUGGING_FACE_HUB_TOKEN for gated models.
# Clone with submodules (llama.cpp is a submodule of llama-cpp-sys-4)
git clone --recursive https://github.com/eugenehp/llama-cpp-rs
# Or after cloning without --recursive
git submodule update --init --recursive
# Build everything
cargo build
# Run all unit tests (no model required)
cargo test
# Run server unit tests specifically
cargo test -p openai-servercd llama-cpp-sys-4/llama.cpp
git fetch origin master
git checkout origin/master # or a specific commit/tag
cd ../..
cargo build # build.rs regenerates bindings automaticallycargo run -p openai-server --features mtmd --release -- \
hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0Or with an explicit mmproj path:
cargo run -p openai-server --features mtmd -- \
--mmproj mmproj-BF16.gguf \
hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0cargo run --features mtmd -p mtmd -- \
--model /path/to/model.gguf \
--mmproj /path/to/mmproj.gguf \
--image /path/to/image.jpg \
--prompt "Describe this image."Originally derived from llama-cpp-2 — thanks to those contributors.
See also bitnet-cpp-rs for highly-quantized BitNet model support.
@software{hauptmann2025llamacpprs,
author = {Hauptmann, Eugene},
title = {{llama-cpp-4}: llama-cpp {Rust} wrapper},
year = {2025},
version = {0.2.18},
url = {https://github.com/eugenehp/llama-cpp-rs},
}This project is licensed under the MIT License.
© 2025-2026, Eugene Hauptmann





