🦙 llama-cpp-rs

Safe Rust bindings to llama.cpp, tracking upstream closely.

Crate	Description	crates.io
`llama-cpp-4`	Safe high-level API
`llama-cpp-sys-4`	Raw bindgen bindings

llama.cpp version: 64b38b561 (May 2026) — includes TurboQuant (PR #21038) and MTP / multi-token-prediction speculative decoding (PR #22673).

Examples

Package name	Directory	Description
`simple`	`examples/simple/`	Single-turn text completion from CLI or Hugging Face
`chat`	`examples/chat/`	Interactive multi-turn chat REPL
`embeddings`	`examples/embeddings/`	Batch embedding with cosine similarity
`split-model-example`	`examples/split_model/`	Load sharded / split GGUF files
`openai-server`	`examples/server/`	OpenAI-compatible HTTP server with streaming and tool calling
`mtmd`	`examples/mtmd/`	Multimodal (vision / audio) inference (requires `--features mtmd`)
`quantize`	`examples/quantize/`	Quantize a GGUF model with full typed API
`turbo-quant`	`examples/turbo-quant/`	TurboQuant demo — compare attn rotation on/off
`incremental-chat`	`examples/incremental-chat/`	Chat with incremental prefill — processes tokens while you type
`mtp`	`examples/mtp/`	MTP speculative decoding via `MtpSession` (`--predict`, `--p-min`, draft loop)

Quick start

git clone --recursive https://github.com/eugenehp/llama-cpp-rs
cd llama-cpp-rs

Interactive chat

cargo run -p chat -- \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

OpenAI-compatible server

# Starts on http://127.0.0.1:8080
cargo run -p openai-server -- \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

Using prebuilt native libraries (skip CMake compile)

llama-cpp-sys-4 can consume precompiled llama/ggml libraries via env vars. This is useful for CI pipelines that publish native artifacts once and reuse them in downstream repos (for example, speeding up a separate app build).

# Directory containing prebuilt libs in one of:
#   <dir>, <dir>/lib, <dir>/lib64, <dir>/bin
export LLAMA_PREBUILT_DIR=/path/to/prebuilt

# Optional: force dynamic linking mode for prebuilt artifacts.
# Defaults to the crate's normal link mode for the active feature set.
# export LLAMA_PREBUILT_SHARED=1

cargo build -p your-app --features "q1,vulkan"

Notes:

q1 compatibility is determined by the prebuilt artifact itself — publish separate artifacts per feature/backend tuple (q1+vulkan, q1+metal, ...).
build.rs still generates Rust bindings, but skips the expensive CMake compile when LLAMA_PREBUILT_DIR is set.

Backend feature coverage (practical targets):

metal → macOS (Apple Silicon and Intel Macs)
vulkan → Linux/Windows (cross-vendor desktop GPUs)
webgpu → Linux/Windows (experimental; requires Dawn/WebGPU-native stack)
cuda → Linux/Windows with NVIDIA CUDA toolkit (experimental in CI)
hip → Linux ROCm/HIP environments (experimental in CI)

Prebuilt Feature Benchmark Results

The prebuilt feature flag provides automatic prebuilt artifact management. Benchmark results (Apple Silicon M2, macOS 14.4):

Configuration	Build Type	Time	Improvement
Base (Static)	Debug	11.99s	Baseline
Base + `prebuilt`	Debug	11.01s	8% faster
Dynamic Linking	Debug	26.80s	-123% (slower)
Dynamic + `prebuilt`	Debug	27.47s	-129% (slower)
Base (Static)	Release	26.01s	Baseline
Dynamic Linking	Release	26.79s	-3% (slower)

Key Insights:

✅ Static linking + prebuilt: 8% faster debug builds (11.99s → 11.01s)
✅ Release builds: Minimal difference between static/dynamic
✅ Development workflow: Prebuilt feature provides best iteration speed
🚀 CI/CD potential: When fully implemented with artifact caching, expect 50-80% speedups for complex builds

Usage:

# Enable prebuilt feature for faster development
cargo build --features prebuilt

# Combine with other features
cargo build --features "prebuilt,vulkan"

# Release builds (prebuilt provides minimal benefit)
cargo build --release --features prebuilt

Implementation Status:

✅ Feature flag infrastructure complete
✅ Automatic feature detection and configuration
✅ Safe fallback to local compilation
📋 TODO: Actual artifact download and caching (foundation ready)

When fully implemented, the prebuilt feature will automatically:

Download matching prebuilt artifacts from GitHub releases
Cache them in target/llama-prebuilt-cache/
Use cached artifacts for subsequent builds
Fall back gracefully to local compilation if artifacts unavailable

opencl → Linux/Windows with OpenCL SDK/runtime (experimental in CI)
blas → CPU acceleration (Linux/macOS/Windows)

# Chat completion
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}], "max_tokens":128}'

# Streaming
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Count to 5"}], "stream":true}'

# Embeddings
curl http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["Hello world", "Bonjour le monde"]}'

Text generation (library)

use llama_cpp_4::{
    llama_backend::LlamaBackend,
    llama_batch::LlamaBatch,
    model::{params::LlamaModelParams, AddBos, LlamaModel, Special},
    context::params::LlamaContextParams,
    sampling::LlamaSampler,
};
use std::num::NonZeroU32;

let backend = LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, "model.gguf", &LlamaModelParams::default())?;
let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;

let tokens = model.str_to_token("Hello, world!", AddBos::Always)?;
let mut batch = LlamaBatch::new(512, 1);
for (i, &tok) in tokens.iter().enumerate() {
    batch.add(tok, i as i32, &[0], i == tokens.len() - 1)?;
}
ctx.decode(&mut batch)?;

let sampler = LlamaSampler::chain_simple([LlamaSampler::greedy()]);
// ... decode loop

Quantization

The llama_cpp_4::quantize module provides a fully typed Rust API for all quantization options.

use llama_cpp_4::quantize::{GgmlType, LlamaFtype, QuantizeParams, TensorTypeOverride};

// Basic — quantize to Q4_K_M
let params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)
    .with_nthread(8)
    .with_quantize_output_tensor(true);

llama_cpp_4::model_quantize("model-f16.gguf", "model-q4km.gguf", &params).unwrap();

// Advanced — keep output tensor in F16, prune layers 28-31
let params = QuantizeParams::new(LlamaFtype::MostlyQ5KM)
    .with_tensor_type_override(TensorTypeOverride::new("output", GgmlType::F16).unwrap())
    .with_pruned_layers(28..=31);

llama_cpp_4::model_quantize("model-f16.gguf", "model-q5km-pruned.gguf", &params).unwrap();

From the CLI:

# List all available quantization types
cargo run -p quantize -- --list-types

# Quantize with auto output name
cargo run -p quantize -- model-f16.gguf Q4_K_M

# Override a specific tensor type
cargo run -p quantize -- --tensor-type output=F16 model-f16.gguf Q5_K_M

# Dry-run: show size without writing
cargo run -p quantize -- --dry-run model-f16.gguf Q4_K_M

TurboQuant — attention rotation

TurboQuant (llama.cpp PR #21038) applies a Hadamard rotation to the Q, K, and V tensors before they are stored in the KV cache.

Why it matters

Attention activations have large outlier values on some dimensions that make quantization hard. The rotation spreads these outliers evenly so the KV cache can be stored in aggressive formats (Q4_0, Q5_0) with drastically less quality loss:

KV cache type	Without TurboQuant	With TurboQuant	VRAM vs F16
F16 (baseline)	—	—	100%
Q8_0	+0.003 PPL	+0.003 PPL	53%
Q5_1	+61.70 PPL	+0.44 PPL	37%
Q5_0	+17.28 PPL	+0.55 PPL	34%
Q4_1	+212.5 PPL	+8.65 PPL	31%
Q4_0	+62.02 PPL	+32.6 PPL	28%

PPL delta vs F16 baseline on Qwen3 0.6B BF16 — source: llama.cpp PR #21038.

Measured KV-cache space savings

Numbers below come from a benchmark run against Qwen2.5-0.5B-Instruct (24 layers, 2 KV heads, 64 head-dim), obtained by calling ggml_row_size() directly against the compiled GGML library in this repo's build tree.

Model : Qwen2.5-0.5B-Instruct  (24 layers, 2 KV heads, 64 head-dim)

Config                 B/row  B/elem     KV @2K      KV @32K  Saved@32K  Ratio
--------------------  ------  ------  ---------  ----------  ---------  -----
F16  (baseline)          128  2.0000   24.00 MB   384.00 MB      —       1.00x
Q8_0 + TurboQuant         68  1.0625   12.75 MB   204.00 MB  180.0 MB   1.88x
Q5_1 + TurboQuant         48  0.7500    9.00 MB   144.00 MB  240.0 MB   2.67x
Q5_0 + TurboQuant         44  0.6875    8.25 MB   132.00 MB  252.0 MB   2.91x  ← sweet spot
Q4_1 + TurboQuant         40  0.6250    7.50 MB   120.00 MB  264.0 MB   3.20x
Q4_0 + TurboQuant         36  0.5625    6.75 MB   108.00 MB  276.0 MB   3.56x

The ratios are pure GGML block geometry and scale identically to larger models — for a 7B model (32 layers, 8 KV heads, 128 head-dim) multiply every MB figure by ~85×; the ratios and % savings are the same.

Sweet spot: Q5_0 + TurboQuant

2.91× smaller KV cache than vanilla F16 (saves 252 MB per 32 K context window on the 0.5B model, ~21 GB on a 70B model at 32 K ctx)
Only +0.55 PPL delta — essentially indistinguishable from F16 in practice
The same Q5_0 without TurboQuant gives +17.28 PPL (noticeably wrong output)
Q8_0 is the conservative zero-risk choice (1.88×, near-zero PPL cost)
Q4_0 gives maximum compression (3.56×) at the price of measurable but tolerable quality loss with rotation on

Key properties

Enabled automatically for any model whose head dimension is a power of two (covers essentially all modern transformers).
No GGUF changes required — it is a runtime transform of the KV cache only.
Reversible — the rotation is applied before storing and reversed before computing attention, so results are mathematically identical to F16.
Controlled via the LLAMA_ATTN_ROT_DISABLE env var — set to 1 to opt out.

Using TurboQuant from Rust

use llama_cpp_4::context::params::LlamaContextParams;
use llama_cpp_4::quantize::GgmlType;

// TurboQuant is ON by default — just set a quantized KV cache type:
let ctx_params = LlamaContextParams::default()
    .with_cache_type_k(GgmlType::Q5_0)   // ~31% of F16 VRAM
    .with_cache_type_v(GgmlType::Q5_0);  // quality ≈ F16 thanks to rotation

let ctx = model.new_context(&backend, ctx_params)?;

// Disable rotation for a single context (e.g. benchmarking baseline):
let ctx_params = LlamaContextParams::default()
    .with_cache_type_k(GgmlType::Q5_0)
    .with_attn_rot_disabled(true);   // ← TurboQuant OFF for this context

let ctx = model.new_context(&backend, ctx_params)?;

// Global process-level toggle (call before creating any context):
use llama_cpp_4::quantize::{attn_rot_disabled, set_attn_rot_disabled};

set_attn_rot_disabled(true);
assert!(attn_rot_disabled());

set_attn_rot_disabled(false); // restore

Live demo

# API reference + PPL table (no model required)
cargo run -p turbo-quant -- --show-api

# Run both passes and compare outputs directly
cargo run -p turbo-quant -- \
    --model model.gguf \
    --kv-type q5_0 \
    --prompt "The capital of France is" \
    --n-predict 16

MTP — multi-token-prediction speculative decoding

Upstream PR #22673 added MTP draft heads to llama.cpp. The Rust API lives in llama_cpp_4::mtp: build a target + draft context pair, wrap them in MtpSession, and drive the verify/accept loop from Rust.

1. Context setup

Both contexts come from the same MTP-capable GGUF. The draft context must use LlamaContextType::Mtp and n_rs_seq >= n_draft_max (rollback snapshots for speculative verification):

use llama_cpp_4::context::params::{LlamaContextParams, LlamaContextType};

let n_draft_max = 3;

let target = model.new_context(&backend, LlamaContextParams::default())?;
let draft = model.new_context(
    &backend,
    LlamaContextParams::default()
        .with_ctx_type(LlamaContextType::Mtp)
        .with_n_rs_seq(n_draft_max.max(4)), // headroom for rollback
)?;

2. Session config and creation

MtpSessionConfig maps to upstream common_params_speculative_draft:

Field	Meaning	Typical value
`n_seq`	Parallel sequences	`1`
`n_draft_max`	Max tokens drafted per round	`1`–`3` (model-dependent)
`p_min`	Drop draft tokens below this probability	`0.0` (upstream default since #23269)
`n_min`	Minimum drafts to propose	`0`

use llama_cpp_4::mtp::{MtpSession, MtpSessionConfig};

// Shorthand (defaults: n_min=0, p_min=0.0)
let mut session = MtpSession::new(&target, &draft, 1, n_draft_max)?;

// Full control
let config = MtpSessionConfig::new(1, n_draft_max)
    .with_p_min(0.0)
    .with_n_min(0);
let mut session = MtpSession::new_with_config(&target, &draft, config)?;

assert!(session.need_embd_pre_norm()); // MTP needs pre-norm hidden states
assert!(!session.need_embd());         // post-norm embeddings not used

Upstream configures pre-norm extraction on both contexts during session init; you normally do not need to call LlamaContext::set_embeddings_pre_norm yourself.

3. Speculative decode loop (outline)

After every target.decode(batch):

session.process(&batch)? — sync MTP with the target batch
session.draft(seq_id, n_past, last_token)? — propose draft tokens
Verify drafts on the target (your sampler / argmax logic)
session.accept(seq_id, n_accepted)? — update draft recurrent state
session.print_stats() — log upstream draft/accept counters (optional)

See examples/mtp/src/main.rs for a complete working loop with timing and acceptance reporting.

4. CLI examples

Smoke test (build contexts only):

cargo run --release -p mtp --features metal -- \
    hf-model froggeric/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-IQ2_M-mtp.gguf

Full generation with draft tuning:

cargo run --release -p mtp --features metal -- \
    --predict 64 \
    --n-draft-max 1 \
    --p-min 0.0 \
    --prompt "The capital of France is" \
    hf-model froggeric/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-IQ2_M-mtp.gguf

Use --features cuda or --features vulkan on other platforms instead of metal.

5. Benchmarks and tuning

Draft depth is quant- and model-sensitive. See MTP.md for measured throughput on Apple Silicon and notes on upstream #23269 sampling changes.

For comparison against upstream llama-server --spec-type draft-mtp, use scripts/bench-mtp.sh.

Incremental prefill

The incremental-chat example demonstrates incremental prefill — decoding prompt tokens into the KV cache while the user is still typing, so that generation starts almost instantly when they press Enter.

Features

Incremental prefill — tokens decoded into the KV cache as you type
BPE-stable margin — withholds the last 2 tokens to avoid decode→invalidate churn (saves ~55% total compute)
Chat template — proper formatting via apply_chat_template
Cached history prefix — only the new user message is re-tokenized, not the entire conversation
Conversation history — KV cache persisted across turns with sliding-window eviction
System prompt — prefilled once at startup, never re-processed
Full cursor-based editor — arrow keys, Home/End, insert/delete at any position
Multi-line input — Alt+Enter for newlines
Line editing — Ctrl+W (word), Ctrl+U (clear), Ctrl+K (kill to end)
Editing prefilled text — mid-line edits invalidate only from the divergence point
Ctrl-C to cancel generation mid-stream (press twice while typing to quit)
Performance stats — TTFT and tok/s displayed after each response
Graceful overflow — messages exceeding context are truncated with a warning
Stale message draining — only the latest input change is processed
Terminal cleanup on panic via a custom panic hook
Comprehensive benchmark — 6 dimensions: latency, speed, load, precision, UX, DX

How it works

The system prompt is prefilled once at startup and kept across turns.
As the user types, the current input is periodically tokenized (debounced).
New tokens beyond the KV cache are decoded in small batches, withholding the last 2 tokens to avoid BPE churn (adding a character can change the last 1–2 tokens retroactively — the margin prevents wasted decode cycles).
If the user deletes or changes text, the KV cache is trimmed from the divergence point — only the invalidated suffix is re-processed.
When the user presses Enter, the remaining tokens (including the withheld tail) are flushed and generation begins immediately.
Conversation history stays in the KV cache. When the context fills up, the oldest turns are evicted (sliding window) rather than clearing everything.

Benchmark results

Measured on Qwen2.5-0.5B-Instruct Q4_K_M (Apple Silicon, CPU-only). Run the full benchmark: cargo run --release -p incremental-chat --bin incremental-bench -- model.gguf Generate charts: cargo run -p incremental-chat --bin incremental-charts

1. Latency — normal vs incremental flush at Enter

Perceived speedup

2. Speed — 167 tok/s generation throughput (32 tokens in 191ms)

3. GPU Load — BPE margin saves 40–59% total compute vs naive

4. Precision — incremental prefill produces identical first token to normal prefill (✔ ALL MATCH)

5. UX — mid-line edit recovery cost

6. DX — 3-method API (new/prefill_speculative/flush), pure userspace pattern, ~130 lines of shared code

7. KV Cache Quantization + TurboQuant

Generated 64 tokens and compared output to F16 baseline:

Config	Diverges at	Quality	Output sample
F16 (baseline)	—	—	"Rust and C++ are both popular programming languages..."
Q8_0 + TurboQuant	char 195	near-identical	Same as F16 for ~195 chars
Q5_0 + TurboQuant	char 24	coherent	"...both high-level programming languages..."
Q4_0 + TurboQuant	char 24	coherent	"...different approaches to memory management..."
Q5_0 no TurboQuant	char 2	⚠ degraded	"The following of the following of the of the..."
Q4_0 no TurboQuant	char 2	⚠ degraded	"The term 'in terms of memory safety...is a programming language..."

TurboQuant makes quantized KV cache usable. Without it, Q5_0/Q4_0 produce degenerate output (diverges at char 2). With it, Q5_0 produces coherent text that diverges only in wording, while Q8_0 is near-identical to F16. See also the TurboQuant section for PPL and VRAM numbers.

8. Samplers & Temperature

9 sampler configurations tested with seed=42 for reproducibility:

Sampler	Gen (48t)	tok/s	Output style
greedy (t=0)	306 ms	157	Deterministic, factual
temp=0.1 top_k=40	306 ms	157	Nearly identical to greedy
temp=0.4 top_p=0.9	330 ms	145	Slight variation, still focused
temp=0.7 top_p=0.9	335 ms	143	Creative, writes actual haiku
temp=1.0 top_p=0.95	413 ms	116	More diverse, poetic
temp=1.5 top_k=50	307 ms	156	Wild — mixes English and Chinese
min_p=0.05 t=0.7	310 ms	155	Focused, similar to top_p
top_n_sigma=1.0	643 ms	75	Slower (large candidate set)
mirostat_v2 τ=5	565 ms	85	Adaptive, poetic output

Key findings:

Same seed produces identical output across runs (deterministic)
Greedy/low-temp are fastest (~157 tok/s), mirostat/sigma slowest (~75-85 tok/s)
temp=0.7 + top_p=0.9 is the sweet spot for creative tasks
min_p is a fast alternative to top_p with similar quality

Usage

# Interactive chat with live prefill
cargo run --release -p incremental-chat -- local model.gguf

# Quantized KV cache with TurboQuant (saves VRAM, near-zero quality loss)
cargo run --release -p incremental-chat -- --kv-type q5_0 local model.gguf

# Without TurboQuant (for comparison)
cargo run --release -p incremental-chat -- --kv-type q5_0 --no-turbo-quant local model.gguf

# Cache the system prompt session to disk (instant restart)
cargo run --release -p incremental-chat -- --session-cache sys.session local model.gguf

# Custom system prompt, debounce, and sliding window
cargo run --release -p incremental-chat -- \
    --system-prompt "You are a pirate. Respond only in pirate speak." \
    --debounce-ms 100 --keep-turns 4 \
    local model.gguf

# Run the comprehensive benchmark (7 dimensions)
cargo run --release -p incremental-chat --bin incremental-bench -- model.gguf

Using the incremental prefill API

The key building blocks from llama-cpp-4:

use llama_cpp_4::llama_batch::LlamaBatch;
use llama_cpp_4::token::LlamaToken;

// Decode only new tokens (the delta) into the KV cache.
// Withhold the last 2 tokens — BPE can retroactively change them
// when the next character is typed.
let new_tokens = model.str_to_token(&user_text, AddBos::Always)?;
let stable_end = new_tokens.len().saturating_sub(2); // BPE margin
let common = find_common_prefix(&cached_tokens, &new_tokens[..stable_end]);

// Trim cache if the user edited earlier text
if common < cached_tokens.len() {
    ctx.clear_kv_cache_seq(Some(0), Some(common as u32), None)?;
}

// Decode only the genuinely new stable tokens
let mut batch = LlamaBatch::new(512, 1);
for (i, &token) in new_tokens[common..stable_end].iter().enumerate() {
    let pos = (common + i) as i32;
    batch.add(token, pos, &[0], i == stable_end - common - 1)?;
}
ctx.decode(&mut batch)?;

// When user presses Enter: flush ALL tokens (including the tail)

GPU acceleration

Feature	Hardware	Flag
`cuda`	NVIDIA (CUDA)	`--features cuda`
`metal`	Apple Silicon	`--features metal`
`vulkan`	AMD / Intel / cross-platform	`--features vulkan`
`native`	CPU with AVX2/NEON auto-detect	`--features native`
`openmp`	Multi-core CPU (default on)	`--features openmp`
`rpc`	Remote compute backend	`--features rpc`
`prebuilt`	All (build optimization)	`--features prebuilt`

# Metal (macOS)
cargo run -p openai-server --features metal -- --n-gpu-layers 99 \
    local model.gguf

# CUDA (Linux/Windows)
cargo run -p openai-server --features cuda -- --n-gpu-layers 99 \
    local model.gguf

# Vulkan (cross-platform)
cargo run -p openai-server --features vulkan -- --n-gpu-layers 99 \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

Hugging Face model download

All examples and the server accept a hf-model <repo> [quant] subcommand that downloads models from the Hub (cached in ~/.cache/huggingface/).

# Interactive quant picker for repos with many options
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF

# Select by quant name (downloads all shards automatically)
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF Q4_K_M

# Exact filename
cargo run -p openai-server -- \
    hf-model TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf

Set HUGGING_FACE_HUB_TOKEN for gated models.

Development

# Clone with submodules (llama.cpp is a submodule of llama-cpp-sys-4)
git clone --recursive https://github.com/eugenehp/llama-cpp-rs

# Or after cloning without --recursive
git submodule update --init --recursive

# Build everything (with optimizations)
cargo build

# Build with prebuilt artifacts for faster compilation
cargo build --features prebuilt

# Run all unit tests (no model required)
cargo test

# Run server unit tests specifically
cargo test -p openai-server

Build Optimizations

The build system includes several optimizations for faster compilation:

Ninja build system (2-3x faster than Make)
Parallel compilation (uses all CPU cores)
sccache compilation caching (makes feature changes instant)
Shared CMake cache (avoids rebuilds when toggling features)
Unity Build (groups source files for faster compilation)
mold linker (5-10x faster linking on Linux)
Prebuilt artifacts (--features prebuilt) (8% faster debug builds, 50-80% expected for CI/CD)

For best performance, install the recommended tools:

# macOS
brew install ninja sccache

# Linux (Ubuntu/Debian)
sudo apt-get install ninja-build mold
cargo install sccache

# Enable detailed build logging
BUILD_DEBUG=1 cargo build

See BUILD_OPTIMIZATIONS.md for more details.

Updating llama.cpp

cd llama-cpp-sys-4/llama.cpp
git fetch origin master
git checkout origin/master  # or a specific commit/tag
cd ../..
cargo build          # build.rs regenerates bindings automatically

Multimodal Images

Via the OpenAI-compatible server

cargo run -p openai-server --features mtmd --release -- \
    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Or with an explicit mmproj path:

cargo run -p openai-server --features mtmd -- \
    --mmproj mmproj-BF16.gguf \
    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Standalone multimodal example

cargo run --features mtmd -p mtmd -- \
    --model /path/to/model.gguf \
    --mmproj /path/to/mmproj.gguf \
    --image /path/to/image.jpg \
    --prompt "Describe this image."

Credits

Originally derived from llama-cpp-2 — thanks to those contributors.
See also bitnet-cpp-rs for highly-quantized BitNet model support.

Citation

@software{hauptmann2025llamacpprs,
  author    = {Hauptmann, Eugene},
  title     = {{llama-cpp-4}: llama-cpp {Rust} wrapper},
  year      = {2025},
  version   = {0.3.0},
  url       = {https://github.com/eugenehp/llama-cpp-rs},
}

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 1,440 Commits
.github		.github
.vscode		.vscode
charts		charts
examples		examples
llama-cpp-4		llama-cpp-4
llama-cpp-sys-4		llama-cpp-sys-4
scripts		scripts
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG.md		CHANGELOG.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
LINUX.md		LINUX.md
MAC.md		MAC.md
MTP.md		MTP.md
README.md		README.md
WINDOWS.md		WINDOWS.md
test-build.Dockerfile		test-build.Dockerfile

Folders and files

Latest commit

History

Repository files navigation

🦙 llama-cpp-rs

Examples

Quick start

Interactive chat

OpenAI-compatible server

Using prebuilt native libraries (skip CMake compile)

Prebuilt Feature Benchmark Results

Text generation (library)

Quantization

TurboQuant — attention rotation

Why it matters

Measured KV-cache space savings

Sweet spot: Q5_0 + TurboQuant

Key properties

Using TurboQuant from Rust

Live demo

MTP — multi-token-prediction speculative decoding

1. Context setup

2. Session config and creation

3. Speculative decode loop (outline)

4. CLI examples

5. Benchmarks and tuning

Incremental prefill

Features

How it works

Benchmark results

1. Latency — normal vs incremental flush at Enter

Perceived speedup

3. GPU Load — BPE margin saves 40–59% total compute vs naive

5. UX — mid-line edit recovery cost

7. KV Cache Quantization + TurboQuant

8. Samplers & Temperature

Usage

Using the incremental prefill API

GPU acceleration

Hugging Face model download

Development

Build Optimizations

Updating llama.cpp

Multimodal Images

Via the OpenAI-compatible server

Standalone multimodal example

Credits

Citation

License

Copyright

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases 44

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages