18 changes: 2 additions & 16 deletions docs.json
@@ -36,7 +36,7 @@
"logo": {
"light": "/logo/light.svg",
"dark": "/logo/dark.svg",
"href": "/"
"href": "https://liquid.ai"
},
"navbar": {
"links": [
@@ -52,20 +52,6 @@
}
},
"navigation": {
"global": {
"anchors": [
{
"anchor": "About Us",
"icon": "building",
"href": "https://www.liquid.ai/company/about"
},
{
"anchor": "Blog",
"icon": "pencil",
"href": "https://www.liquid.ai/company/blog"
}
]
},
"tabs": [
{
"tab": "Documentation",
@@ -202,7 +188,7 @@
]
},
{
"tab": "Guides",
"tab": "Examples",
"groups": [
{
"group": "Get Started",
2 changes: 1 addition & 1 deletion docs/help/faqs.mdx
@@ -69,7 +69,7 @@ For most use cases, Q4_K_M or Q5_K_M provide good quality with significant size
## Fine-tuning

<Accordion title="Can I fine-tune LFM models?">
Yes! Most LFM models support fine-tuning with [TRL](/lfm/fine-tuning/trl) and [Unsloth](/lfm/fine-tuning/unsloth). Check the [Complete Model Library](/lfm/models/complete-library) for trainability information.
Yes! Most LFM models support fine-tuning with [TRL](/docs/fine-tuning/trl) and [Unsloth](/docs/fine-tuning/unsloth). Check the [Model Library](/docs/models/complete-library) for trainability information.
</Accordion>

<Accordion title="What fine-tuning methods are supported?">
135 changes: 15 additions & 120 deletions docs/inference/llama-cpp.mdx
@@ -114,67 +114,25 @@

## Basic Usage

llama.cpp offers three main interfaces for running inference: `llama-cpp-python` (Python bindings), `llama-server` (OpenAI-compatible server), and `llama-cli` (interactive CLI).
llama.cpp offers two main interfaces for running inference: `llama-server` (OpenAI-compatible server) and `llama-cli` (interactive CLI).

<Tabs>
<Tab title="llama-cpp-python">
For Python applications, use the `llama-cpp-python` package.

**Installation:**
```bash
pip install llama-cpp-python
```

For GPU support:
```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```

**Model Setup:**
```python
from llama_cpp import Llama

# Load model
llm = Llama(
model_path="lfm2.5-1.2b-instruct-q4_k_m.gguf",
n_ctx=4096,
n_threads=8
)

# Generate text
output = llm(
"What is artificial intelligence?",
max_tokens=512,
temperature=0.7,
top_p=0.9
)
print(output["choices"][0]["text"])
```

**Chat Completions:**
```python
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing."}
],
temperature=0.7,
max_tokens=512
)
print(response["choices"][0]["message"]["content"])
```
</Tab>

<Tab title="llama-server">
llama-server provides an OpenAI-compatible API for serving models locally.

**Starting the Server:**
```bash
llama-server -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF -c 4096 --port 8080
```

The `-hf` flag downloads the model directly from Hugging Face. Alternatively, use a local model file:
```bash
llama-server -m lfm2.5-1.2b-instruct-q4_k_m.gguf -c 4096 --port 8080
```

Key parameters:
* `-m`: Path to GGUF model file
* `-hf`: Hugging Face model ID (downloads automatically)
* `-m`: Path to local GGUF model file
* `-c`: Context length (default: 4096)
* `--port`: Server port (default: 8080)
* `-ngl 99`: Offload layers to GPU (if available)
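
Once the server is running, any OpenAI-compatible client can query it. As a minimal sketch (assuming the default port and the `/v1/chat/completions` endpoint exposed by llama-server):

```bash
# Send a chat request to the local llama-server (default port 8080 assumed)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is artificial intelligence?"}
    ],
    "max_tokens": 256
  }'
```
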
@@ -216,12 +174,18 @@
<Tab title="llama-cli">
llama-cli provides an interactive terminal interface for chatting with models.

```bash
llama-cli -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF -c 4096 --color -i
```

The `-hf` flag downloads the model directly from Hugging Face. Alternatively, use a local model file:
```bash
llama-cli -m lfm2.5-1.2b-instruct-q4_k_m.gguf -c 4096 --color -i
```

Key parameters:
* `-m`: Path to GGUF model file
* `-hf`: Hugging Face model ID (downloads automatically)
* `-m`: Path to local GGUF model file
* `-c`: Context length
* `--color`: Colored output
* `-i`: Interactive mode
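
For scripted or one-shot use, llama-cli can also run a single prompt and exit. A minimal sketch (the local GGUF filename is illustrative):

```bash
# One-shot generation: pass a prompt with -p and cap the output length with -n
llama-cli -m lfm2.5-1.2b-instruct-q4_k_m.gguf \
  -p "Summarize what a GGUF file is in one sentence." \
  -n 128 -c 4096
```
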
@@ -236,49 +200,12 @@
Control text generation behavior using parameters in the OpenAI-compatible API or command-line flags. Key parameters:

* **`temperature`** (`float`, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
* **`top_p`** (`float`, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top\_p. Typical range: 0.1-1.0
* **`top_k`** (`int`, default 40): Limits to top-k most probable tokens. Typical range: 1-100
* **`max_tokens`** / **`--n-predict`** (`int`): Maximum number of tokens to generate
* **`repetition_penalty`** / **`--repeat-penalty`** (`float`, default 1.1): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
* **`stop`** (`str` or `list[str]`): Strings that terminate generation when encountered
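
As a hedged sketch of how these parameters map onto a llama-server request (`temperature`, `top_p`, `max_tokens`, and `stop` are standard OpenAI-style fields; `top_k` and `repeat_penalty` are llama.cpp extensions whose exact names may vary by version):

```bash
# Chat completion with explicit sampling parameters against a local llama-server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain quantum computing."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 512,
    "repeat_penalty": 1.1,
    "stop": ["<|im_end|>"]
  }'
```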

<Accordion title="llama-cpp-python example">
```python
from llama_cpp import Llama

llm = Llama(
model_path="lfm2.5-1.2b-instruct-q4_k_m.gguf",
n_ctx=4096,
n_threads=8
)

# Text generation with sampling parameters
output = llm(
"What is machine learning?",
max_tokens=512,
temperature=0.7,
top_p=0.9,
top_k=40,
repeat_penalty=1.1,
stop=["<|im_end|>", "<|endoftext|>"]
)
print(output["choices"][0]["text"])

# Chat completion with sampling parameters
response = llm.create_chat_completion(
messages=[
{"role": "user", "content": "Explain quantum computing."}
],
temperature=0.7,
top_p=0.9,
top_k=40,
max_tokens=512,
repeat_penalty=1.1
)
print(response["choices"][0]["message"]["content"])
```
</Accordion>

<Accordion title="llama-server (OpenAI-compatible API) example">
```python
from openai import OpenAI
@@ -305,7 +232,7 @@

## Vision Models

LFM2-VL GGUF models can be used for multimodal inference with llama.cpp.

### Quick Start with llama-cli

@@ -407,45 +334,13 @@
```
</Accordion>

<Accordion title="Using llama-cpp-python">
```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize with vision support
# Note: Use the correct chat handler for your model architecture
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
model_path="lfm2.5-vl-1.6b-q4_k_m.gguf",
chat_handler=chat_handler,
n_ctx=4096
)

# Generate with image
response = llm.create_chat_completion(
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
{"type": "text", "text": "Describe this image."}
]
}
],
max_tokens=256
)
print(response["choices"][0]["message"]["content"])
```
</Accordion>
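
For a quick local test without writing any code, recent llama.cpp builds also ship a multimodal CLI. A hedged sketch (binary and flag names assume a current build; the model and projector filenames are illustrative):

```bash
# Describe an image with the multimodal CLI shipped in recent llama.cpp builds
llama-mtmd-cli -m lfm2.5-vl-1.6b-q4_k_m.gguf \
  --mmproj mmproj-model-f16.gguf \
  --image /path/to/image.jpg \
  -p "Describe this image."
```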

<Info>
For a complete working example with step-by-step instructions, see the [llama.cpp Vision Model Colab notebook](https://colab.research.google.com/drive/1q2PjE6O_AahakRlkTNJGYL32MsdUcj7b?usp=sharing).
</Info>

## Converting Custom Models

If you have a fine-tuned model or need to create a GGUF from a Hugging Face model:

```bash
# Clone llama.cpp if you haven't already