A Cursor-Auto / Claude-tier-style serving setup for local GGUF models, role-aware: coder models for agent work, chat models for planning, with an uncensored chat option for plans that need it.
Each tier can be served by either a local GGUF (default) or a hosted
remote model via LiteLLM — useful for the
top-tier weights that don't fit on a laptop, and for any provider LiteLLM
supports (Anthropic, OpenAI, Groq, Bedrock-as-a-LiteLLM-provider, etc.).
Both backends share the same auto router, so opencode/curl/Cursor never
need to know which one a tier resolves to.
Built on:
- llama.cpp: inference engine (Metal backend)
- llama-swap: multi-model process manager + OpenAI-compatible proxy
- litellm: optional remote-LLM gateway (proxy + dashboard + MCP) for hosted tiers
- a tiny FastAPI router that adds an auto model with intent-based routing in front of llama-swap (which itself fronts the litellm proxy for remote tiers)
client (opencode / curl / Cursor / etc.)
│
▼
http://127.0.0.1:10101 <-- FastAPI router (llmstack.app)
│ • model="auto" → classify → rewrite to one of 3 coder tiers
│ • everything else → pass-through
▼
http://127.0.0.1:10102 <-- llama-swap (binary, manages model lifecycle)
│ • loads/unloads llama-server processes per model
│ • matrix solver allows {code-fast + one heavy model} co-resident
▼
llama-server <code-fast | code-smart | code-ultra>
│
▼
GGUF in ~/.cache/huggingface/hub/...
The whole thing is a pure Python package distributed via standard Python tooling
(pip install opencode-llmstack, or pip install -e . from this repo). Once installed
you get a single llmstack console script. To enable remote LLMs via litellm, install the
optional extra (pip install 'opencode-llmstack[litellm]').
A 64 GB unified memory M4 Max can comfortably hold one always-on tiny coder + one heavy model simultaneously. We split heavy models by role:
- Agent work (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
- Planning (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
- Uncensored planning is a separate plan-tier model, opted in by explicit agent selection (/agent plan-nofilter in opencode).
Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.
| Alias | Model | Quant | Weights | Context | Temp | Role |
|---|---|---|---|---|---|---|
| code-fast | Qwen2.5-Coder-3B-Instruct | Q5_K_M | ~2.5 GB | 128k (YaRN ×4) | 0.2 | autocomplete, FIM, single-line edits, quick Q&A. Always loaded. |
| code-smart | Qwen3-Coder-Next 80B-A3B (MoE) | Q4_K_M (→ UD-Q4_K_XL) | ~45 GB | 64k | 0.5 | agent mode: multi-file edits, tool calls, refactors, debugging |
| plan | Qwopus GLM 18B Merged | Q4_K_M | ~9 GB | 64k (2× native) | 0.7 | plan mode: design, architecture, trade-off discussions |
| plan-uncensored | Mistral-Small 3.2 24B Heretic (i1) | i1-Q4_K_M (→ i1-Q6_K) | ~13 GB | 128k (native) | 0.85 | plan mode, no filter: when the topic requires it |
Temperature ladder (low → high = "doing" → "thinking"): code-fast 0.2 (deterministic) · code-smart 0.5 (balanced agent) · plan 0.7 (creative ideation) · plan-uncensored 0.85 (max exploration).
opencode agent.<name>.temperature is set to match — clients can still override per request.
The router runs a step-down fidelity ladder: start at the top tier for new / short conversations, drop down as the context grows. This inverts the classic "escalate when input gets big" pattern, and it matches how these models actually behave on this stack:
- Top-tier remote (Claude Opus/Sonnet via litellm): fastest and most accurate on short prompts, but per-request latency and $ cost scale with input tokens, and long-context behaviour degrades faster than headline benchmarks suggest.
- code-smart (Qwen3-Coder 80B): 64k window. Sweet spot is the middle of that range; saturates near the top.
- code-fast (Qwen2.5-Coder 3B + YaRN ×4): 128k window, always-resident, free. Smaller models lean on explicit context rather than priors, so they tend to improve relative to top-tier as the conversation grows.
First match wins (auto-routing only; plan and plan-uncensored are not auto-routed):
| # | Condition | → Model | Reason |
|---|---|---|---|
| 1 | [ultra] / [opus] / ultra: trigger AND code-ultra tier configured | code-ultra | explicit top-tier opt-in |
| 2 | estimated input ≤ 12 000 tokens | code-ultra (or code-smart if ultra unwired) | top tier: context still being built, latency/$ are best here |
| 3 | estimated input ≤ 32 000 tokens | code-smart | mid-context, local heavy coder is at its sweet spot |
| 4 | otherwise (long context) AND ≥ 10 user turns | code-smart | floor: deep agentic loop, keep the heavy model |
| 5 | otherwise (long context) | code-fast | 128k YaRN window + always-resident + free |
Token estimates are chars / 4 over all message text + prompt. The
code-ultra rungs (1 and 2) are gated on availability: when no
[code-ultra] section is loaded from models.ini, both silently fall
back to code-smart so vanilla installs don't 404.
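The ladder above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code in llmstack/app.py: the function names, the exact trigger regex, and the choice to look for the trigger only in the last user message are assumptions; the thresholds are the defaults quoted in this README.

```python
import re

# Defaults quoted from the [ROUTING] section described in this README.
HIGH_FIDELITY_CEILING = 12_000
MID_FIDELITY_CEILING = 32_000
MULTI_TURN = 10

# Assumed trigger syntax: "[ultra]", "[opus]" or "ultra:" in the last user message.
ULTRA_TRIGGER = re.compile(r"\[(?:ultra|opus)\]|\bultra:", re.IGNORECASE)


def estimate_tokens(messages, prompt=""):
    """Rough token estimate: total characters divided by four."""
    chars = len(prompt)
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            chars += len(content)
    return chars // 4


def pick_tier(messages, prompt="", ultra_available=False):
    """Walk the ladder top to bottom; the first matching rung wins."""
    # Rungs 1 and 2 fall back to code-smart when code-ultra is not configured.
    top = "code-ultra" if ultra_available else "code-smart"
    user_texts = [
        m.get("content") if isinstance(m.get("content"), str) else ""
        for m in messages
        if m.get("role") == "user"
    ]
    last_user = user_texts[-1] if user_texts else prompt
    if ULTRA_TRIGGER.search(last_user):          # rung 1: explicit opt-in
        return top
    tokens = estimate_tokens(messages, prompt)
    if tokens <= HIGH_FIDELITY_CEILING:          # rung 2: short context
        return top
    if tokens <= MID_FIDELITY_CEILING:           # rung 3: mid context
        return "code-smart"
    if len(user_texts) >= MULTI_TURN:            # rung 4: deep agentic loop
        return "code-smart"
    return "code-fast"                           # rung 5: long context
```

Because every rung is a length check or a single regex search, the decision stays cheap compared with an LLM-based classifier.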
llmstack install generates an opencode config at
<work-dir>/.llmstack/opencode.json (derived from models.ini), where
<work-dir> is whatever directory you ran llmstack from (or
$LLMSTACK_WORK_DIR). You can cd into any project and run
llmstack install to get a project-local config there. The script also
copies AGENTS.md next to the generated JSON, so the .llmstack/ folder
is a self-contained opencode bundle. Your global
~/.config/opencode/opencode.json is never modified by this stack.
opencode picks up our config because llmstack start drops you into a subshell (or, with the activate hook installed, your current shell is wired up directly) with these env vars exported:
| Env var | Value |
|---|---|
| OPENCODE_CONFIG | <work-dir>/.llmstack/opencode.json (overrides global, sits below project configs) |
| LLMSTACK_CHANNEL | current, next, or external (thin client of an llmstack router, see below) |
| LLMSTACK_ACTIVE | 1 (used to refuse recursive entry) |
| LLMSTACK_ROOT | absolute path to the installed llmstack package |
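As a rough illustration of how such an environment could be assembled before spawning the subshell, here is a minimal sketch. The helper name build_subshell_env and its signature are hypothetical and not taken from shell_env.py; only the four variable names and the recursion guard come from the table above.

```python
from pathlib import Path


def build_subshell_env(base_env, work_dir, channel, package_root):
    """Return a copy of base_env with the llmstack variables added (hypothetical helper)."""
    # LLMSTACK_ACTIVE=1 marks an already-prepared shell; refuse to nest.
    if base_env.get("LLMSTACK_ACTIVE") == "1":
        raise RuntimeError("already inside an llmstack shell; refusing to nest")
    env = dict(base_env)
    env["OPENCODE_CONFIG"] = str(Path(work_dir) / ".llmstack" / "opencode.json")
    env["LLMSTACK_CHANNEL"] = channel
    env["LLMSTACK_ACTIVE"] = "1"
    env["LLMSTACK_ROOT"] = str(Path(package_root))
    return env
```

The resulting dict could be passed as env= to subprocess.run when launching the user's shell.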
The router and llama-swap daemons are singletons on ports 10101 and 10102 respectively.
The channel is pinned at install time in .llmstack/default-channel
and never auto-detected at runtime — one project on the host owns the
daemons (installed local), and any other project on the same host that
wants to consume them is installed --external (defaulting to
http://127.0.0.1:10101). This avoids the footgun where a "shared"
project's stop would tear down daemons it can't bring back up.
The shell's prompt is prefixed with [llmstack:<channel>] so you always
know whether you're in the env or not. Bash and zsh source your normal
rc first, then add the prefix; other shells just get the env vars.
Inside the subshell, run opencode and it will pick up the wiring
below. Outside the subshell (any other terminal), opencode keeps using
your global setup unchanged.
| opencode agent | Local model |
|---|---|
| default model | llama.cpp/auto (router-routed) |
| small_model (titles, tasks, tab autocomplete) | llama.cpp/code-fast |
| agent.build (default builder) | llama.cpp/code-smart |
| agent.plan (read-only planner) | llama.cpp/plan |
| agent.plan-nofilter (custom uncensored planner) | llama.cpp/plan-uncensored |
Inside opencode you can switch agents with /agent or by @-mentioning a custom
one (for example @plan-nofilter). The plan and plan-uncensored tiers are never auto-routed from the build agent;
they're only reachable via explicit agent selection (/agent plan or /agent plan-nofilter).
Want a second terminal into the same stack? Install the activate hook
once (eval "$(llmstack activate zsh)") and any new shell that cds
into the project picks up OPENCODE_CONFIG automatically. Want to run
opencode without the hook? OPENCODE_CONFIG=$PWD/.llmstack/opencode.json opencode
from any directory you previously ran install in.
llmstack/ # repo root
├── pyproject.toml # package metadata + `llmstack` console script
├── README.md # this file
├── UPGRADING.md # how to swap any tier for a newer/better model
│ + how to upgrade the Python toolchain itself
├── LICENSE # MIT
└── llmstack/ # the python package (importable, installable)
├── __init__.py
├── __main__.py # `python -m llmstack`
├── cli.py # arg dispatch (the `llmstack` console-script)
├── paths.py # state / bin / work dir resolution + env overrides
├── shell_env.py # spawn the env-prepared subshell + activate hooks
├── app.py # FastAPI auto-router (~280 lines)
├── tiers.py # parse models.ini -> Tier dataclasses
├── models.ini # SINGLE SOURCE OF TRUTH for tiers + sampler (bundled template)
├── check_models.py # snapshot tool (HF metadata + drift check)
├── AGENTS.md # opencode agent template (shipped as package data)
├── generators/
│ ├── llama_swap.py # render llama-swap.yaml from models.ini
│ └── opencode.py # render opencode.json from models.ini
├── download/
│ ├── ggufs.py # background GGUF downloader
│ └── binary.py # llama-swap release downloader
└── commands/ # one module per CLI action
├── setup.py # first-time walkthrough
├── install.py # generate opencode.json (+ AGENTS.md copy)
├── install_llama_swap.py
├── download.py
├── start.py
├── stop.py
├── restart.py
├── reload.py
├── status.py
├── check.py
└── activate.py
Per-project state (gitignored) is created lazily under <work-dir>/.llmstack/:
.llmstack/
├── opencode.json consumed via OPENCODE_CONFIG (written by `install`)
├── AGENTS.md copy of the package template (written by `install`)
├── llama-swap.yaml generated runtime config (written by `start`)
├── default-channel pinned by `llmstack install`
├── active-channel written by `llmstack start`, removed by `stop`
├── llama-swap.pid daemon pid files
├── router.pid
├── llmstack.bashrc prompt-prefix rcfile (bash)
├── zdotdir/ prompt-prefix rcfile (zsh)
└── logs/
├── llama-swap.log
├── router.log
└── dl-*.log
The llama-swap binary lives outside any project at
$XDG_DATA_HOME/llmstack/bin/llama-swap on macOS/Linux (override with
LLMSTACK_BIN_DIR), or %LOCALAPPDATA%\llmstack\bin\llama-swap.exe on Windows.
One download is reused across all projects.
Everything runs through one entry point: llmstack <action>.
Run llmstack help to see all actions and options.
# 0. Install the package (editable, from this repo).
python3 -m venv .venv
.venv/bin/pip install -e .
# 1. (Recommended) raise GPU-wired memory to fit code-fast + code-smart together.
sudo sysctl iogpu.wired_limit_mb=57344
# 2. Full setup: download GGUFs, wait, install the llama-swap binary, print
# the activation hook, check opencode is on PATH. Stepwise & idempotent;
# re-running it later is safe.
llmstack setup
# 3. Generate this project's .llmstack/opencode.json (+ AGENTS.md copy).
# `install` does NOT touch llama-swap.yaml -- that's regenerated
# fresh by `start` for the channel you're booting into.
llmstack install
# 4. Generate .llmstack/llama-swap.yaml for the chosen channel, bring up
# llama-swap + router. With the activate hook installed (see below),
# your prompt is already wired to .llmstack/opencode.json -- just run
# `opencode`. Without the hook, `start` falls back to spawning a
# subshell with OPENCODE_CONFIG set, prefixed with [llmstack:current].
# Daemons keep running when you exit; stop them with `llmstack stop`.
llmstack start
# 4a. Daemons only (no fallback subshell, return immediately).
llmstack start --detach
# 4b. Want auto-activation in any new terminal you cd into? Install once:
eval "$(llmstack activate zsh)"
# add the same line to ~/.zshrc to make it stick.
# 5. Sanity check (works from any terminal)
llmstack status
curl -s http://127.0.0.1:10101/v1/models | jq '.data[].id'
curl -s http://127.0.0.1:10101/models.ini | head # what thin clients see

To stop everything: llmstack stop.
The CLI runs the same way on Windows (PowerShell or cmd.exe); the only
moving parts that differ are the binary asset and the activation hook.
# 0. Install the package (editable, from this repo).
py -3 -m venv .venv
.venv\Scripts\pip install -e .
# 1. Pull GGUFs + the windows_amd64 llama-swap binary (lives under
# %LOCALAPPDATA%\llmstack\bin\llama-swap.exe).
.venv\Scripts\llmstack setup
# 2. Generate this project's .llmstack\opencode.json (+ AGENTS.md copy).
.venv\Scripts\llmstack install
# 3. Generate .llmstack\llama-swap.yaml for the chosen channel, bring up
# the stack. If you've installed the activate hook (step 4) the
# current shell is already wired to .llmstack\opencode.json; otherwise
# `start` falls back to spawning a PowerShell subshell.
.venv\Scripts\llmstack start
# 4. Auto-activate per project from any new PowerShell window. The hook
# file is a .ps1 (PowerShell won't dot-source it without that
# extension) and dot-sourcing it requires script execution to be
# allowed -- if you see "running scripts is disabled on this
# system", run once:
# Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
llmstack activate powershell | Out-String | Invoke-Expression
# or persist (writes ~/.powershell_llmstack_hook.ps1 + sources it on every shell):
"llmstack activate powershell | Out-String | Invoke-Expression" | Add-Content $PROFILE

Notes:
- Only windows_amd64 llama-swap binaries are published upstream; arm64 Windows is not supported. GPU acceleration uses whatever backend llama-server was built with (CUDA / Vulkan / CPU): get llama-server.exe from the llama.cpp Windows releases or a package like winget install ggml.llama-cpp and put it on PATH (or set $env:LLAMA_SERVER_BIN). The Mac-only iogpu.wired_limit_mb step does not apply.
- The [llmstack:<channel>] prompt prefix shows up in PowerShell; cmd.exe does not support custom prompts in the same way, so activation is PowerShell-only.
- Stopping daemons uses taskkill /T /F under the hood, so the llama-server children get cleaned up as well.
llmstack install --external [URL] wires this project as a thin client
of an llmstack router — no llama-swap, no router, no GGUFs needed
locally, and no local models.ini. The thin-client install:
- Fetches GET URL/models.ini live from the router (this also doubles as the health check: a 200 with valid INI proves the router is up).
- Renders opencode.json against the fetched content so tier names + descriptions agree with what the router actually serves.
- Pins .llmstack/default-channel = "external <url>" so subsequent commands know they're in client mode.
There is no client-side cache: every install re-fetches. To pick up
a tier edit on the router, just re-run llmstack install here.
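The "200 with valid INI" check can be approximated with the standard library alone. The sketch below is an assumption about how such a probe might look, not the installer's actual code; the function names are hypothetical, and "valid" is interpreted here as "parses as INI and declares at least one tier section".

```python
import configparser
import urllib.request


def looks_like_models_ini(text):
    """True when text parses as INI and has at least one non-DEFAULT section."""
    parser = configparser.ConfigParser(
        interpolation=None, inline_comment_prefixes=(";",)
    )
    try:
        parser.read_string(text)
    except configparser.Error:
        return False
    return bool(parser.sections())


def router_is_up(base_url, timeout=5.0):
    """Fetch <base_url>/models.ini and validate the body."""
    url = base_url.rstrip("/") + "/models.ini"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            text = response.read().decode("utf-8", errors="replace")
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False
    return looks_like_models_ini(text)
```

Parsing the body rather than trusting the status code guards against an unrelated service answering 200 on the same port.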
URL precedence at install time: --external <url> arg > $LLMSTACK_REMOTE_URL
env var > the local router (http://127.0.0.1:10101). You normally
don't set the env var yourself — the activate hook does it for you
when you cd into an external-installed project (see below).
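The precedence rule is small enough to state as code. A minimal sketch, assuming a hypothetical helper name and that empty values are treated as unset:

```python
DEFAULT_ROUTER_URL = "http://127.0.0.1:10101"


def resolve_remote_url(cli_url=None, environ=None):
    """--external <url> beats $LLMSTACK_REMOTE_URL, which beats the local router."""
    environ = {} if environ is None else environ
    for candidate in (cli_url, environ.get("LLMSTACK_REMOTE_URL")):
        if candidate:
            return candidate.rstrip("/")
    return DEFAULT_ROUTER_URL
```

In practice the second argument would be os.environ; passing it explicitly keeps the function easy to test.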
Two flavours of the same mode:
Same host, two projects. One project owns the daemons (local install), the others are thin clients of localhost. Zero config:
# project A — owns the daemons
cd ~/projA && llmstack install && llmstack start
# project B — consumes them
cd ~/projB && llmstack install --external
# baseURL = http://127.0.0.1:10101/v1
# default-channel = "external http://127.0.0.1:10101"
# (no local models.ini -- fetched from project A's router)
llmstack start # verifies /models.ini, drops into the client subshell

Different host. Point at a beefy desktop's router from a laptop:
# laptop -> desktop running llmstack on 10.0.0.5
llmstack install --external http://10.0.0.5:10101
llmstack start # verifies http://10.0.0.5:10101/models.ini
opencode # talks straight to the remote router

(LLMSTACK_REMOTE_URL=http://10.0.0.5:10101 llmstack install also
works; the env var is honoured as an alternative way in.)
The URL is persisted into the channel marker, so any new terminal you
open with the activate hook installed (eval "$(llmstack activate zsh)")
will re-export LLMSTACK_REMOTE_URL automatically when you cd into
the project. The prompt is medium-purple with the URL:
[llmstack:<project> http://10.0.0.5:10101]. From inside that
activated shell, llmstack install re-fetches models.ini without
needing the flag or URL again.
The local commands that manage local resources (setup, download,
install-llama-swap) refuse when the project is installed --external.
stop is a no-op (nothing local to tear down) — to stop the daemons
themselves, run llmstack stop from the project that owns them (the
one installed local).
llmstack activate <shell> writes the hook to
~/.<shell>_llmstack_hook and prints a source line to stdout, so a
single eval both regenerates the file and turns the hook on in your
current shell. Pasting the same eval into your rc keeps it on for
every new shell:
# ~/.zshrc (zsh)
eval "$(llmstack activate zsh)"
# or ~/.bashrc (bash)
eval "$(llmstack activate bash)"

With the hook installed, cd into any project that has a .llmstack/
and your shell is wired up automatically — OPENCODE_CONFIG,
LLMSTACK_WORK_DIR, LLMSTACK_CHANNEL (and LLMSTACK_REMOTE_URL for
projects installed --external) all toggle on/off as you walk in and
out. There is no separate llmstack shell command — this is the shell
command.
llmstack install # opencode.json + AGENTS.md (no GGUF downloads)
llmstack install-llama-swap --force # re-pull llama-swap binary only
llmstack setup --skip-download # full setup minus the GGUF pull
llmstack setup --skip-wait # kick off downloads in background, install now
llmstack check # snapshot configured GGUFs + flag drift
llmstack start --next # try queued hf_file_next upgrades (reversible)
llmstack restart --next # cycle into the next channel

All of these go to /v1/chat/completions on :10101. The auto router classifies on estimated input tokens and user-turn count:
# short prompt -> top tier (code-ultra, or code-smart when ultra is unwired)
curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"auto","stream":false,
"messages":[{"role":"user","content":"capital of France?"}]}' | jq .model
# agent work, still short context -> same top-tier rung
curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"auto","stream":false,
"messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model

To access plan or plan-uncensored tiers, use explicit agent selection in opencode (/agent plan or /agent plan-nofilter) rather than model=auto.
| Port | Service | Purpose |
|---|---|---|
| 10101 | router (FastAPI) | What clients hit. OpenAI-compatible. Adds auto model. |
| 10102 | llama-swap | Lifecycle manager. Useful UI at http://127.0.0.1:10102/ui/. |
| 10001+ | llama-server children | Internal, allocated dynamically per model. |
The router exposes:
- GET /models.ini ← raw config text (used by install --external and as the health check)
- GET /v1/models ← injects auto then proxies the rest
- POST /v1/chat/completions ← classify if model=="auto", then proxy
- POST /v1/completions ← same
- * ← pass-through reverse proxy
There is no /health route on the router — GET /models.ini
returning a 200 + valid INI is the canonical "router is up and
configured" signal. (Hitting /health still works for legacy curl
users, but it's just the catch-all proxying through to llama-swap's
own /health endpoint.)
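For clients that prefer Python over curl, the chat endpoint can be called with the standard library. This is a minimal sketch; the helper names are hypothetical, and it assumes the router runs on the default local port and returns an OpenAI-style JSON body with a model field.

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:10101"  # default router address from this README


def build_chat_request(prompt, model="auto", base_url=ROUTER_URL):
    """Build a non-streaming OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def resolved_model(prompt, model="auto"):
    """Send the request and report which model the response names."""
    request = build_chat_request(prompt, model)
    # A generous timeout allows for the first-request model load.
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)["model"]
```

Calling resolved_model("capital of France?") against a running stack should show which tier the auto router picked; passing a concrete alias such as code-fast skips the classification step.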
macOS caps GPU-wired memory at ~48 GB (75 % of RAM) by default. To unlock more for the GPU:
sudo sysctl iogpu.wired_limit_mb=57344 # 56 GB to GPU; survives until reboot

Resident with our defaults (KV q8_0, full configured context):
| Combo | Weights | + KV | Total | Status |
|---|---|---|---|---|
| code-fast + code-smart (Q4_K_M) | 47.5 GB | ~5 GB | ~53 GB | needs wired_limit bump |
| code-fast + code-smart (UD-Q4_K_XL) | ~52 GB | ~5 GB | ~57 GB | needs wired_limit bump |
| code-fast + plan | 11.5 GB | ~4.5 GB | ~16 GB | trivial |
| code-fast + plan-uncensored | 15.5 GB | ~12.5 GB | ~28 GB | trivial |
| code-fast + plan + plan-uncensored | ~25 GB | ~14.5 GB | ~40 GB | both chats together |
| code-smart + plan-uncensored | 58 GB | … | ❌ | matrix forbids |
KV cache only fills up as context grows — these are worst-case numbers at the configured max context. Typical usage will be far less.
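The table's arithmetic can be checked against the two wired-memory limits mentioned above. The sketch below simply re-adds the approximate Weights and KV columns; the figures are copied from the table, and the helper name fits is an assumption for illustration.

```python
DEFAULT_WIRED_GB = 48.0          # macOS default cap quoted above
RAISED_WIRED_GB = 57344 / 1024   # iogpu.wired_limit_mb=57344 -> 56.0 GB

# (weights GB, worst-case KV GB), approximate values copied from the table.
COMBOS = {
    "code-fast + code-smart (Q4_K_M)": (47.5, 5.0),
    "code-fast + code-smart (UD-Q4_K_XL)": (52.0, 5.0),
    "code-fast + plan": (11.5, 4.5),
    "code-fast + plan-uncensored": (15.5, 12.5),
    "code-fast + plan + plan-uncensored": (25.0, 14.5),
}


def fits(combo, limit_gb):
    """True when weights plus worst-case KV stay within the wired limit."""
    weights, kv = COMBOS[combo]
    return weights + kv <= limit_gb


for name in COMBOS:
    print(f"{name}: default={fits(name, DEFAULT_WIRED_GB)} raised={fits(name, RAISED_WIRED_GB)}")
```

With these approximate figures, the UD-Q4_K_XL combination lands around 57 GB at full configured context, slightly above a 56 GB limit. Since KV only fills as context grows, that is a worst-case bound rather than the typical footprint.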
The matrix declares which combinations are valid. When you ask for a model that isn't currently loadable, the solver picks the cheapest set to swap into.
All three pre-queued upgrades are same-model, higher-quant — drop-in replacements with no behaviour change beyond quality.
Logs are named dl-<tier>-<label>.log where <label> is current (file
in models.ini hf_file) or next (file in models.ini hf_file_next).
| When this log shows EOF (download done) | …edit llama-swap.yaml -hff line in this tier | …to |
|---|---|---|
| logs/dl-code-smart-next.log | code-smart | Qwen3-Coder-Next-UD-Q4_K_XL.gguf |
| logs/dl-plan-next.log | plan | Qwopus-GLM-18B-Healed-Q6_K.gguf |
| logs/dl-plan-uncensored-next.log | plan-uncensored | Mistral-Small-3.2-24B-Instruct-2506-ultra-uncensored-heretic.i1-Q6_K.gguf |
The -hf <repo> lines stay the same; only the -hff <filename> line changes.
After editing, also flip hf_file ↔ hf_file_next in models.ini so
llmstack check no longer reports DRIFT!.
Then llmstack restart.
For changing to a different model entirely (different family/provider) see UPGRADING.md.
Router knobs live in .llmstack/models.ini; edit and re-run
llmstack install to apply (llmstack restart to reload daemons).
| models.ini key | Default | Meaning |
|---|---|---|
| [DEFAULT] host | 127.0.0.1 | router listen host (also used as upstream llama-swap host) |
| [DEFAULT] router_port | 10101 | router listen port |
| [ROUTING] high_fidelity_ceiling | 12000 | tokens; at or below this, route to top tier (ultra → smart fallback). Paired with code-ultra.ctx_size = 24000 (2x). |
| [ROUTING] mid_fidelity_ceiling | 32000 | tokens; at or below this, route to code-smart; beyond, step down to code-fast. Paired with code-smart.ctx_size = 64000 (2x). |
| [ROUTING] multi_turn | 10 | user-turn count that floors the long-context rung at code-smart. |
Auto-router rungs (fast / agent / ultra) are resolved by
matching the role field of each [tier] block, so renaming a tier
section needs no further config. The only env var the router still
honours is LOG_LEVEL (default info).
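Reading these knobs with their defaults is straightforward with configparser. The following is a sketch under the assumption that the three [ROUTING] keys are plain integers; the function name load_routing is hypothetical and not taken from tiers.py.

```python
import configparser

# Defaults from the table above.
ROUTING_DEFAULTS = {
    "high_fidelity_ceiling": 12000,
    "mid_fidelity_ceiling": 32000,
    "multi_turn": 10,
}


def load_routing(text):
    """Parse models.ini text and return the routing knobs, falling back to defaults."""
    parser = configparser.ConfigParser(
        interpolation=None, inline_comment_prefixes=(";",)
    )
    parser.read_string(text)
    # getint(..., fallback=...) also covers a missing [ROUTING] section.
    return {
        key: parser.getint("ROUTING", key, fallback=default)
        for key, default in ROUTING_DEFAULTS.items()
    }
```

For example, a file that only sets mid_fidelity_ceiling = 24000 keeps the other two values at 12000 and 10.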
To force a request to never auto-route, set model to a concrete alias (code-fast, code-smart, plan, plan-uncensored, or any of their listed aliases like agent, glm, nofilter, …).
The plan-uncensored tier is accessible via explicit agent selection only:
- In opencode: /agent plan-nofilter (or mention @plan-nofilter).
- Via opencode config: set agent.plan-nofilter as your active agent.
llama-swap won't start → check .llmstack/logs/llama-swap.log. Most common causes: port 10102 already in use, or a typo in llama-swap.yaml.
First request hangs for ~60 s → that's the model loading from disk into Metal memory. sendLoadingState: true will surface "loading…" in the SSE stream. Once the model is loaded, subsequent requests skip that wait.
OOM / unexplained slowdown → run top -o mem -stats pid,rsize,command to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, llmstack restart.
Auto picks the wrong model → adjust the regex in llmstack/app.py (ULTRA_TRIGGERS) or move the ladder ceilings via the [ROUTING] section of models.ini (high_fidelity_ceiling / mid_fidelity_ceiling). To force a request to never auto-route, pass an explicit model (e.g. code-smart) instead of auto.
Want a pure pass-through (no auto routing) → change opencode's baseURL to http://127.0.0.1:10102/v1 (llama-swap directly) and only use concrete model names. (LiteLLM tiers remain reachable via llama-swap, since the litellm_proxy model is registered there with each tier as an alias; only the auto rewriting is skipped.)
logs/dl-*.log is multi-GB and growing → you're hitting llama.cpp issue #14802 where modern llama-cli is chat-only and ignores -no-cnv, looping > prompts forever (~1.5 MB/s). Fix: llmstack download already prefers llama-completion over llama-cli when both are present (brew install llama.cpp ships both as of 2025). If you only have legacy llama-cli, either upgrade llama.cpp or kill the runaways with pkill -9 -f llama-cli.
LiteLLM tier 401 / auth errors → the router process didn't see your provider API key. Credentials are read from environment variables at process start (ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, AWS_* for Bedrock-via-LiteLLM, …). Export them in the shell before llmstack start, or set them in .llmstack/litellm_config.yaml per-model via os.environ/<NAME>. Restart the stack after changing creds — they're not picked up live.
Any tier in models.ini that declares a model = <provider>/<model-id> key is
served by LiteLLM instead of llama-swap. The same
tier names + auto-routing apply, so swapping code-smart from a local GGUF to
Claude Sonnet on Anthropic is a models.ini edit + llmstack install +
llmstack restart away — clients don't change.
[code-smart]
tier = code
role = agent
backend = litellm
model = anthropic/claude-sonnet-4-20250514
ctx_size = 200000
max_output_tokens = 16384
sampler = temp=0.5 ; Sonnet 4 accepts ONE of temp / top_p
description = Claude Sonnet 4 via Anthropic API - heavy coder for agent loops

The bundled models.ini template ships every tier with a commented-out
"LiteLLM alternative" block beneath the active GGUF block. To swap a tier:
comment out the GGUF block, uncomment the litellm block, and run llmstack install && llmstack restart. The code-ultra tier is shipped fenced with
AUTO-ENABLE-WHEN-LITELLM-AVAILABLE markers — llmstack install
auto-uncomments it on first seed when import litellm succeeds, otherwise
it stays inert.
Sampler is per-tier, declared in models.ini, applied per backend. opencode.json is intentionally sampler-free in both cases; clients just specify a model. How the sampler reaches the actual inference engine depends on the backend:

- gguf tiers: the llama-swap generator bakes each tier's sampler = … keys into its llama-server startup command line as --temp / --top-p / --top-k / --min-p / --repeat-penalty flags. llama-server applies them as its defaults for every request. The router doesn't touch the body.
- LiteLLM tiers: LiteLLM has no server-side defaults mechanism, so the router injects the sampler keys into each outbound request body (mapping temp → temperature, top_p → top_p; the other llama.cpp-extension keys top_k / min_p / rep_pen are passed through and silently dropped by LiteLLM if the provider doesn't accept them, since litellm_settings.drop_params: true is set in the bundled litellm_config.yaml). Caller-supplied values in the request body still win for per-call overrides.

Per-provider sampler rules (declare only what your model accepts):

| Provider / model family | What sampler may contain |
|---|---|
| Claude Opus 4 (4.0+) | (omit sampler = entirely; Opus 4 rejects all sampler params) |
| Claude Sonnet 4 / Haiku 4.5 | temp or top_p, never both |
| Claude Opus 3.x | temp and/or top_p |
| OpenAI / Llama / Groq / Mistral / Cohere / etc. | temp and/or top_p (check the model card) |

Local gguf tiers accept the full set (temp, top_p, top_k, min_p, rep_pen); llama-server honours all of them as startup defaults.
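The two delivery paths can be contrasted in a short sketch. It assumes the sampler value is a list of whitespace- or comma-separated key=value pairs (only the single-key form temp=0.5 appears in this README), and all function names are hypothetical rather than the generators' real API.

```python
# Mapping from models.ini sampler keys to llama-server flags named above.
LLAMA_SERVER_FLAGS = {
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "min_p": "--min-p",
    "rep_pen": "--repeat-penalty",
}
# Only temp is renamed for LiteLLM; other keys pass through unchanged.
LITELLM_RENAMES = {"temp": "temperature"}


def parse_sampler(value):
    """Parse 'temp=0.5 top_p=0.9' (assumed format) into a dict of floats."""
    pairs = {}
    for token in value.replace(",", " ").split():
        key, sep, raw = token.partition("=")
        if sep:
            pairs[key] = float(raw)
    return pairs


def llama_server_args(sampler):
    """gguf path: sampler keys become startup flags."""
    args = []
    for key, value in sampler.items():
        args += [LLAMA_SERVER_FLAGS[key], f"{value:g}"]
    return args


def inject_sampler(body, sampler):
    """LiteLLM path: tier defaults fill in only what the caller did not set."""
    merged = dict(body)
    for key, value in sampler.items():
        merged.setdefault(LITELLM_RENAMES.get(key, key), value)
    return merged
```

Using setdefault is what lets caller-supplied values win: a request that already carries temperature keeps it, and the tier default only applies when the field is absent.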
models.ini is meant to be committable, so it only names the model.
Credentials live in environment variables (or LiteLLM's own config), and
LiteLLM resolves them transparently:
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GROQ_API_KEY=gsk-...
# AWS Bedrock through LiteLLM uses the boto3 chain (env vars / ~/.aws/*)
export AWS_REGION=eu-west-3

See LiteLLM provider docs for
provider-specific setup (Bedrock role chaining, Azure deployment names,
custom endpoints, …). Per-tier credential overrides live in
<work-dir>/.llmstack/litellm_config.yaml — llmstack install
non-destructively merges new tier stubs into its model_list, so any
edits you make to existing entries (custom api_base, per-model API
keys via os.environ/<NAME>, retries, fallbacks, …) survive across
installs.
| Key (in models.ini) | Meaning |
|---|---|
| model | LiteLLM model id (anthropic/claude-sonnet-4-..., openai/gpt-4o-..., bedrock/eu.anthropic.claude-..., groq/llama-3.1-70b-..., etc.). Required. |
| model_next | Queued upgrade target. Mirrors gguf hf_file_next: llmstack install --next swaps the tier to this model id until you switch back; permanent promotion is model edit + llmstack install. |
| max_output_tokens | Cap on output tokens for the tier. Useful for cost discipline on top-tier models. |
| backend = litellm | Optional explicit override; auto-detected when model is set. |
Banned in models.ini (file is meant to be committable): any key holding
secrets — API keys, AWS access keys, session tokens, role ARNs, etc. Put
them in environment variables, your provider's standard credential file
(~/.aws/credentials, ~/.config/openai/...), or the per-project
litellm_config.yaml referencing os.environ/<NAME>.
Install the LiteLLM extra (it's opt-in so the local-only path stays small):
pip install -e '.[litellm]' # editable, from this repo
pip install 'opencode-llmstack[litellm]' # from PyPI

The extra pulls in litellm[proxy] which provides the litellm CLI.
Internally, llmstack start registers the proxy as a model in
llama-swap (litellm_proxy on 127.0.0.1:10103) with every
litellm-backed tier as an alias. A request for code-smart (or any
alias) flows: client → router (:10101) → llama-swap (:10102) →
litellm proxy (:10103) → provider. Streaming and non-streaming both
work; tool calls are passed through. The proxy's admin UI lives at
http://127.0.0.1:10103/ui and its MCP gateway at /mcp.
Hosted tiers are skipped by llmstack download (nothing to fetch) and
by the llama-swap.yaml matrix (nothing to load locally). They show up
in llmstack check with the model id instead of HF metadata, and in
/v1/models alongside the local GGUF tiers — including a
channel: current|next metadata field so clients can tell which model
id they're actually talking to.
llmstack install --next flips both backends in lock-step: gguf tiers
swap to hf_file_next and litellm tiers swap to model_next (the
router subprocess sees LLMSTACK_USE_NEXT=1 and rewrites
body["model"] to <tier>_next, which llama-swap routes to the
<tier>_next entry in litellm_config.yaml). Either backend having a
queued upgrade is enough to satisfy --next.
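The litellm half of that rewrite might look roughly like the sketch below. The function name and the litellm_tiers argument (the set of litellm-backed tier names that have a model_next queued) are assumptions for illustration; only the LLMSTACK_USE_NEXT=1 flag and the <tier>_next naming come from the description above.

```python
def rewrite_for_channel(body, environ, litellm_tiers):
    """Point litellm-backed tiers at their <tier>_next entry on the next channel."""
    use_next = environ.get("LLMSTACK_USE_NEXT") == "1"
    model = body.get("model")
    if use_next and model in litellm_tiers:
        # Return a copy so the caller's request body is left untouched.
        return {**body, "model": f"{model}_next"}
    return body
```

Requests for gguf tiers pass through unchanged here, since their swap happens through the hf_file_next file selection rather than through the request body.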
See UPGRADING.md — covers why models must be GGUF, where to
find candidates, how to evaluate "better" per tier, the safe upgrade workflow,
and a worked example. Run llmstack check for a snapshot of what's
currently configured along with HF URLs to compare against.