The only setup that actually works. Run Claude Code with local LLMs on Apple Silicon — real tool execution, real agentic loops, fully offline.
Every tutorial out there tells you to point Claude Code at Ollama or llama.cpp and call it a day. None of them work. The model generates text that looks like a tool call, but nothing executes. No files get created, no commands run, no code gets written. You're watching a convincing hallucination.
This project uses vllm-mlx — the only backend that speaks Claude Code's native language: the Anthropic Messages API with real tool_use content blocks. When the model decides to read a file, it actually reads the file. When it writes code, the code lands on disk. The agentic loop works — tool calls chain into tool results, the model iterates, and you get the real Claude Code experience running entirely on your hardware.
No API key. No cloud. No subscription. No data leaves your machine. Just ./install.sh and go.
- Apple Silicon Mac (M1/M2/M3/M4/M5)
- 16GB+ unified memory (24GB+ recommended)
- Claude Code installed
- Homebrew
git clone https://github.com/vitorallo/claude-code-local.git
cd claude-code-local
./install.sh
cclocalFirst run downloads the default model (~5GB, one-time). Then starts vllm-mlx and launches Claude Code.
In Claude Code, type:
create a file called /tmp/test_tools.txt with "hello world"
Working: Claude Code calls the Write tool, creates the file, confirms. Broken: Claude Code generates text saying it created the file, but nothing exists on disk.
| Flag | Model | Size | RAM needed | Notes |
|---|---|---|---|---|
(default) --gemma-light |
Gemma-4-E4B | ~5GB | 16GB+ | Clean tool calling, verified end-to-end |
--gemma |
Gemma-4-26B-A4B MoE | ~16GB | 24GB+ | Google MoE, 3.8B active params |
--review |
GLM-4.7-Flash | ~17GB | 24GB+ | Stronger reasoning |
--coder |
Qwen3-Coder-30B-A3B | ~18GB | 24GB+ | Heavier code model |
--qwen3 |
Qwen3.5-9B | ~5GB | 16GB+ | General reasoning — leaks plain-text thinking [1] |
--coder7b |
Qwen2.5-Coder-7B | ~5GB | 16GB+ | Code analysis — tool calls unreliable [2] |
--light |
(alias) | Back-compat alias for --gemma-light (v2.0.1 pointed at Qwen3.5-9B) |
||
--model ID |
Any MLX model | varies | varies | Custom HuggingFace model ID (not tested) |
[1] Qwen3.5 is a hybrid-thinking model that ignores enable_thinking=false at
the template level and emits plain-text "Thinking Process:" preamble outside
<think> tags. Known upstream issue; see
vllm-project/vllm#35574
and QwenLM/Qwen3#1625. Use
only if you want general reasoning and tolerate verbose output.
[2] Qwen2.5-Coder-7B hallucinates an XML tool-call format
(<Write path="..." content="..."/>) that no parser handles. Good for
non-agentic code analysis where you feed it whole files, not for Claude
Code's tool loop. Use --gemma-light for tool calling work instead.
cclocal # Interactive menu: pick model, see what's cached, manage cache
cclocal --gemma-light # Direct launch, Gemma-4-E4B (default, clean tool calling)
cclocal --gemma # Direct launch, Gemma-4-26B MoE
cclocal --review # Direct launch, GLM-4.7-Flash
cclocal --coder # Direct launch, Qwen3-Coder-30B-A3B
cclocal --list # List cached models on disk
cclocal --rm # Manage/delete cached models (interactive)
cclocal --server # Start server only, connect Claude Code separately
cclocal -h # Show all options
# Operational flags (combine with any model flag)
cclocal --gemma --out-tokens 16384 # Bigger output budget for large file writes (default 8192)
cclocal --gemma --safe # Force the memory-safeguard menu (raise GPU limit / shrink ctx)
cclocal --gemma --no-mem-check # Skip the GPU-headroom preflight promptRunning cclocal with no arguments opens an interactive menu that shows every
supported model, indicates which are already cached on disk, and lets you pick
one or jump to a cache management screen. Use the model flags to skip the menu
when you already know what you want.
You don't configure these; run.sh applies them. Listed here so the behaviour
isn't a surprise. Full root-cause writeups in Why this is hard
(#16–#18) and the field report.
- Memory preflight. Before serving, it estimates the model footprint vs.
the GPU budget (the
iogpu.wired_limit_mbcap, or ~75% of RAM — not total RAM). If headroom is tight it offers to shrink the server context and/or raise the GPU wired limit viasudofor the session (auto-reverted on exit, never persisted across reboot). Silent when there's ample headroom;--safeforces the menu,--no-mem-checkskips it. See #17. - Output budget.
CLAUDE_CODE_MAX_OUTPUT_TOKENSdefaults to 8192 (raised from a too-small value that silently truncated file writes); override with--out-tokens N. See #18. - No classifier stall. The 8 built-in tools are pre-allowed
(
--allowedTools), so auto mode never makes the slow per-action safety-classifier model call that a serialized local model can't answer in time. Tool set stays scoped; nothing outside it is auto-approved. - Write-in-parts hint. A system-prompt line tells the model to build large files incrementally, pre-empting the truncation cycle.
- Fail-loud truncation notice. The fork is run with
--tool-call-truncation-notice: a tool call still truncated by the cap returns an explicit "write it in smaller parts" message instead of silent text. See #18. - Diagnosable logs.
server.logis rotated toserver.log.1on each launch instead of truncated, so a failed session can be inspected. - Pinned ML runtime.
install.shpulls a fork branch that pinsmlx==0.31.1/mlx-lm==0.31.1(newer versions crash generation from a worker thread). See #18.
cclocal --serverThen connect Claude Code from any terminal:
ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
ANTHROPIC_API_KEY=not-needed \
ANTHROPIC_MODEL=mlx-community/gemma-4-e4b-it-4bit \
claude --strict-mcp-config --mcp-config /path/to/claude-code-local/mcp-local.json \
--tools "Bash,Read,Edit,Write,Glob,Grep,WebSearch,WebFetch"Replace
/path/to/claude-code-localwith wherever you cloned the repo. Or just usecclocal --serverwhich prints the full command for you.
Running Claude Code with a local model isn't just "point it at localhost". There are 15 problems that break the experience. This section documents every one and how run.sh handles it.
📄 For a consolidated field report — every problem tackled, root causes, the fixes/improvements, the honest model-capability limits, and how it scales to larger hardware — see
docs/running-claude-code-on-local-llms.md.
Problem: Ollama's Anthropic API adapter generates text that looks like tool calls but never emits real tool_use content blocks. Claude Code receives plain text, never executes anything. Tested with qwen3.5:9b, qwen3.5:35b-a3b, glm-4.7-flash — all produce fake tool calls.
Solution: Use vllm-mlx. It implements the native Anthropic Messages API with real tool_use / tool_result content blocks.
Problem: Claude Code needs stop_reason: "end_turn" to know the model finished. Backends returning "stop" (OpenAI convention) cause Claude Code to stop looping after the first response — no tool calls, no iteration.
Solution: vllm-mlx's native /v1/messages endpoint returns correct Anthropic stop reasons.
Problem: Qwen 3.x and Gemma 4 models emit thinking/reasoning tokens. Claude Code doesn't expect these — causes garbage output and misparses tool calls.
Solution: run.sh sets VLLM_MLX_ENABLE_THINKING=false on the server, which passes enable_thinking=False to the chat template. This suppresses thinking tokens at the template level for all models.
Problem: Claude Code's attribution header changes every request, invalidating the KV cache. Follow-up responses go from 2s to 30s+.
Solution: CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header (set by run.sh).
Problem: Claude Code calls claude-haiku-4-5-20251001 for background tasks. The local server doesn't recognize it — 404 — hang.
Solution: All model tier env vars (ANTHROPIC_DEFAULT_OPUS_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, CLAUDE_CODE_SUBAGENT_MODEL) are set to the same local model (set by run.sh).
Problem: Claude Code calls /v1/messages/count_tokens. Most local servers don't implement it.
Solution: vllm-mlx supports it. DISABLE_PROMPT_CACHING=1 reduces dependence on it.
Problem: Claude Code fires concurrent requests (main + background + subagents). Two concurrent 24K+ token prompts exceed the Metal GPU buffer limit on 24GB and crash the server.
Solution: Run in single-request mode (no --continuous-batching). Requests serialize instead of competing for Metal memory. Additionally, --kv-cache-quantization halves KV cache memory usage, giving more headroom before OOM.
Problem: Claude Code expects Anthropic SSE events. OpenAI-format streaming shows only the last token.
Solution: vllm-mlx uses native Anthropic SSE streaming.
Problem: Claude Code sends ALL tool definitions in every request. With plugins enabled, that's 200+ tools crammed into the system prompt. Even 30B models choke.
Solution: Two flags strip tools down to essentials:
--strict-mcp-config --mcp-config mcp-local.json # strips all plugin/MCP tools
--tools "Bash,Read,Edit,Write,Glob,Grep,WebSearch,WebFetch" # 8 built-in tools only
Your plugins remain available when running Claude Code normally with the cloud API.
Problem: Your real ANTHROPIC_API_KEY (sk-ant-...) is set in the shell. Claude Code detects it and may send it to the local server.
Solution: env -u ANTHROPIC_API_KEY -u ANTHROPIC_AUTH_TOKEN in run.sh explicitly unsets real keys before setting the dummy one.
Problem: Claude Code tries to check for updates and send telemetry on startup, which can hang or slow down local-only sessions.
Solution: Session env vars:
DISABLE_AUTOUPDATER=1
DISABLE_TELEMETRY=1
DISABLE_ERROR_REPORTING=1
| Model | Size | Free RAM | Status |
|---|---|---|---|
| Gemma-4-E4B | ~5GB | ~19GB | Default — verified tool loop |
| Qwen3.5-9B | ~5GB | ~19GB | Works but leaks plain-text thinking |
| Qwen2.5-Coder-7B | ~5GB | ~19GB | Code analysis only — tool calls unreliable |
| Gemma-4-26B-A4B MoE | ~16GB | ~8GB | Fast inference, tight on 24GB |
| GLM-4.7-Flash | ~16.9GB | ~7GB | Works single-request only |
Problem: Earlier versions of vllm-mlx serve crashed on startup with any model:
TypeError: cannot unpack non-iterable NoneType object
In vllm_mlx/utils/tokenizer.py, the function load_model_with_fallback() was missing a return statement on the success path.
Solution: Fixed upstream and present in our fork. install.sh installs from vitorallo/vllm-mlx@claude-code-local-patches which has the fix on top of a rebased foil-patches-rebased base, plus Gemma 4 channel-token cleanup patches for Claude Code compatibility (asymmetric <|channel>thought...<channel|> handling in both non-streaming and streaming paths).
Problem: Scripts polling for server readiness grep for "ok" but vllm-mlx returns "status":"healthy".
Solution: run.sh greps for "healthy".
Problem: Setting ANTHROPIC_MODEL=default causes 404. vllm-mlx requires the full HuggingFace model ID.
Solution: run.sh passes the full model ID (e.g., mlx-community/gemma-4-e4b-it-4bit).
Problem: On first use of a model, vllm-mlx serve downloads it from HuggingFace
(5–18GB) before the server comes up. The readiness check previously did a blind
fixed-duration poll of /health with no output, so a multi-GB download that
took longer than the timeout looked like a frozen/failed launch — even though
it was downloading fine.
Solution: run.sh now watches the model's HuggingFace cache directory and
prints a live progress line while it grows:
⬇ Downloading model 8.4GB (42.3 MB/s)
⏳ Model cached (16.0GB) — loading into memory... 12s
The timeout is no longer a blind wall: it only aborts if there is no
download progress and the server is not ready for a sustained period
(STALL_LIMIT, 240s). A slow-but-progressing download never false-times-out,
and a partial download is preserved in ~/.cache/huggingface/hub so a restart
resumes rather than starting over. Progress is measured by cache-directory
size (robust) rather than scraping HuggingFace's tqdm bars from the log.
Problem: On a 24GB machine, large models (Gemma-4-26B ~16GB, Coder ~18GB) survive short prompts but the KV cache grows every turn as Claude Code feeds back tool output. Once context passes ~24K tokens the KV cache + model weights exceed the Metal memory budget and MLX throws an uncaught C++ exception:
libc++abi: terminating due to uncaught exception of type std::runtime_error:
[METAL] Command buffer execution failed: Insufficient Memory
(kIOGPUCommandBufferCallbackErrorOutOfMemory)
This kills the entire vllm-mlx process (not a recoverable per-request
error), and the in-progress Claude Code session is left retrying a dead
backend with ConnectionRefused.
Solution: run.sh runs a memory_preflight before starting the server.
The binding constraint is not total RAM — macOS only makes ~75% of RAM
GPU-addressable by default (the iogpu.wired_limit_mb cap). So the preflight
estimates the model footprint (real on-disk size if cached, else the catalog
estimate) against the effective GPU budget: the wired limit if explicitly
set, otherwise ~75% of RAM. If less than ~6GB of GPU headroom would be left
after the weights, it shows an interactive safeguard menu:
⚠ Tight memory ~15GB model, GPU budget ~18GB → ~3GB for KV cache.
Safeguards:
1) Shrink context --max-tokens 32768 → 16384 (recommended)
2) Raise GPU limit iogpu.wired_limit_mb 0 → 21504 (sudo, until reboot)
3) Both
c) Continue as-is (risky) q) Quit
Choose [1/2/3/c/q] (Enter = 1):
(This GPU-budget metric is why a ~15GB cached model with ~9GB of free RAM still OOM-crashed — only ~3GB was actually GPU-usable under the default cap.)
- Option 1 lowers the server-side context window (
--max-tokens), which is the single biggest KV-cache saver. - Option 2 raises the Metal GPU wired-memory limit via
sudo sysctl iogpu.wired_limit_mb=<RAM−3GB>. It is strictly a temporary, per-session bump: the original value is captured and reverted on exit. The revert uses a normalsudo(it will prompt for your password at shutdown if the cached credentials have expired); if you skip the prompt it prints the one-line command to restore manually. It is also not persisted across reboots — macOS resets it to the default on restart. - If the GPU limit is already at/above the recommended value, option 2 is shown as "already fine" instead of being offered.
Models that fit comfortably (e.g. 5GB models with ~13GB+ of GPU headroom)
never trigger the prompt — they launch straight through. To force the
menu for any model regardless of the heuristic, use --safe (or
CCLOCAL_FORCE_MEMCHECK=1). The check is skipped entirely with
--no-mem-check (or CCLOCAL_NO_MEMCHECK=1), auto-applies the recommended
shrink in non-interactive runs, and skips silently if the model size can't be
estimated. Recovery from a crash: exit the dead Claude session, then relaunch
with a smaller model (cclocal --gemma-light) or with --safe.
Symptom: The model "calls" a tool to write a file — Claude Code shows the
tool invocation — then silence. No file written, no error. Short-argument
tools (Bash ls) work; large Write/Edit calls don't. You end up
copy-pasting the file content out by hand.
Cause — three compounding factors, all triggered by large tool
arguments (a Write serializes the entire file body as output tokens
inside the tool-call JSON):
-
Output-token truncation (primary).
gemma-4-26bdid not match theCC_OUTPUT_TOKENSoverride list, so it ran at the 4096 default. A file write blows past that; generation is cut mid-content, the JSON never closes, and no validtool_useblock can be built. HTTP 200, no error — Claude just sees text. -
Fork channel-filter amplifies truncation. The custom fork's
_clean_gemma4_channels(vllm_mlx/api/utils.py) handles a truncated Gemma thought block by deleting everything from an unclosed<|channel>thoughtto end-of-text. A tool call truncated mid-stream leaves exactly such an unclosed opener, so the partial tool call is erased entirely before the parser sees it — turning a possibly recoverable fragment into nothing. -
Fork channel-filter content collision (latent). The same filter is a plain substring/regex pass over the whole accumulated text including the file body. If the file being written itself contains
<|channel>thought/<|channel>(realistic for security-review notes, docs, code about LLMs), a complete tool call can also be destroyed.
The installed engine is confirmed to be the
vitorallo/vllm-mlx fork
(not upstream — verified via the venv's direct_url.json and the fork-only
_clean_gemma4_channels patch), so this is the fork's behaviour, not a wrong
dependency.
Fixed in the fork (vitorallo/vllm-mlx, branch
fix/gemma4-toolcall-safe-and-faildloud):
- Pinned MLX stack.
pyproject.tomlnow pinsmlx==0.31.1/mlx-lm==0.31.1and capsmlx-vlm<0.5.0.mlx 0.31.2breaks GPU streams in worker threads (RuntimeError: There is no Stream(gpu, 1)) andmlx-vlm 0.5.0hard-requires the brokenmlx>=0.31.2; the prior>=floors let a reinstall pull the broken stack. Revisit if upstream mlx-lm fixes the thread bug. - D1 & D2 —
_clean_gemma4_channelsis now tool-call-span-safe. Channel stripping is applied only outside<|tool_call>…<tool_call|>spans (spans kept verbatim, reusing the engine's own_TOOL_CALL_TAGS). A tool call after a truncated/unclosed thought is no longer deleted, and a channel marker inside a file body no longer corrupts the call. With no tool-call markers present, behaviour is byte-for-byte identical to before (Gemma-only, no impact on other consumers). Covered bytests/test_gemma4_toolcall_safety.py. - Fail-loud
--tool-call-truncation-notice(opt-in, default OFF). When a tool call is still truncated by the token cap (JSON never closes), the server returns an explicit "write the file in smaller parts" message instead of silent HTTP-200 text. Model-agnostic; default-off and condition-specific so non-tool / non-truncated paths are unchanged.run.shenables it on thevllm-mlx serveline.
Mitigations (in run.sh):
- Proactive guidance:
run.shpasses--append-system-prompttelling the model up front to write files >~150 lines in sections, so it pre-empts truncation rather than hitting it. CC_OUTPUT_TOKENSdefault raised 4096 → 8192 for all models, with per-run override--out-tokens N(use16384for big files; pair with--safeif a large model then OOMs).server.logis now rotated toserver.log.1instead of truncated, so a failed tool-call session can actually be inspected afterward.
Honest limits: this combination removes the silent failure and the
destroyed-call bugs, but a single artifact larger than the output-token
budget still can't be emitted in one call on a black-box local engine — the
proactive + fail-loud guidance steers the agent to chunk it. For heavy
file-writing, gemma-light or a coder model remains faster/more reliable.
The fix lands in users' venvs once the branch is merged to
claude-code-local-patches (or install.sh is repointed at the branch).
| Symptom | Cause | Fix |
|---|---|---|
| vllm-mlx crashes on startup (TypeError: NoneType) | Using unpatched upstream | ./install.sh installs from our fork which has the fix |
| Model generates text about tools but nothing executes | Using Ollama | Switch to vllm-mlx — Ollama can't produce real tool_use blocks |
Metal GPU OOM crash under load (kIOGPUCommandBufferCallbackErrorOutOfMemory) |
Large model + growing agentic context exceeds RAM | Take the memory_preflight prompt (shrink context / raise GPU limit), or use a 5GB model — see #17 |
| First run hangs at "Waiting for server..." | Multi-GB model still downloading from HuggingFace | It's not hung — a live download progress line now shows; partial downloads resume — see #16 |
| Write/Edit tool call shows then silently does nothing (no error) | Large tool-call output truncated by the token cap (+ fork channel-filter) | --out-tokens 16384, or use gemma-light/a coder model; inspect server.log.1 — see #18 |
| Claude Code asks about "detected custom API key" | Real API key leaking | Use cclocal which unsets real keys |
| "Model does not exist" (404) | Wrong model name | Must use full HuggingFace ID, not "default" |
| Slow responses (30-60s) | Normal for local inference | Context grows each turn — 24K+ tokens at ~8 tok/s |
| Variable | Value | Purpose |
|---|---|---|
ANTHROPIC_BASE_URL |
http://127.0.0.1:8000 |
Point Claude Code at local server |
ANTHROPIC_API_KEY |
not-needed |
Dummy key (real key explicitly unset) |
ANTHROPIC_MODEL |
Full HuggingFace ID | Model identifier |
ANTHROPIC_DEFAULT_*_MODEL |
Same as above | Route all tiers (Opus/Sonnet/Haiku) locally |
CLAUDE_CODE_SUBAGENT_MODEL |
Same as above | Route subagent calls locally |
CLAUDE_CODE_MAX_OUTPUT_TOKENS |
8192 default, --out-tokens N to override |
Output cap; must fit a whole Write/Edit file body (see #18) |
CLAUDE_CODE_ATTRIBUTION_HEADER |
0 |
Prevents KV cache invalidation |
DISABLE_PROMPT_CACHING |
1 |
Local server doesn't support Anthropic caching |
DISABLE_AUTOUPDATER |
1 |
No update checks |
DISABLE_TELEMETRY |
1 |
No telemetry |
DISABLE_ERROR_REPORTING |
1 |
No error reporting |
DISABLE_NON_ESSENTIAL_MODEL_CALLS |
1 |
Reduce background model calls |
| Flag | Purpose |
|---|---|
VLLM_MLX_ENABLE_THINKING=false |
Disable thinking/reasoning tokens |
--kv-cache-quantization |
8-bit KV cache — halves cache memory usage |
--cache-memory-percent 0.35 |
35% of RAM for cache (~8.4GB on 24GB) |
--prefill-step-size 4096 |
Faster time-to-first-token on large prompts |
--stream-interval 4 |
Batch 4 tokens before streaming for throughput |
--timeout 600 |
10 min timeout (default 300s caused disconnects) |
--max-tokens |
Server context window: 32768, or 16384 if the memory preflight shrinks it (see #17) |
--enable-auto-tool-choice --tool-call-parser auto |
Parse model output into structured tool_use blocks (Gemma 4 / Qwen / Mistral / Llama / Nemotron) |
--tool-call-truncation-notice |
Fork flag: on a tool call truncated by --max-tokens, return an explicit "write it in smaller parts" message instead of silent text (see #18) |
| Flag / env | Purpose |
|---|---|
--safe / CCLOCAL_FORCE_MEMCHECK=1 |
Always show the memory-safeguard menu, for any model (see #17) |
--no-mem-check / CCLOCAL_NO_MEMCHECK=1 |
Skip the GPU-headroom preflight prompt (see #17) |
--out-tokens N |
Max output tokens Claude Code requests (default 8192; raise to 16384 for large file writes — see #18) |
iogpu.wired_limit_mb |
Optionally raised via sudo sysctl by preflight option 2; per-session only — reverted on exit (prompts for sudo at shutdown if creds expired), and resets on reboot |
| Flag | Purpose |
|---|---|
--strict-mcp-config |
Ignore global plugins |
--mcp-config mcp-local.json |
Empty config — no plugin tools |
--tools "Bash,Read,..." |
8 essential built-in tools only |
--allowedTools "Bash,Read,..." |
Pre-approve the same 8 tools so auto mode skips the slow per-action safety-classifier call (see #18) |
--append-system-prompt "..." |
Tells the model to build files >~150 lines in incremental Write/Edit calls, pre-empting output-token truncation (see #18) |
claude-code-local/
run.sh # Launcher — starts vllm-mlx + Claude Code
install.sh # Setup — creates .venv, installs vllm-mlx, patches bugs, creates cclocal
mcp-local.json # Empty MCP config (strips plugins for local sessions)
.venv/ # Local Python venv with vllm-mlx (created by install.sh)
.gitignore
README.md
- vllm-mlx — Anthropic-compatible MLX inference server
- Claude Code — Anthropic's CLI for Claude
- Why Claude Code Fails with Local LLMs — Detailed failure analysis
- Claude Code tool flooding issue — 259 tools sent to local models
- Ollama Anthropic Compatibility — Confirmed broken for tool_use
This project would not exist without vllm-mlx by Wayner Barrios — the native Apple Silicon MLX backend that makes real Anthropic tool-use blocks possible on local hardware. If you use vLLM-MLX in your research or project, please cite:
@software{vllm_mlx2025,
author = {Barrios, Wayner},
title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
year = {2025},
url = {https://github.com/waybarrios/vllm-mlx},
note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}