Local LLM inference on Intel Arc A770 16GB using llama.cpp with SYCL backend.
- OpenAI-compatible API (`/v1/chat/completions`, `/v1/embeddings`)
- Native tool/function calling for agentic workflows
- Dedicated embedding server (optional, separate llama.cpp instance)
- MCP web search server (optional, no API keys)
- Speech-to-text via whisper.cpp (optional, multilingual)
- SYCL flash attention + fused Gated Delta Net for Qwen3.5
- Runs as systemd user services with auto-restart
- Persistent services survive logout (with lingering enabled)
- GPU: Intel Arc A770 16GB (also works on A750, B580)
- OS: Ubuntu 22.04/24.04 or Debian 12
- Kernel: 6.2+
- Disk: ~30GB free (oneAPI + llama.cpp + model)
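Before installing, a quick sanity check of these prerequisites can save a failed run. A minimal sketch using standard Linux tooling (nothing repo-specific):

```shell
# Check kernel version (want 6.2+), free disk, and whether a GPU render node exists
uname -r
df -h "$HOME" | tail -1
ls /dev/dri 2>/dev/null || echo "no /dev/dri - GPU driver not loaded yet?"
```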
```shell
git clone <repo-url> ~/intel-gpu-inference
cd ~/intel-gpu-inference
git submodule update --init --recursive

# Install everything (drivers, oneAPI, llama.cpp build, systemd service)
./scripts/install.sh

# Or with optional services
./install.sh --with-mcp        # + MCP web search
./install.sh --with-whisper    # + speech recognition
./install.sh --with-embedding  # + embedding server
./install.sh --all             # + everything
```

`install.sh` handles Intel GPU drivers, the oneAPI toolkit, the llama.cpp SYCL build, environment config, and systemd service setup. Log out and back in if prompted about group changes.
By default, systemd user services run only while the user is logged in. To start services at boot and keep them running after logout:

```shell
sudo loginctl enable-linger $USER
```

We use Unsloth GGUF quantizations: they work well with llama.cpp thanks to Dynamic 2.0 quants, which upcast important layers.
Prefer Q8_0 when the model fits in VRAM; fall back to Q4_0 for larger models. Legacy quants (Q4_0, Q8_0) are significantly faster than K-quants on Intel GPUs due to optimized SYCL MUL_MAT kernels.
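As a rough rule of thumb (the per-weight factors below are approximations, not measured numbers): Q8_0 stores about 8.5 bits per weight and Q4_0 about 4.5, so an 8B model lands near 8.5 GB and 4.5 GB respectively, before KV cache:

```shell
# Back-of-envelope VRAM estimate; the byte-per-weight factors are rule-of-thumb assumptions
python3 - <<'EOF'
params_b = 8           # model size in billions of parameters
q8 = params_b * 1.06   # ~8.5 bits/weight incl. block scales
q4 = params_b * 0.56   # ~4.5 bits/weight incl. block scales
print(f"Q8_0 ~ {q8:.1f} GB, Q4_0 ~ {q4:.1f} GB (A770 has 16 GB)")
EOF
```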
| Path | Description |
|---|---|
| `~/models/` | GGUF model files |
| `~/intel-gpu-inference/llama.cpp/` | llama.cpp source and SYCL build |
| `~/intel-gpu-inference/open-websearch/` | MCP web search server (if installed) |
| `~/intel-gpu-inference/whisper.cpp/` | whisper.cpp source and SYCL build (if installed) |
| `~/.config/intel-gpu-inference/env` | Environment config (all services) |
| `~/.config/systemd/user/` | Installed systemd unit files |
OpenAI-compatible API serving GGUF models on the Intel Arc GPU.
```shell
# Manual
./scripts/run.sh                      # default model from env config
./scripts/run.sh /path/to/model.gguf  # specific model
./scripts/run.sh --ctx 4096           # override context size

# systemd
systemctl --user status llama-server
systemctl --user restart llama-server
journalctl --user -u llama-server -f
```

API: `http://<host>:8080/v1`
Test: `./scripts/test.sh`
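Besides `test.sh`, the endpoint can be exercised directly. A sketch assuming the server is reachable on localhost at the default port 8080 (the curl call itself is commented out because it needs the running service):

```shell
# Minimal OpenAI-style chat request body
payload='{"messages":[{"role":"user","content":"Reply with one word."}],"max_tokens":8}'

# Validate the JSON locally before sending
echo "$payload" | python3 -m json.tool >/dev/null && echo "payload: valid JSON"

# Send it once llama-server is up:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```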
Multi-engine web search via MCP protocol. No API keys required. Supports DuckDuckGo, Bing, Brave, and others.
```shell
# Install
./scripts/install-mcp.sh

# Manual
./scripts/run-mcp.sh

# systemd
systemctl --user status open-websearch
systemctl --user restart open-websearch
journalctl --user -u open-websearch -f
```

Endpoints: `http://<host>:3000/sse` (SSE) | `http://<host>:3000/mcp` (streamableHttp)
Test: `./scripts/test-mcp.sh`
MCP client config:
```json
{
  "mcpServers": {
    "web-search": {
      "transport": { "type": "sse", "url": "http://<host>:3000/sse" }
    }
  }
}
```

Tools: `search_web`, `fetchArticle`, `fetchGithubReadme`
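Clients that speak streamableHttp instead of SSE can point at the `/mcp` endpoint. A sketch of the same config with the transport swapped (exact key names vary by MCP client, so check your client's docs):

```json
{
  "mcpServers": {
    "web-search": {
      "transport": { "type": "streamableHttp", "url": "http://<host>:3000/mcp" }
    }
  }
}
```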
A separate llama.cpp instance running in embedding-only mode on port 8085. Keeps embedding workloads isolated from the main inference server.
```shell
# Install (appends config to env, installs systemd service)
./scripts/install-embedding.sh

# Manual
./scripts/run-embedding.sh
./scripts/run-embedding.sh /path/to/model.gguf

# systemd
systemctl --user status embedding-server
systemctl --user restart embedding-server
journalctl --user -u embedding-server -f
```

API: `POST http://<host>:8085/v1/embeddings` (OpenAI-compatible)
Test: `./scripts/test-embedding.sh`
Model: set `EMBEDDING_MODEL` in `~/.config/intel-gpu-inference/env`
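Once vectors come back from `/v1/embeddings`, a typical use is ranking texts by cosine similarity. An offline sketch with toy 3-dimensional vectors standing in for real embeddings:

```shell
python3 - <<'EOF'
import math

a = [0.1, 0.3, 0.5]   # placeholder: embedding of text A from /v1/embeddings
b = [0.2, 0.1, 0.4]   # placeholder: embedding of text B

dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
print(f"cosine similarity: {dot / norm:.3f}")   # -> cosine similarity: 0.922
EOF
```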
Multilingual speech-to-text via whisper.cpp with SYCL GPU acceleration. Supports Arabic, English, French, Chinese, and 90+ languages.
```shell
# Install
./scripts/install-whisper.sh

# Manual
./scripts/run-whisper.sh
./scripts/run-whisper.sh /path/to/model.bin

# systemd
systemctl --user status whisper-server
systemctl --user restart whisper-server
journalctl --user -u whisper-server -f
```

API: `POST http://<host>:9090/inference` (multipart/form-data with audio file)
Test: `./scripts/test-whisper.sh`
Model: `ggml-large-v3.bin` (~3.9GB VRAM)
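whisper.cpp's server works best with 16 kHz mono WAV input. A sketch that generates a silent one-second sample in that format, plus the (commented) upload call for when the service is running; the form field names follow whisper.cpp's server example, so verify them against your build:

```shell
# Create a 1-second, 16 kHz, mono, 16-bit WAV of silence
python3 - <<'EOF'
import struct, wave
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 0) * 16000)
print("wrote sample.wav: 16 kHz mono, 1 s")
EOF

# Upload once whisper-server is up:
# curl -s http://localhost:9090/inference -F file=@sample.wav -F language=auto
```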
All services read `~/.config/intel-gpu-inference/env`. Edit it, then restart the relevant service.
| Variable | Default | Description |
|---|---|---|
| `DEFAULT_MODEL` | `~/models/Qwen3VL-8B-Instruct-Q8_0.gguf` | Active model path |
| `MMPROJ_PATH` | `~/models/mmproj-Qwen3VL-8B-Instruct-F16.gguf` | Vision projector (blank to disable) |
| `LLAMA_HOST` | `0.0.0.0` | Server bind address |
| `LLAMA_PORT` | `8080` | Server port |
| `DEFAULT_SEARCH_ENGINE` | `duckduckgo` | MCP search engine |
| `PORT` | `3000` | MCP server port |
| `EMBEDDING_MODEL` | (required) | Embedding model path (`.gguf`) |
| `EMBEDDING_HOST` | `0.0.0.0` | Embedding server bind address |
| `EMBEDDING_PORT` | `8085` | Embedding server port |
| `EMBEDDING_CTX` | `8192` | Embedding context size |
| `WHISPER_MODEL` | `~/models/ggml-large-v3.bin` | Whisper model path |
| `WHISPER_HOST` | `0.0.0.0` | Whisper server bind address |
| `WHISPER_PORT` | `9090` | Whisper server port |
| `WHISPER_LANGUAGE` | `auto` | Language: `auto`, `en`, `ar`, `fr`, `zh` |
| `LLAMA_ARG_WEBUI_MCP_PROXY` | `false` | Enable MCP proxy in web UI (set `true` for web UI MCP) |
| `UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS` | `1` | Allow >4GB VRAM allocations |
| `ONEAPI_DEVICE_SELECTOR` | `auto` | GPU selection (set if iGPU conflict) |
```
intel-gpu-inference/
├── install.sh                        # Top-level installer (service + optional MCP/whisper/embedding)
├── llama-server.service.template     # systemd unit template
├── open-websearch.service.template   # systemd unit template (MCP)
├── whisper-server.service.template   # systemd unit template (whisper)
├── embedding-server.service.template # systemd unit template (embedding)
├── scripts/
│   ├── install.sh                    # Full build installer (drivers, oneAPI, llama.cpp)
│   ├── install-mcp.sh                # MCP web search installer
│   ├── install-whisper.sh            # whisper.cpp speech recognition installer
│   ├── install-embedding.sh          # embedding server installer
│   ├── run.sh                        # llama-server launcher
│   ├── run-mcp.sh                    # MCP web search launcher
│   ├── run-whisper.sh                # whisper-server launcher
│   ├── run-embedding.sh              # embedding-server launcher
│   ├── test.sh                       # LLM API test suite
│   ├── test-mcp.sh                   # MCP server test suite
│   ├── test-whisper.sh               # whisper server test suite
│   └── test-embedding.sh             # embedding server test suite
├── configs/
│   ├── llama-server.env.template     # Environment template
│   ├── open-websearch.env.template   # MCP environment template
│   ├── whisper-server.env.template   # Whisper environment template
│   └── embedding-server.env.template # Embedding environment template
├── docs/
│   ├── research.md                   # Evaluation of Intel GPU inference options
│   └── models.md                     # Model recommendations for 16GB VRAM
├── llama.cpp/                        # Submodule: source and SYCL build
├── open-websearch/                   # Submodule: MCP web search server
└── whisper.cpp/                      # Submodule: speech recognition server
```
- Unsloth GGUF Models
- Unsloth Dynamic 2.0 Quants
- docs/models.md — VRAM-tested recommendations for Arc A770
- whisper.cpp — C/C++ port of OpenAI Whisper
- Whisper Models — Pre-converted GGML models
- Server API — HTTP inference endpoint
- Model Context Protocol
- open-websearch — Multi-engine search MCP server