Reference for distributing model layers across GPUs and CPU.
Status update (2026-02-23): The recommended approach changed from explicit
-ot layer assignments to --fit with --n-gpu-layers auto. FIT automatically distributes layers across CUDA0, CUDA1, and CPU, including MoE expert offload. The old -ot approach is documented below for reference but is no longer used in models.conf. See issue #19816 for the discovery that motivated this change and docs/lessons_learned.md lesson #7 for the root cause (a hardcoded N_GPU_LAYERS=99 prevented FIT from working).
Large language models often don't fit on a single GPU. Getting them to run well involves three steps, in this order:
1. Make it fit. Not every model fits on one GPU. You may need to split across multiple GPUs, offload parts to CPU, or choose a smaller quantization. These are trade-offs — understand what you're giving up.
2. Make it fit your needs. Context window size has a big impact on VRAM usage. For benchmarking (short prompts), 10K context is plenty and frees VRAM for more model layers on GPU. For chat or assistant use, you want as much context as possible (64K-256K). Match the configuration to the actual use case.
3. Optimize for speed. Once it fits, tune the layer distribution for maximum performance. This depends on whether the model is dense or MoE, and for MoE models it can get complex (see graph splits). This guide and the gpu-optimizer agent help find the best balance for each case.
| Device | VRAM | Role |
|---|---|---|
| CUDA0 — RTX 4090 | 24 GB | Primary compute, fastest. Nothing else runs here. |
| CUDA1 — RTX 5070 Ti | 16 GB (~12.5 GB usable) | Secondary. Runs display/OS. |
| CPU | 64 GB DDR4 | Slowest. Large capacity for expert offload. |
Priority: Always fill CUDA0 first → CUDA1 → CPU.
Read the model card in models/documentation/ before any GPU decisions. Check:
- Dense or MoE? If MoE: expert count, active per token, shared experts.
- Total vs active parameters. Number of layers.
- Special features (SWA, DeltaNet, hybrid).
Never assume architecture from model name or file size.
Check actual file size (ls -lh), then add:
- Model weights: file size minus ~200 MiB (metadata/embeddings)
- KV cache: depends on context, KV layer count, cache type (q8_0)
- Compute buffers: depends on -b/-ub and context size
- CUDA overhead: ~500 MiB per active GPU
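The checklist above can be sketched as a back-of-envelope calculation. All the specific numbers below (file size, KV bytes per token per layer, buffer size) are illustrative assumptions for a hypothetical Q4 model, not measurements:

```python
# Back-of-envelope VRAM estimate following the checklist above.
# Figures are illustrative assumptions, not measured values.

GIB = 1024**3
MIB = 1024**2

def estimate_vram_bytes(file_size, ctx_tokens, kv_layers,
                        kv_bytes_per_tok_layer, compute_buf, n_gpus):
    weights = file_size - 200 * MIB               # weights ≈ file size minus metadata
    kv_cache = ctx_tokens * kv_layers * kv_bytes_per_tok_layer
    cuda_overhead = 500 * MIB * n_gpus            # ~500 MiB per active GPU
    return weights + kv_cache + compute_buf + cuda_overhead

# Hypothetical example: 18 GiB file, 10K context, 47 KV layers,
# ~2 KiB per token per layer at q8_0, 448 MiB compute buffer, 1 GPU.
total = estimate_vram_bytes(18 * GIB, 10_000, 47, 2048, 448 * MIB, 1)
print(f"{total / GIB:.1f} GiB")  # → 19.6 GiB
```

A result like this (vs the 4090's 24 GB) is what the decision tree below consumes.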
Does the model fit on CUDA0 alone?
├── YES → Strategy A (single GPU, fastest)
│ EXTRA_ARGS: --split-mode none --main-gpu 0
│
└── NO → Use FIT auto (current standard approach)
FIT automatically distributes across CUDA0 + CUDA1 + CPU.
No -ot, no N_GPU_LAYERS=99, no FIT=off needed.
The core principle: keep everything on GPU when possible. GPU memory bandwidth (~1 TB/s) is 30x faster than PCIe to CPU (~32 GB/s). Only offload to CPU when GPU VRAM is genuinely insufficient. FIT applies this principle automatically.
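To put the bandwidth gap in perspective, a quick arithmetic sketch. The active-weight figure is an illustrative assumption, and this is a pure bandwidth-ratio argument, not a model of how llama.cpp actually executes offloaded layers:

```python
# Illustrative only: the cost of reading active weights over PCIe
# versus from VRAM, per generated token.
active_bytes = 1.7e9     # assumed active weights read per token (~3B active params at Q4)
vram_bw = 1000e9         # ~1 TB/s GPU memory bandwidth
pcie_bw = 32e9           # ~32 GB/s PCIe to CPU

t_vram = active_bytes / vram_bw  # seconds/token with weights in VRAM
t_pcie = active_bytes / pcie_bw  # seconds/token if streamed over PCIe

print(f"VRAM: {t_vram*1e3:.1f} ms/token, PCIe: {t_pcie*1e3:.1f} ms/token")
# → VRAM: 1.7 ms/token, PCIe: 53.1 ms/token
```

The ~31x gap in per-token weight-read time is why offloading anything to CPU should be a last resort.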
Current approach (2026-02-23 onward): use --fit for all profiles. FIT is
on by default in docker-compose.yml and handles GPU/CPU distribution automatically.
--n-gpu-layers auto (also default) lets FIT decide how many layers go to GPU.
The old -ot explicit placement approach required FIT=off and N_GPU_LAYERS=99.
It was replaced after discovering that N_GPU_LAYERS=99 prevents FIT from working
(see issue #19816). The old strategies (C and D) are documented below for reference
and historical context, but are not used in current models.conf profiles.
All weights on the 4090. No inter-device transfers, no graph split overhead.
FIT is on (default), but --split-mode none prevents distribution to CUDA1.
EXTRA_ARGS=... --split-mode none --main-gpu 0
--split-mode none in EXTRA_ARGS overrides the docker-compose default
--split-mode layer (last flag wins).
When: total VRAM footprint < ~23 GB. Example: GLM-4.7-Flash Q4_K_M at 10K context (~17.5 GB).
FIT distributes layers across CUDA0, CUDA1, and CPU based on available VRAM. For MoE models where total weights exceed GPU VRAM, FIT automatically offloads expert tensors to CPU while keeping attention layers on GPU.
# No special flags needed — FIT=on and --n-gpu-layers auto are defaults
EXTRA_ARGS=... --jinja -np 1 <sampler flags>
When: any model that needs more than one device (dense or MoE).
Result (Qwen3-Next 262K): ~33 t/s, 55 graph splits, CUDA0 ~20 GB,
CUDA1 ~8 GB, CPU ~53 GB experts. Outperforms the old manual -ot approach
(26.5 t/s, 136 graph splits) for the same model.
Important for asymmetric GPU setups: Use FIT_TARGET to tune the VRAM
headroom margin per device. The default FIT_TARGET is a single value applied
to all devices equally. On this hardware (CUDA0 dedicated, CUDA1 shares with
OS/display), FIT_TARGET=128,1024 is set as the default in docker-compose.yml:
- 128 MiB headroom for CUDA0 (RTX 4090) — dedicated GPU, nothing else running
- 1024 MiB headroom for CUDA1 (RTX 5070 Ti) — shares ~3 GB with OS/display
This per-device setting allows FIT to use more of the dedicated GPU's VRAM. Without it, the uniform default was too conservative for CUDA0, leaving VRAM unused and pushing layers to CPU unnecessarily. Example impact on GLM-4.7 Flash Q8:
- Without tuned FIT_TARGET: ~105 t/s, 33 graph splits
- With FIT_TARGET=128,1024: ~112 t/s, 5 graph splits
These approaches are documented below for reference only. They required FIT=off
and explicit -ot regex rules. They are no longer used because FIT auto produces
equal or better results without the complexity and without the N_GPU_LAYERS=99 bug.
Strategy B — Dense model across both GPUs:
# Historical — not used in current profiles
EXTRA_ARGS=... --tensor-split 3,1 # 75% CUDA0, 25% CUDA1
Strategy C — MoE model across both GPUs (all experts on GPU):
# Historical — not used in current profiles
EXTRA_ARGS=... -ot blk\.RANGE0\.=CUDA0,blk\.RANGE1\.=CUDA1
Strategy D — MoE model with CPU expert offload:
# Historical — not used in current profiles
EXTRA_ARGS=... -ot blk\.RANGE0\.=CUDA0,blk\.RANGE1\.=CUDA1,exps=CPU
How -ot priority works (for reference): rules are evaluated left to right,
first match wins. Layers matching the CUDA0/CUDA1 rules kept ALL tensors (attention
+ experts) on GPU. For remaining layers, exps=CPU offloaded expert weights while -ngl 99 kept attention on GPU.
Why experts specifically? Experts are used partially per token (e.g., 4/64 = 6% for GLM). Attention is used every token. So when you must offload something, experts cost the least performance. FIT applies the same logic automatically.
Layer-split mode is sequential, not parallel: CUDA0 computes its layers → transfers the result to CUDA1 → CUDA1 computes its layers.
total_time = CUDA0_time + transfer_time + CUDA1_time
This means more layers on the faster GPU genuinely helps — the 4090 processes each layer faster, so giving it more work reduces total time.
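A toy model of this sequential pipeline makes the effect concrete. The per-layer times and transfer cost below are assumptions for illustration, not measurements from this hardware:

```python
# Toy model of layer-split execution: devices run sequentially,
# so total time is the sum of per-device work plus the transfer.
def total_time_ms(layers_fast, layers_slow,
                  t_fast=1.0, t_slow=1.6, transfer=2.0):
    """Assumed costs: 1.0 ms/layer on the fast GPU, 1.6 ms/layer on the
    slow GPU, 2.0 ms boundary transfer. Illustrative numbers only."""
    return layers_fast * t_fast + transfer + layers_slow * t_slow

# 47 layers total: shifting layers onto the faster GPU lowers total time.
print(f"{total_time_ms(35, 12):.1f} ms")  # → 56.2 ms
print(f"{total_time_ms(30, 17):.1f} ms")  # → 59.2 ms
```

Under these assumed costs, every layer moved from the slow GPU to the fast one saves the difference in per-layer time.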
A graph split is a contiguous chunk of computation that runs on one device. The scheduler creates a new split every time it encounters an operation on a different device. Each split boundary costs time (data copy + synchronization).
Check sched_reserve: graph splits = N in startup logs. Lower is better.
Dense models are predictable: a 2-GPU split creates ~2-3 graph splits total. Moving the split point doesn't change the count.
MoE models are unpredictable: each layer has complex operations (attention → router → expert dispatch → compute → combine → shared experts). Moving the GPU boundary by even 1-2 layers can significantly change the split count because the scheduler maps operations to backends differently depending on where the cut falls. There is no formula for this — you have to test and check the logs.
The table below shows historical manual -ot split data (Strategy C, now superseded)
alongside the current FIT auto result for comparison.
| Configuration | Graph splits | Speed | Notes |
|---|---|---|---|
| Manual -ot: 35 CUDA0 + 12 CUDA1 | 33 | ~105 t/s | Former sweet spot (Strategy C) |
| Manual -ot: 37 CUDA0 + 10 CUDA1 | 53 | ~102 t/s | Slower — extra splits outweigh GPU benefit |
| FIT auto, default FIT_TARGET | 33 | ~105 t/s | FIT matched manual result |
| FIT auto, FIT_TARGET=128,1024 | 5 | ~112 t/s | Current default — fewer splits, faster |
Note: With --fit (current default), the split is handled automatically.
The primary tuning knob for FIT is FIT_TARGET — set per-device headroom in
docker-compose.yml to match your hardware's actual available VRAM. See the
Strategy FIT section above for the asymmetric GPU example. The guidance below
applies if you are tuning a manual -ot split for historical reference.
Goal: fewest graph splits + most layers on fastest GPU.
These two goals can conflict (as the example shows), so:
- Start with a reasonable split based on VRAM math
- Test and note the graph split count from startup logs
- Try ±1-2 layers, compare graph splits and speed
- Pick the split with the lowest graph split count that fits
- Among equal split counts, prefer more layers on the faster GPU
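The selection rule in the last two steps can be sketched as a sort key: fewest graph splits first, then most layers on the faster GPU as the tie-breaker. The candidate data below is hypothetical; the real numbers come from your own startup logs:

```python
# Hypothetical measurements: (layers_on_cuda0, graph_splits, tokens_per_s).
# In practice these come from testing each split and reading the logs.
candidates = [
    (35, 33, 105.0),
    (36, 33, 104.0),
    (37, 53, 102.0),
]

# Fewest graph splits wins; among ties, prefer more CUDA0 layers.
best = min(candidates, key=lambda c: (c[1], -c[0]))
print(best)  # → (36, 33, 104.0)
```

Note the rule is deliberately blind to raw t/s: the split count is the more stable signal across runs, and equal-split candidates differ mainly in how much work the faster GPU gets.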
For MoE models, certain split points produce cleaner boundaries than others.
This depends on the specific model architecture and is not predictable in advance.
FIT with auto placement produced 55 graph splits on Qwen3-Next vs 136 with the
manual -ot configuration — a significant improvement.
| Model | Type | Params | Active/token | Experts | Layers | Files | Status |
|---|---|---|---|---|---|---|---|
| GLM-4.7-Flash | MoE | 30B | 3B | 64/layer, 4+1 shared | 47 (1 dense + 46 MoE) | Q4: 18 GB, Q8: 30 GB | active |
| Qwen3.5-35B-A3B | MoE | 35B | 3B | 256/layer, 8 routed + 1 shared | 40 (75% DeltaNet) | Q6: 29 GiB | active |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256/layer, 8 routed + 1 shared | 48 (75% DeltaNet) | Q4: 65 GiB | active |
| Qwen3.5-27B | Dense | 27B | 27B | none (FFN, not MoE) | 64 (75% DeltaNet) | Q6: 22 GiB, Q8: 31 GiB | active (Q6) |
| GPT-OSS 120B | MoE | 116.8B | 5.1B | 128/layer, 4 active | 36 (18 SWA) | F16: 61 GB | retired 2026-02-26 |
| Qwen3-Coder-Next | MoE | 80B | 3B | 512/layer, 10 active + 1 shared | 48 (75% DeltaNet) | Q5: 57 GB, Q6: 64 GB | retired 2026-02-26 |
| Qwen3-Next-80B-A3B | MoE | 80B | 3B | 512/layer, 10 active + 1 shared | 48 (75% DeltaNet) | Q5: 53 GB | retired 2026-02-26 |
-b (batch) and -ub (micro-batch) are independent parameters:
- -b (logical batch): How many tokens are scheduled per prompt processing step. Affects prompt ingestion speed. Has minimal direct VRAM impact.
- -ub (micro-batch / physical batch): How many tokens the GPU computes at once within a batch. This determines the compute buffer size in VRAM.
These can be set independently. -b 2048 -ub 512 gives fast prompt processing
(2048 tokens per step) with a small compute buffer (sized for 512 tokens).
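How the two parameters interact can be sketched as simple counting (a sketch of the scheduling described above, not llama.cpp internals):

```python
import math

# With -b 2048 / -ub 512, each logical batch of up to b tokens is
# computed in ceil(b / ub) micro-batches; the compute buffer is
# sized for ub tokens, not b.
def micro_batches(prompt_tokens, b=2048, ub=512):
    steps = math.ceil(prompt_tokens / b)              # logical batches scheduled
    per_full_step = math.ceil(min(prompt_tokens, b) / ub)  # micro-batches per full step
    return steps, per_full_step

print(micro_batches(8192))  # → (4, 4): 4 logical steps, 4 micro-batches each
```

So raising -b changes how many scheduling steps a long prompt takes, while -ub alone determines the VRAM-resident compute buffer.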
Defaults (llama.cpp server and Ollama):
- -b 2048 (llama.cpp server default; Ollama uses 512-1024)
- -ub 512 (universal default in both llama.cpp and Ollama)
Rule of thumb: Always use -ub 512. There is no meaningful performance
penalty — the same work is done in more micro-batches, but the speed difference
is negligible for interactive use. The VRAM savings are significant:
| -ub value | Compute buffer (typical) | VRAM vs -ub 512 |
|---|---|---|
| 512 | ~448 MiB | baseline |
| 1024 | ~897 MiB | +449 MiB wasted |
| 2048 | ~1,500-2,400 MiB | +1,000-2,000 MiB wasted |
Measured on GLM-4.7-Flash Q8_0. Exact sizes vary by model hidden dimension.
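The table's scaling is roughly linear in -ub, which is a useful sanity check when planning VRAM. The linear model below is an approximation anchored at the measured ~448 MiB baseline from the table:

```python
# Approximate compute buffer size as linear in -ub, anchored at the
# measured ~448 MiB for -ub 512 (GLM-4.7-Flash Q8_0, from the table).
BASELINE_UB, BASELINE_MIB = 512, 448

def approx_compute_buffer_mib(ub):
    return BASELINE_MIB * ub / BASELINE_UB

print(approx_compute_buffer_mib(1024))  # → 896.0 (table measured ~897 MiB)
```

The -ub 2048 row deviates from pure linearity (~1,500-2,400 MiB measured), so treat the estimate as a lower bound at large micro-batch sizes.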
Production recommendation: -b 2048 -ub 512 for all profiles. Omitting
-ub is fine since 512 is already the default, but explicit is clearer.
Benchmark recommendation: -b 512 -ub 512 — HumanEval prompts are ~400
tokens, so the full prompt fits in one batch. No need for a larger -b.
When to increase -b: Only if you routinely paste very large documents
(50K+ tokens) in a single message and the prompt ingestion wait is noticeable.
Going from -b 2048 to -b 4096 processes prompts in fewer chunks. The VRAM
impact of -b alone is minimal — it only controls scheduling, not GPU buffers.
Not like embedding chunking: In RAG/embedding pipelines, document chunks are
processed independently, so you need overlap to preserve context at boundaries.
Prompt batching in llama.cpp is different — chunks are processed sequentially
into the same KV cache. After chunk 1 is processed, its full attention state is
stored. Chunk 2 attends to all previous tokens via the KV cache. No information
is lost at chunk boundaries, no overlap is needed. -b is purely a performance
knob — the end result is identical regardless of chunk size.
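A toy analogy for why no overlap is needed (purely illustrative: a real KV cache stores per-token attention state, not raw tokens, but the accumulation pattern is the same):

```python
# Toy illustration: chunks appended sequentially into one shared cache
# reproduce the same final state as processing everything in one pass,
# so no overlap between chunks is needed.
def process_chunked(tokens, chunk_size):
    kv_cache = []                                  # stands in for the shared KV cache
    for i in range(0, len(tokens), chunk_size):
        # Each chunk is processed with the full prior state already present.
        kv_cache.extend(tokens[i:i + chunk_size])
    return kv_cache

tokens = list(range(100))
assert process_chunked(tokens, 7) == process_chunked(tokens, 100) == tokens
print("identical final state regardless of chunk size")
```

Contrast with RAG embedding chunks, where each chunk is embedded in isolation and overlap is the only way to carry context across a boundary.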
Common mistake: Setting -b X -ub X (same value for both). This wastes
VRAM on a larger compute buffer without any benefit. The only reason to increase
-ub above 512 is if profiling shows a measurable prompt processing bottleneck,
which is rare in practice.