Problem / Background
mlxcel-server and llama.cpp's llama-server both expose a --parallel N flag alongside --ctx-size C for controlling concurrent request slots, but they implement fundamentally different semantics for how context is allocated across those slots. This divergence breaks any downstream client that targets both engines through a unified configuration surface — most notably Backend.AI GO, which emits the same flag values to both engines from a single code path.
For the same invocation --ctx-size C --parallel N:
| Engine | Per-slot context | Total KV cache memory |
| --- | --- | --- |
| llama.cpp / llama-server | C / N tokens (total context divided across slots) | roughly proportional to C only (constant in N) |
| mlxcel-server | C tokens per slot (each slot gets full context) | roughly proportional to C * N (linear in N) |
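The two rows of the table can be written out as a small calculation. This is only a sketch of the arithmetic: `SlotSemantics` and `slot_budget` are hypothetical names that exist in neither codebase, and the total is counted in tokens of KV cache, ignoring any pool-sizing details inside either engine.

```rust
/// How an engine interprets `--ctx-size C --parallel N`.
/// Hypothetical type, used only to illustrate the table above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SlotSemantics {
    /// llama-server: C is a total budget divided across N slots.
    Divided,
    /// mlxcel-server today: every slot gets the full C.
    FullPerSlot,
}

/// Returns (per-slot context, total KV tokens across all slots).
pub fn slot_budget(
    semantics: SlotSemantics,
    ctx_size: usize,
    n_parallel: usize,
) -> (usize, usize) {
    // Treat 0 as 1 so the division below cannot panic.
    let n = n_parallel.max(1);
    let per_slot = match semantics {
        SlotSemantics::Divided => ctx_size / n,
        SlotSemantics::FullPerSlot => ctx_size,
    };
    (per_slot, per_slot * n)
}
```

For `--ctx-size 4096 --parallel 2`, the divided semantics give 2048 tokens per slot and 4096 tokens of KV in total, while the full-per-slot semantics give 4096 tokens per slot and 8192 in total.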
Evidence in the mlxcel source
- CLI definition (matches --parallel byte-for-byte): src/bin/mlx_server.rs:175 declares parallel: usize, default 1, env LLAMA_ARG_N_PARALLEL
- Pass-through, no division: src/server/startup.rs:646 sets context_size: startup.ctx_size (1:1, not divided by n_parallel)
- Independent fields in config: src/server/config.rs:224-225 stores pub context_size: usize and pub n_parallel: usize separately
- KV cache pool sized at full depth: src/server/scheduler.rs:392 has let pool_capacity = max_batch_size + max_queue_depth; with each cache allocated at full context_size
- Docs make no mention of context division: docs/en/getting-started/configuration.md describes --n-parallel only as a concurrency control, and docs/CONTINUOUS_BATCHING.md confirms the continuous-batching design with full-depth per-sequence KV caches
Evidence in llama.cpp (upstream)
llama.cpp computes n_ctx_per_seq = n_ctx / n_seq_max. Documented in:
- --ctx-size is divided by --parallel and cannot be increased? (ggml-org/llama.cpp#11681)
Why this matters for downstream clients
Backend.AI GO (lablup/backend.ai-go) emits --parallel N to both engines from the same ServerConfig.to_args() code path (src-tauri/src/process/types.rs:996-1002). The user-facing UI carries a single i18n description for the setting: "Total context is divided equally among slots (e.g., 4096 context with 2 parallel slots gives 2048 tokens per slot)". That statement is correct for llama-server but wrong for mlxcel-server, so users following the same UI guidance get very different memory budgets and per-slot context windows depending on which engine they are using.
Concrete consequences:
- Memory planning is broken across engines: a user setting --parallel 4 --ctx-size 32768 on a 70B model expects llama.cpp-shaped memory and OOMs on mlxcel.
- Application-level defaults that make sense for llama.cpp (e.g., "set parallel to 2 for agent workloads") become memory-doubling decisions on mlxcel.
- There is no pre-flight memory check in mlxcel (no validation in the src/server/startup.rs startup path; no explicit clamp). The mismatch fails silently at runtime.
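To put rough numbers on the first consequence, the sketch below estimates KV cache size from a model shape. The shape is an assumption made purely for illustration (80 layers, 8 KV heads, head dimension 128, 16-bit cache entries, which is typical of a 70B-class model with grouped-query attention). It is not taken from the mlxcel source, and `KvShape`, `ASSUMED_70B_SHAPE`, and both functions are hypothetical names.

```rust
/// Hypothetical model shape, used only for this estimate.
pub struct KvShape {
    pub n_layers: u64,
    pub n_kv_heads: u64,
    pub head_dim: u64,
    pub bytes_per_elem: u64,
}

/// Assumed 70B-class shape (illustrative, not read from any model file).
pub const ASSUMED_70B_SHAPE: KvShape = KvShape {
    n_layers: 80,
    n_kv_heads: 8,
    head_dim: 128,
    bytes_per_elem: 2, // 16-bit cache entries
};

/// Bytes of KV cache per token: one key and one value vector
/// per KV head, for every layer.
pub fn kv_bytes_per_token(s: &KvShape) -> u64 {
    2 * s.n_layers * s.n_kv_heads * s.head_dim * s.bytes_per_elem
}

/// Bytes of KV cache needed to hold `tokens` tokens in total.
pub fn kv_bytes(s: &KvShape, tokens: u64) -> u64 {
    kv_bytes_per_token(s) * tokens
}
```

Under these assumptions, --ctx-size 32768 --parallel 4 needs about 10 GiB of KV cache when the context is divided (32768 tokens in total) and about 40 GiB when every slot gets the full context (4 * 32768 tokens), before counting model weights.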
Proposed Solution
Align mlxcel's --parallel semantics with llama.cpp's behavior so the same flag value produces equivalent per-slot context windows and equivalent memory footprints across the two engines.
Specifically:
- When --parallel N is passed alongside --ctx-size C, each active slot's per-sequence context should be C / N, computed as max(1, floor(C / N)) with a sensible minimum floor (e.g., reject configurations where C / N < 512 with a clear error message at startup).
- The KV cache pool at src/server/scheduler.rs:392 should allocate per-slot caches at the divided depth, not the full depth.
- Total KV memory for a given (C, N) pair should be roughly constant in N, matching llama.cpp.
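The division and the startup check from the first bullet could look roughly like the following. This is a minimal sketch, not mlxcel's actual code: `per_slot_context` and `MIN_CTX_PER_SLOT` are hypothetical names, the 512-token floor is the value suggested above, and the error is a plain `String` for brevity.

```rust
/// Suggested minimum per-slot context (the 512-token floor proposed above).
pub const MIN_CTX_PER_SLOT: usize = 512;

/// Computes the per-slot context for `--ctx-size C --parallel N`,
/// rejecting configurations that leave each slot too small.
pub fn per_slot_context(ctx_size: usize, n_parallel: usize) -> Result<usize, String> {
    if n_parallel == 0 {
        return Err("--parallel must be at least 1".to_string());
    }
    // Integer division on usize already rounds down, i.e. floor(C / N).
    let per_slot = ctx_size / n_parallel;
    if per_slot < MIN_CTX_PER_SLOT {
        return Err(format!(
            "--ctx-size {} with --parallel {} gives {} tokens per slot, below the minimum of {}",
            ctx_size, n_parallel, per_slot, MIN_CTX_PER_SLOT
        ));
    }
    Ok(per_slot)
}
```

With the floor enforced as a hard error, the max(1, ...) clamp is no longer needed: any result below 512 is rejected before it could reach zero. The scheduler would then allocate each cache at the returned depth instead of the full context_size.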
Migration / backward compatibility options
This is a behavior change that may break existing mlxcel deployments relying on the current "full context per slot" semantics. Maintainers should choose one of the following — we have no strong preference; we just need parity:
1. Direct change with CHANGELOG entry. Accept the break, document it in the CHANGELOG, and bump the minor version. Cleanest semantics; mirrors llama.cpp exactly.
2. New flag --ctx-size-per-seq. Make per-slot context size an explicit, separate knob; when --ctx-size-per-seq is unset, treat --ctx-size as the total and divide it llama.cpp-style. Most flexible, most code.
3. Compatibility mode flag --llama-server-compat. Opt in to llama.cpp semantics; the default keeps current behavior, with a startup warning that the default will flip in a future release. Safest for existing users.
If a tiebreaker helps, we lean toward (1) or (3) for clean long-term semantics.
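The three options differ only in how the per-slot context is resolved from the flags, which the sketch below makes explicit. Everything here is hypothetical: the `Migration` enum, its field names, and `resolve_per_slot` are illustrative stand-ins, and the two flags exist only as proposals in this issue.

```rust
/// Which migration option is in effect (all names are hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Migration {
    /// Option 1: always divide, as llama.cpp does.
    DirectChange,
    /// Option 2: an explicit --ctx-size-per-seq wins; otherwise divide.
    PerSeqFlag { ctx_size_per_seq: Option<usize> },
    /// Option 3: divide only when --llama-server-compat is set.
    CompatFlag { llama_server_compat: bool },
}

/// Resolves the per-slot context for a given option and flag values.
pub fn resolve_per_slot(m: Migration, ctx_size: usize, n_parallel: usize) -> usize {
    // Treat 0 as 1 so the division cannot panic.
    let divided = ctx_size / n_parallel.max(1);
    match m {
        Migration::DirectChange => divided,
        Migration::PerSeqFlag { ctx_size_per_seq } => ctx_size_per_seq.unwrap_or(divided),
        Migration::CompatFlag { llama_server_compat } => {
            if llama_server_compat {
                divided
            } else {
                // Current mlxcel behavior: full context per slot.
                ctx_size
            }
        }
    }
}
```

Only option 3 with the flag unset preserves today's full-context-per-slot behavior. Options 1 and 2 both change the default, and option 2 additionally lets a deployment pin the old per-slot size explicitly.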
Acceptance Criteria
- --parallel N --ctx-size C results in per-slot KV cache sized for C / N tokens (with the chosen migration strategy above).
- The /slots endpoint (already exposed when --slots is enabled, per src/server/startup.rs:647) reports the per-slot context size, not the total.
- The --help text on the --parallel flag (currently "Number of parallel request slots" at src/bin/mlx_server.rs:175) explicitly states the context-division semantics.
- Documentation is updated to describe the new semantics:
  - docs/en/getting-started/configuration.md
  - docs/ko/
  - docs/CONTINUOUS_BATCHING.md
  - docs/man/mlxcel-server.1
- Tests cover per-slot context equal to ctx_size / n_parallel for several (C, N) pairs.
- Startup fails with a clear error when ctx_size / n_parallel falls below a sensible floor (suggested: 512 tokens).
- Total KV memory for a given (C, N) pair is verified to be roughly constant in N (matching llama.cpp).
Technical Considerations
Cross-references
- llama.cpp: --ctx-size is divided by --parallel and cannot be increased? (ggml-org/llama.cpp#11681) and Parallelization / Batching Explanation (ggml-org/llama.cpp#4130)
- lablup/backend.ai-go:
  - src-tauri/src/process/types.rs:996-1002 (arg emission)
  - src/types/modelConfig.ts:363-373 (slider)
  - src/components/ModelConfigDrawer/ContextTab.tsx:227-245 (control)
  - modelConfig.context.parallelRequests / parallelRequestsDesc (currently llama.cpp-shaped)
Downstream coordination
A separate backend.ai-go memory-guard issue is already filed to add pre-flight memory checks downstream. Once this mlxcel-side change lands, the memory math on the Backend.AI GO side becomes engine-independent and that guard can compute a single number for both engines.
Impact scope
This is a breaking change candidate that affects every downstream embedder of mlxcel-server, not just Backend.AI GO. The migration path (option 1 / 2 / 3 above) is the most important call for maintainers to make first.