fix: align parallel context sizing with slots #69
Merged
Summary
Implements issue #57 by aligning server parallel context allocation with llama.cpp server semantics:
- Treat `--ctx-size C` as a total context budget shared by active request slots.
- Give each slot `floor(C / N)` for `--parallel N`.
- Use `--max-batch-size M` as the divisor because it is the actual concurrent decode-sequence limit.
- Keep `--no-batch` on single-slot semantics so it receives the full explicit context budget.
- `--max-kv-size`.
- `/slots` and `/health.context_size`.
- Update `serve --estimate-memory` so the preflight uses the same per-slot context and active-sequence count as runtime startup.
- `docs/environment-variables.md`, and `CHANGELOG.md`.

Note: the issue references stale doc paths (`docs/en/...`, `docs/ko/...`, `docs/CONTINUOUS_BATCHING.md`, `docs/man/...`) that do not exist in the current tree, so the operator-facing docs were added to the existing environment/flag reference instead.

Validation

- `cargo fmt --all -- --check`
- `cargo clippy -p mlxcel --lib --bin mlxcel --bin mlxcel-server --tests -- -D warnings`
- `cargo test -p mlxcel --lib context`
- `cargo test -p mlxcel --lib build_server_config`
- `cargo test -p mlxcel --bin mlxcel serve_preflight`
- `git diff --check`

Notes

The full `cargo test -p mlxcel --lib` run is still known to fail on the pre-existing NVFP4 sanitize test, which was reproduced on `origin/main` during the preceding issue #52 audit. This PR therefore relies on focused coverage for the changed behavior, plus clippy and fmt.
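
For reviewers, the sizing rule described in the Summary can be sketched roughly as follows. This is only an illustration of the arithmetic, not the code in this PR: the function name `per_slot_context` and its parameters are hypothetical, and the guard against a zero divisor is an assumption of the sketch.

```rust
/// Illustrative sketch only; `per_slot_context` is a hypothetical helper,
/// not a function from the mlxcel source tree.
///
/// `ctx_size` is the total budget from `--ctx-size C`.
/// `active_sequences` is the concurrent decode-sequence limit
/// (the `--max-batch-size M` divisor described in the Summary).
/// `no_batch` mirrors `--no-batch`, which keeps single-slot semantics.
pub fn per_slot_context(ctx_size: usize, active_sequences: usize, no_batch: bool) -> usize {
    if no_batch {
        // Single slot: it receives the full explicit context budget.
        return ctx_size;
    }
    // Assumption of this sketch: treat a zero divisor as one slot
    // so the division cannot panic.
    let divisor = active_sequences.max(1);
    // Integer division on unsigned values is floor(C / N).
    ctx_size / divisor
}
```

Under these assumptions, a budget of 8192 shared by 4 sequences gives 2048 tokens per slot, 8192 shared by 3 gives 2730 (the remainder is dropped), and `--no-batch` keeps the full 8192.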