This guide is the practical checklist for adding new text models and VLMs to mlxcel.
It points to the concrete control surfaces that must stay consistent. If this repository later adds a maintainer workflow document, keep this checklist aligned with it.
- Keep new model additions predictable.
- Reuse existing control-plane helpers instead of adding new one-off branches.
- Add tests alongside the integration points that are easiest to regress.
- Treat
mlx-lm/mlx-vlmas useful references, not as the only acceptable source for a port.
- Identify the implementation source you will use for the port. This can be
an MLX reference implementation, but it does not have to be one:
mlx-lmtext model implementations when the family already exists there.mlx-vlmimplementations when the family already exists there.- Hugging Face Transformers or an official PyTorch implementation when it is the clearest source of truth for config fields, tensor names, module layout, forward semantics, and processor behavior.
- vLLM or SGLang implementations when production inference behavior is the useful reference, especially for KV-cache layout, paged-attention assumptions, MoE routing, rope scaling, speculative paths, or multimodal request preparation.
- Vendor model repositories, model cards, conversion scripts, or released inference examples when they are the only public source for architecture quirks.
- No complete reference implementation, when you only have a checkpoint,
config.json, tokenizer/processor files, a paper, or partial vendor notes. This is acceptable, but it should be treated as a reconstruction task with tighter validation checkpoints. - If local
references/checkouts are not present, clone or inspect the relevant upstream repositories separately. Do not vendor those repositories into this tree.
- Decide whether the architecture is:
- A brand new model family
- A format alias of an existing family
- A VLM wrapper around an existing text model
- Check whether an existing loader helper already matches the new model:
src/model_metadata.rssrc/loading/mod.rssrc/loading/vlm.rssrc/models/mod.rs- Start with the converged registration surface in
src/model_metadata.rs. Standard text models should extend that registration table first, becausesrc/loading/config_backed.rsnow consumes the same source of truth.
- Check whether the change also touches shared execution policy:
src/execution/runtime.rsfor device/environment behaviorsrc/execution/sampling.rsfor user-facing sampling defaults and greedy-vs-sampled assembly
The goal is not to mechanically translate one Python file into Rust. The goal
is to identify the model contract that mlxcel must implement: config
normalization, tensor naming, graph topology, cache semantics, prompt or image
preparation, and generation behavior.
Prefer the reference that is closest to the question you are answering:
- MLX references (
mlx-lm,mlx-vlm) are usually the fastest path when they already support the model, because checkpoint loading, quantization conventions, and MLX tensor behavior tend to matchmlxcelclosely. - Hugging Face Transformers or official PyTorch code is often the architecture source of truth. Use it to confirm module shapes, config field names, activation order, normalization placement, rotary embedding behavior, tied embeddings, and processor/tokenizer conventions.
- vLLM and SGLang are useful production inference references. Use them to
understand serving-time details such as KV-cache shape and update policy,
paged attention assumptions, MoE expert routing, prefix caching, speculative
decode behavior, and multimodal batching constraints. Map those ideas onto
existing
mlxcelcache/runtime abstractions instead of importing their scheduler or CUDA-specific structure directly. - Other PyTorch inference engines or vendor examples can be the best source for model-specific quirks, especially when the model has not landed in MLX or Transformers yet. Capture the exact commit or release used for validation in the PR description or benchmark notes.
When references disagree, record which behavior is authoritative for the checkpoint you are adding. For example, a model card may document the chat template, Transformers may define the tensor/module contract, and vLLM may show the serving-time cache layout. Keep those responsibilities separate while porting.
Do not copy reference-code boundaries blindly:
- Keep route selection in
src/model_metadata.rsandsrc/loading/. - Keep prompt, media, and processor policy in the existing multimodal helpers.
- Keep serving behavior in the shared execution/server layers.
- Preserve
mlxcelnaming and test conventions even when the reference uses a different file layout.
Some model additions start from a checkpoint and metadata rather than a working inference implementation. In that case, make the first PR a conservative loader/runtime reconstruction rather than a broad family port.
Use all available artifacts as partial references:
config.json,generation_config.json, tokenizer files, processor files, and chat templates.- SafeTensors key names, tensor shapes, quantization metadata, and tied-weight relationships.
- Model card notes, architecture diagrams, paper equations, release examples, and conversion scripts.
- The nearest existing family in
mlxcel,mlx-lm, Transformers, vLLM, SGLang, or another PyTorch inference engine.
Recommended workflow:
- Inspect the checkpoint first. Build a tensor-name and shape inventory before writing model code, and compare it with the nearest existing family.
- Identify the minimum viable path: text-only before VLM, single-device before tensor/pipeline parallelism, greedy decode before advanced sampling behavior.
- Add explicit config normalization for every inferred default. Do not hide guessed defaults inside the model constructor.
- Keep unsupported variants out of detection until they are validated with a real checkpoint.
- Add shape/config tests even before numerical parity is available.
- Run a real smoke test and record the prompt, generated token count, and any known limitations in the PR or benchmark notes.
Validation expectations are different without a reference. Exact parity may not be possible at first, but the implementation should still prove that:
- all required tensors are consumed or intentionally ignored
- tensor shapes match the reconstructed graph
- cache updates advance correctly across prefill and decode
- generation is stable for at least one real checkpoint
- failures are explicit for unsupported configs instead of silently falling through to a wrong family
If later a reference implementation appears, add a follow-up comparison against that implementation and tighten the tests or benchmark notes accordingly.
- Add the implementation file under
src/models/. - Register the module and re-export in
src/models/mod.rs. - Add a
ModelTypevariant insrc/models/mod.rs. - Extend
get_model_type()insrc/models/detection.rs.- Prefer shared helpers such as
detect_text_or_vlm()anddetect_hunyuan_model_type()when the new model fits an existing pattern.
- Prefer shared helpers such as
- Add the corresponding
LoadedModelvariant insrc/loaded_model.rs.- Prefer extending the existing dispatch helpers instead of adding new repeated match tables:
delegate_language_model!insrc/loaded_model.rsandVlmRuntimeRefinsrc/loaded_model_capabilities.rs
- Prefer extending the existing dispatch helpers instead of adding new repeated match tables:
- Wire loading in
src/loading/mod.rs.- Prefer existing helpers like
load_pair_from_dir()andload_owned_model_from_config!. - Update
src/model_metadata.rsso kind, adapter support, route selection, and standard config-backed registration stay centralized before touching the router. - If the model follows the standard text-model path, extend the shared
registration surface in
src/model_metadata.rsinstead of adding a parallel entry list insrc/loading/config_backed.rs.
- Prefer existing helpers like
- If LoRA/adapters are supported, verify
load_model_from_weights()insrc/loading/mod.rs.- Non-standard adapter paths should extend
src/loading/special.rsinstead of growingload_model_from_weights()directly.
- Non-standard adapter paths should extend
- Implement or reuse the vision encoder under
src/vision/encoders/. - Implement or reuse the connector under
src/vision/connectors/. - Implement or reuse the processor under
src/vision/processors/. - Add the VLM
ModelTypedetection insrc/models/detection.rs.- If the base text family has both text-only and VLM variants, prefer
detect_text_or_vlm().
- If the base text family has both text-only and VLM variants, prefer
- Add the loader entry in
src/loading/vlm.rsor the matching family module undersrc/loading/.- Prefer shared helpers for config parsing, token defaults, and weight remapping.
- Keep family-specific assembly grouped with its nearest peers:
src/loading/vlm_qwen.rs,src/loading/vlm_llava.rs,src/loading/vlm_gemma.rs,src/loading/vlm_pixtral.rs,src/loading/vlm_siglip.rs,src/loading/vlm_special.rs. - Update
src/model_metadata.rsso the router knows the family is multimodal and adapter loading policy remains explicit. - Register the directory entry point in
try_load_vlm_model_from_dir()insrc/loading/mod.rssoload_model()stays as a thin dispatcher.
- Add or reuse the
LoadedModelcapability helpers insrc/loaded_model_capabilities.rs.- Prefer extending
VlmRuntimeRefor an existing multimodal helper over adding family-specific getters that only CLI/server use.
- Prefer extending
- Reuse prompt helpers where possible:
- Qwen-VL token insertion:
src/multimodal/qwen_vl.rs - Generic image-token block expansion:
src/multimodal/vlm_prompt.rs - Phi3V prompt tag handling:
src/multimodal/phi3v_prompt.rs
- Qwen-VL token insertion:
Do not create a new file by default. Create one when the family has a distinct control-plane identity.
Create a new loader family module when:
- config normalization is not a small variant of an existing family
- token defaults or weight-key remapping need dedicated tests
- the VLM wrapper uses a different prompt/runtime assembly path
- adding the logic inline would obscure an existing family boundary
Keep the model in an existing module when:
- it is primarily an alias or small config delta
- the same loader tests already express the policy
- the family is still recognizable after the change
If you are unsure, extend the existing family module first and split only when the test file or router starts to lose a clear boundary.
src/models/mod.rs- Missing module export or
ModelTypevariant
- Missing module export or
src/models/detection.rs- Missing aliases in
get_model_type() - Text/VLM misclassification when
vision_configis present
- Missing aliases in
src/loading/mod.rs- Divergence between
load_model()andload_model_from_weights() - Adding a standard config-backed model as a one-off special case instead of the shared loader helpers
- Adding a VLM directly into
load_model()instead oftry_load_vlm_model_from_dir()
- Divergence between
src/model_metadata.rs- Forgetting to update text/VLM kind, adapter support, route policy, or the shared standard-text registration entry before wiring loaders
src/loading/config_backed.rs- Bypassing the shared registration surface and adding new one-off loader logic
- Forgetting wrapper constructors for models such as
Llama4,Gemma3, orMinistral3
src/loading/nonstandard.rs- Leaving directory-only loader families in
src/loading/mod.rsinstead of the non-standard registry
- Leaving directory-only loader families in
src/loading/special.rs- Adding adapter/owned-weight special handling inline instead of the special-weight registry
- Forgetting Qwen3.5 text-config normalization or owned-weight sanitization before construction
Keep src/loading/mod.rs focused on route selection. If a new model family adds
substantial construction logic, prefer a dedicated sibling module and call it
from the router instead of growing load_model() or load_model_from_weights()
inline.
For very large model families, extract internal helper hotspots into a focused sibling helper module when the code changes for different reasons than the main decoder stack. Current examples:
src/models/gemma3n_helpers.rssrc/models/llama4_helpers.rssrc/loading/vlm.rsand sibling family modules undersrc/loading/- Wrong default token IDs
- Missing top-level quantization inheritance
- Incorrect weight-key remapping between text and vision towers
src/loaded_model.rs/src/loaded_model_capabilities.rs- Missing dispatch arm for a new variant
- Missing capability wiring in
VlmRuntimeRef - Updating the all-model dispatch macro but forgetting the multimodal capability switchboard
Add tests in the same slice as the model/control-plane change.
- For model detection helpers:
- Add unit tests near
src/models/detection.rs
- Add unit tests near
- For sanitization helpers:
- Add unit tests near
src/models/sanitize.rs
- Add unit tests near
- For loader normalization or token-default logic:
- Add tests near
src/loading/tests.rsor the relevantsrc/loading/vlm*_tests.rs
- Add tests near
- For shared vision merge contracts:
- Add tests near
src/vision/merge_tests.rs
- Add tests near
- For prompt/token expansion logic:
- Add tests in dedicated helper test files such as
src/multimodal/qwen_vl_tests.rs,src/multimodal/vlm_prompt_tests.rs,src/multimodal/phi3v_prompt_tests.rs
- Add tests in dedicated helper test files such as
- For runtime validation:
- Run
scripts/run_quality_gate.sh - Add at least one local smoke test when a matching model exists
- If the slice touched MLX-heavy ignored helper tests, run them explicitly
with
--ignored --test-threads=1
- Run
When touching shared functions used by multiple model families, update the local usage comments in the shared helper files, especially under:
src/lib/mlxcel-core/src/layers.rssrc/lib/mlxcel-core/src/utils.rs
Those comments act as the retest list for future changes.
Keep entry-point policy in the shared execution layer when the behavior must be identical across CLI, server, and future frontends.
src/execution/runtime.rs- Environment-driven device selection (
MLXCEL_DEVICE) - GPU wired-memory limit setup
- Environment-driven device selection (
src/execution/sampling.rs- Centralized
SamplingConfigassembly from resolved request defaults - Shared greedy vs non-greedy branching
- Centralized
If a new frontend or request type needs different defaults, resolve those
defaults at the edge and keep the final conversion in src/execution/.
Keep CLI-only prompt formatting and terminal output behavior in
src/commands/generate.rs instead of moving it into shared loading or server
modules.
For server-only boot behavior, keep startup policy in src/server/startup.rs
instead of growing src/server/mod.rs:
- API key / chat-template resolution precedence
- startup-time normalization of CLI-compatible flags
- warmup behavior
- Unix-socket vs TCP binding
Keep shared server types in the focused modules as well:
src/server/config.rsfor request/default configuration structssrc/server/state.rsforAppStateand metrics containerssrc/server/model_provider.rsfor the public request/response channel APIsrc/server/model_worker.rsfor the long-lived worker thread, VLM request prep, and decode state
Keep server edge adapters out of the route files once more than one endpoint needs the same behavior:
src/server/chat_request.rsfor OpenAI chat message flattening and prompt fallback assemblysrc/server/request_options.rsfor request-default merging intoServerGenerateOptionssrc/server/media.rsfordata:/file://image-source parsingsrc/server/streaming.rsfor shared SSE channel and[DONE]emission helpers
This section validates that the current architecture actually reduces ambiguity when adding new model support.
Assume a new text model that follows the existing config-backed loading path.
Required surfaces today:
src/models/<family>.rssrc/models/mod.rssrc/models/detection.rssrc/model_metadata.rsthrough the converged registration surface plusstatic_model_descriptor()/model_load_policy()src/loading/config_backed.rsonly if shared config-backed loading behavior itself must changesrc/loaded_model.rssrc/loaded_model_capabilities.rsonly if the family changes multimodal capability exposure- tests near
src/models/detection_tests.rs,src/models/sanitize_tests.rs, andsrc/loading/tests.rs
What should not happen:
- no new one-off construction branch inside
load_model() - no direct CLI or server changes unless user-visible behavior changes
- no family-specific getter added to
LoadedModelif an existing capability is enough
Why this is better than the old path:
- route selection is centralized instead of duplicated across multiple loading matches
- adapter support is declared in one policy surface
- standard text constructor registration no longer lives in a separate parallel table
- the expected edit list is short enough to review before coding starts
Assume a new VLM family needs its own token defaults and weight-key remapping.
Required surfaces today:
- text model and/or VLM wrapper under
src/models/orsrc/vision/ src/models/mod.rssrc/models/detection.rssrc/loading/vlm_<family>.rssrc/loading/vlm.rssrc/model_metadata.rsthroughstatic_model_descriptor()/model_load_policy()src/loaded_model.rsthrough enum wiringsrc/loaded_model_capabilities.rsthroughVlmRuntimeRefsrc/multimodal/only if prompt/runtime preparation is truly new- tests near
src/loading/vlm_<family>_tests.rsand any new multimodal helper test file
What should not happen:
- no concrete model-type checks added to CLI or server request paths
- no family-specific loading logic added directly to
src/loading/mod.rs - no duplicated prompt-rewrite logic across CLI and server
Why this is better than the old path:
- the family router lives in
src/loading/vlm.rs - family assembly stays beside peer VLM loaders
- multimodal frontends depend on capabilities, not family names
Before opening a PR for a new model or VLM family, confirm:
- loading policy was updated through
src/model_metadata.rs LoadedModelwiring stayed insidesrc/loaded_model.rsandsrc/loaded_model_capabilities.rsrather than creating a new one-off family getter- CLI and server still depend on shared helpers rather than the concrete model type
- unit tests cover the new policy or normalization logic
- at least one smoke test exists when a local model is available
Prefer small checkpoints that isolate one control-plane surface:
- model detection
- loader normalization
- prompt preparation
- runtime initialization
This keeps regressions searchable and makes future model additions easier to compare against previous slices.