
feat: runtime model config from HuggingFace config.json#3

Open
Alexintosh wants to merge 7 commits into danveloper:main from Alexintosh:feature/runtime-model-config

Conversation

@Alexintosh

Summary

  • Replace ~54 hardcoded #define model constants with a runtime ModelConfig struct populated from HuggingFace config.json at startup via NSJSONSerialization. Switch between Qwen3.5 models (35B, 122B, 397B) with just --model <path> — no recompilation needed.
  • Add model_manager.py utility to list local compatible models, search HuggingFace for MLX-quantized Qwen3.5 MoE models, download them, and validate compatibility.
  • Update README with compatible models table, model manager docs, --model flag usage, and FLASH_MOE_MODEL env var.

What changed in infer.m

  • ModelConfig struct + load_model_config() parses config.json (architecture, quantization, layer types, RoPE, EOS tokens) and tokenizer.json (think tokens)
  • compute_expert_offsets() derives all expert byte offsets from dimensions + quantization params
  • alloc_tracking_arrays() dynamically allocates all tracking arrays (expert freq, cache state, predictions, layer cache) previously sized by compile-time constants
  • ~960 #define references replaced with cfg.* fields via helper macros (FREQ(), CACHE_SEEN(), PRED_EXPERT(), etc.)
  • MetalCtx buffer arrays converted from fixed-size to dynamically allocated (__strong ARC pointers)
  • Validated: compiles clean, runs 5.03 tok/s on 35B-A3B 4-bit, correct output
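The offset derivation in compute_expert_offsets() can be sketched roughly as follows. This is a hedged Python illustration, not the actual Objective-C implementation: it assumes MLX's quantized layout (4-bit weights packed two per byte, plus an fp16 scale and fp16 bias per group of group_size elements), and the dimensions, matrix names (gate/up/down), and function signatures are illustrative.

```python
def expert_bytes(rows, cols, bits=4, group_size=64):
    """Bytes for one quantized weight matrix, assuming an MLX-style layout:
    packed low-bit weights plus fp16 scale + fp16 bias per quantization group."""
    n = rows * cols
    packed = n * bits // 8          # e.g. 4-bit weights pack two per byte
    groups = n // group_size
    return packed + 2 * 2 * groups  # 2 bytes each for fp16 scale and bias

def compute_expert_offsets(hidden, moe_intermediate, bits=4, group_size=64):
    """Byte offsets of the gate/up/down projections within one expert's
    contiguous blob, and the per-expert stride (illustrative layout)."""
    gate = expert_bytes(moe_intermediate, hidden, bits, group_size)
    up = expert_bytes(moe_intermediate, hidden, bits, group_size)
    down = expert_bytes(hidden, moe_intermediate, bits, group_size)
    offsets = {"gate": 0, "up": gate, "down": gate + up}
    stride = gate + up + down       # total bytes per expert
    return offsets, stride
```

The point is that once hidden size, MoE intermediate size, bit width, and group size are known from config.json, every expert byte offset follows arithmetically, so no per-model #define table is needed.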

Test plan

  • Compile with cd metal_infer && make
  • Run 35B model: ./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt "What is 2+2?" --tokens 20
  • Verify config summary printed to stderr on startup
  • Run python model_manager.py --local to list cached models
  • Run python model_manager.py --search to find remote models
  • Test FLASH_MOE_MODEL env var as default model path
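The model-path precedence exercised by the last test-plan item can be sketched in Python; this is an illustrative model of the intended behavior (--model flag wins, then the FLASH_MOE_MODEL env var, then a built-in default), not the actual C resolution code.

```python
import os

def resolve_model_path(cli_path=None, default=None):
    """Resolve the model directory with the precedence described in the
    test plan: explicit --model argument > FLASH_MOE_MODEL env var > default."""
    if cli_path:
        return cli_path
    env_path = os.environ.get("FLASH_MOE_MODEL")
    if env_path:
        return env_path
    return default
```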

🤖 Generated with Claude Code

Alexintosh and others added 7 commits March 20, 2026 22:20
Spec for replacing ~40 hardcoded #define model constants with a
runtime ModelConfig struct populated from HuggingFace config.json,
enabling model switching via --model flag without recompilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing arrays (g_lz4_index, g_pred_experts, g_pred_count,
stack VLAs), full_attn_interval fallback, thread safety invariant,
MODEL_PATH_DEFAULT handling, MAX_BATCH_SLOTS coupling note, and
clarify chat.m needs zero changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ModelConfig struct, compute_expert_offsets(), and
load_model_config() that parses HuggingFace config.json +
tokenizer.json via NSJSONSerialization. Old #defines still present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove ~54 model-specific #define constants and replace ~960
occurrences with cfg.* runtime struct fields. Convert 13 static/
stack arrays to dynamic allocation. Parse config.json + tokenizer.json
at startup via NSJSONSerialization. Expert byte offsets computed from
model dimensions and quantization params.

Switching models now requires only --model flag, no recompilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Generalize file header comment to describe multi-model support.
Update startup banner from hardcoded model name to "Flash-MoE"
with dynamic config path display.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e models

Lists local HF-cached models with compatibility check, searches
HuggingFace for compatible Qwen3.5 MoE models (35B-A3B, 122B-A10B,
397B-A17B) with MLX quantization, and supports downloading via
huggingface-cli or huggingface_hub.

Usage:
  python model_manager.py              # list local + remote
  python model_manager.py --local      # local only
  python model_manager.py --search     # remote only
  python model_manager.py --download <repo>
  python model_manager.py --check <path>
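The --local listing above can be sketched as a scan of the HuggingFace hub cache; this is a simplified illustration of the idea, not the code in model_manager.py, and the cache layout (models--*/snapshots/*/config.json) is the standard huggingface_hub convention.

```python
import json
from pathlib import Path

HF_HUB = Path.home() / ".cache" / "huggingface" / "hub"

def list_local_models(hub_dir=HF_HUB):
    """Yield (repo_dir_name, parsed config.json) for each cached model repo.
    Reads one snapshot per repo, which is enough for a compatibility listing."""
    for model_dir in sorted(Path(hub_dir).glob("models--*")):
        for config_path in model_dir.glob("snapshots/*/config.json"):
            with open(config_path) as f:
                cfg = json.load(f)
            yield model_dir.name, cfg
            break
```

A compatibility check would then inspect fields such as the architecture and quantization entries of each returned config, as the real script does before reporting a model as usable.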

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add compatible models table, model manager usage instructions,
updated quick start with --model flag and FLASH_MOE_MODEL env var,
revised project structure, and generalized architecture description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fiveangle

Download complete: 20.4GB [06:39, 51.1MB/s]
/Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit/snapshots/1e20fd8d42056f870933bf98ca6211024744f7ec

Model is compatible with Flash-MoE!
  Architecture:  40 layers, hidden=2048, 256 experts (K=8)
  Quantization:  4-bit, group_size=64
  Expert data:   16.9 GB on disk, 13.5 MB active/token
  Parameters:    ~64B total
  Packed experts: NOT FOUND (run repack_experts.py)
  Weights file:   OK
  Status:        NEEDS PREPARATION (see above)

Next steps:
  1. python repack_experts.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
  2. python metal_infer/extract_weights.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
  3. ./metal_infer/infer --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt 'Hello' --tokens 20
~/dev/flash-moe$ python repack_experts.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
zsh: command not found: python
~/dev/flash-moe$ python3 repack_experts.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
usage: repack_experts.py [-h] [--index INDEX] [--layers LAYERS] [--dry-run] [--verify-only LAYER]
repack_experts.py: error: unrecognized arguments: --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit

