
feat: runtime model config from HuggingFace config.json#3

Open
Alexintosh wants to merge 7 commits into danveloper:main from Alexintosh:feature/runtime-model-config

Conversation

@Alexintosh

Summary

  • Replace ~54 hardcoded #define model constants with a runtime ModelConfig struct populated from HuggingFace config.json at startup via NSJSONSerialization. Switch between Qwen3.5 models (35B, 122B, 397B) with just --model <path> — no recompilation needed.
  • Add model_manager.py utility to list local compatible models, search HuggingFace for MLX-quantized Qwen3.5 MoE models, download them, and validate compatibility.
  • Update README with compatible models table, model manager docs, --model flag usage, and FLASH_MOE_MODEL env var.

What changed in infer.m

  • ModelConfig struct + load_model_config() parses config.json (architecture, quantization, layer types, RoPE, EOS tokens) and tokenizer.json (think tokens)
  • compute_expert_offsets() derives all expert byte offsets from dimensions + quantization params
  • alloc_tracking_arrays() dynamically allocates all tracking arrays (expert freq, cache state, predictions, layer cache) previously sized by compile-time constants
  • ~960 #define references replaced with cfg.* fields via helper macros (FREQ(), CACHE_SEEN(), PRED_EXPERT(), etc.)
  • MetalCtx buffer arrays converted from fixed-size to dynamically allocated (__strong ARC pointers)
  • Validated: compiles clean, runs 5.03 tok/s on 35B-A3B 4-bit, correct output
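The offset derivation in compute_expert_offsets() can be sketched roughly as follows. This is a hedged Python illustration, not the actual Objective-C implementation: it assumes MLX's quantized layout (4-bit weights packed two per byte, plus an fp16 scale and fp16 bias per group of group_size elements), and the dimensions, matrix names (gate/up/down), and function signatures are illustrative.

```python
def expert_bytes(rows, cols, bits=4, group_size=64):
    """Bytes for one quantized weight matrix, assuming an MLX-style layout:
    packed low-bit weights plus fp16 scale + fp16 bias per quantization group."""
    n = rows * cols
    packed = n * bits // 8          # e.g. 4-bit weights pack two per byte
    groups = n // group_size
    return packed + 2 * 2 * groups  # 2 bytes each for fp16 scale and bias

def compute_expert_offsets(hidden, moe_intermediate, bits=4, group_size=64):
    """Byte offsets of the gate/up/down projections within one expert's
    contiguous blob, and the per-expert stride (illustrative layout)."""
    gate = expert_bytes(moe_intermediate, hidden, bits, group_size)
    up = expert_bytes(moe_intermediate, hidden, bits, group_size)
    down = expert_bytes(hidden, moe_intermediate, bits, group_size)
    offsets = {"gate": 0, "up": gate, "down": gate + up}
    stride = gate + up + down       # total bytes per expert
    return offsets, stride
```

The point is that once hidden size, MoE intermediate size, bit width, and group size are known from config.json, every expert byte offset follows arithmetically, so no per-model #define table is needed.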

Test plan

  • Compile with cd metal_infer && make
  • Run 35B model: ./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt "What is 2+2?" --tokens 20
  • Verify config summary printed to stderr on startup
  • Run python model_manager.py --local to list cached models
  • Run python model_manager.py --search to find remote models
  • Test FLASH_MOE_MODEL env var as default model path
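The model-path precedence exercised by the last test-plan item can be sketched in Python; this is an illustrative model of the intended behavior (--model flag wins, then the FLASH_MOE_MODEL env var, then a built-in default), not the actual C resolution code.

```python
import os

def resolve_model_path(cli_path=None, default=None):
    """Resolve the model directory with the precedence described in the
    test plan: explicit --model argument > FLASH_MOE_MODEL env var > default."""
    if cli_path:
        return cli_path
    env_path = os.environ.get("FLASH_MOE_MODEL")
    if env_path:
        return env_path
    return default
```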

🤖 Generated with Claude Code

Alexintosh and others added 7 commits March 20, 2026 22:20
Spec for replacing ~40 hardcoded #define model constants with a
runtime ModelConfig struct populated from HuggingFace config.json,
enabling model switching via --model flag without recompilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing arrays (g_lz4_index, g_pred_experts, g_pred_count,
stack VLAs), full_attn_interval fallback, thread safety invariant,
MODEL_PATH_DEFAULT handling, MAX_BATCH_SLOTS coupling note, and
clarify chat.m needs zero changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ModelConfig struct, compute_expert_offsets(), and
load_model_config() that parses HuggingFace config.json +
tokenizer.json via NSJSONSerialization. Old #defines still present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove ~54 model-specific #define constants and replace ~960
occurrences with cfg.* runtime struct fields. Convert 13 static/
stack arrays to dynamic allocation. Parse config.json + tokenizer.json
at startup via NSJSONSerialization. Expert byte offsets computed from
model dimensions and quantization params.

Switching models now requires only --model flag, no recompilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Generalize file header comment to describe multi-model support.
Update startup banner from hardcoded model name to "Flash-MoE"
with dynamic config path display.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e models

Lists local HF-cached models with compatibility check, searches
HuggingFace for compatible Qwen3.5 MoE models (35B-A3B, 122B-A10B,
397B-A17B) with MLX quantization, and supports downloading via
huggingface-cli or huggingface_hub.

Usage:
  python model_manager.py              # list local + remote
  python model_manager.py --local      # local only
  python model_manager.py --search     # remote only
  python model_manager.py --download <repo>
  python model_manager.py --check <path>
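The --local listing above can be sketched as a scan of the HuggingFace hub cache; this is a simplified illustration of the idea, not the code in model_manager.py, and the cache layout (models--*/snapshots/*/config.json) is the standard huggingface_hub convention.

```python
import json
from pathlib import Path

HF_HUB = Path.home() / ".cache" / "huggingface" / "hub"

def list_local_models(hub_dir=HF_HUB):
    """Yield (repo_dir_name, parsed config.json) for each cached model repo.
    Reads one snapshot per repo, which is enough for a compatibility listing."""
    for model_dir in sorted(Path(hub_dir).glob("models--*")):
        for config_path in model_dir.glob("snapshots/*/config.json"):
            with open(config_path) as f:
                cfg = json.load(f)
            yield model_dir.name, cfg
            break
```

A compatibility check would then inspect fields such as the architecture and quantization entries of each returned config, as the real script does before reporting a model as usable.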

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add compatible models table, model manager usage instructions,
updated quick start with --model flag and FLASH_MOE_MODEL env var,
revised project structure, and generalized architecture description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fiveangle

Download complete: 20.4GB [06:39, 51.1MB/s]
/Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit/snapshots/1e20fd8d42056f870933bf98ca6211024744f7ec

Model is compatible with Flash-MoE!
  Architecture:  40 layers, hidden=2048, 256 experts (K=8)
  Quantization:  4-bit, group_size=64
  Expert data:   16.9 GB on disk, 13.5 MB active/token
  Parameters:    ~64B total
  Packed experts: NOT FOUND (run repack_experts.py)
  Weights file:   OK
  Status:        NEEDS PREPARATION (see above)

Next steps:
  1. python repack_experts.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
  2. python metal_infer/extract_weights.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
  3. ./metal_infer/infer --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt 'Hello' --tokens 20
~/dev/flash-moe$ python repack_experts.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
zsh: command not found: python
~/dev/flash-moe$ python3 repack_experts.py --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
usage: repack_experts.py [-h] [--index INDEX] [--layers LAYERS] [--dry-run] [--verify-only LAYER]
repack_experts.py: error: unrecognized arguments: --model /Users/speedster/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit

