Add Metal 4 M5 prefill optimizations #15
Conversation
Very significant speedup. I'm in the middle of a large refactoring so I can't merge right now, and there is the problem that I lack the hardware to maintain it later, but I'll see what I can do :) Thanks.
100% with you, we need to sort this out and let you get an M5 😎
This should solve #14
@ivanfioravanti potential idea to get this merged: we keep an m5-metal4 branch active and you try to keep it rebased, if you like the idea. And we document it.
@ivanfioravanti, just jumping in :) In a couple of days I'll switch to an M5 Max 128GB. If it could be useful, I'd be more than happy to help you maintain the branch.
Oh yes @ottaviofogliata, join the club! I'm trying to squeeze even more juice out of the various optimizations suggested in the Metal Performance Primitives (MPP) Programming Guide, without luck so far.
Force-pushed from 980ba1a to 68547f8
Logits are slightly different from the ones produced with --quality. Converting to draft while I investigate: greedy decoding matches exactly, but the underlying distribution differs.
This is a personal fork that combines two open upstream PRs (the support-q8_0-token-embd loader PR I sent to antirez/ds4, and ivanfioravanti's PR antirez#15 for M5 Metal 4 prefill optimizations) so I can run unmodified stock-recipe DeepSeek-V4-Flash GGUFs on M5 hardware before either PR lands upstream. The README explains:

* What the two combined PRs do and why they're combined here.
* Verified test matrix on M5 Max (antirez recipe, cyberneurova stock recipe, cyberneurova pre-harmonized).
* The known MPP F16 + cyberneurova interaction (workaround: `DS4_METAL_MPP_F16_DISABLE=1`) and why it's separate from the loader PR's scope.
* Build / run instructions for both recipes.
* Acknowledgements to antirez, ivanfioravanti, ggml/llama.cpp, and the cyberneurova research project.
* Pointer back to the upstream README for the original design and server/CLI docs (no duplication).
If someone at Apple knew what was good for their hardware sales, they would have an M5 Studio 256GB and an M5 MacBook Pro 128GB on @antirez's desk Monday afternoon. Unfortunately I don't know anyone at Apple, but hopefully there are some developers at Apple watching this project who will wake up to this opportunity.
@ucjonathan I was going to propose the same thing!
This comment was marked as resolved.
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32 router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the compressor APE path silently produces wrong output and prefill emits garbage tokens (typically <BOS> spam after a few coherent tokens). The prefill math itself is correct; the bug is in two compressor APE consumers that were updated to accept Q8_0 `ape_type` but couldn't read the Q8_0 byte layout correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd for the prefill APE byte-strided dequant): produces silently wrong output on M5 Max for the compressor APE shapes (4 rows x 1024 cols). Replaced with a CPU-side dequant into a per-call private MTLBuffer. The CPU dequant matches the gguf-py reference byte-for-byte (verified with a standalone numeric check); the Metal kernel did not.
2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in metal/dsv4_kv.metal): only handled F16/F32 `ape_type`; Q8_0 fell into the F32 else branch and read garbage. Added a Q8_0 branch that walks the block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for each compressor invocation, not the shared `g_compressor_store_ape_buffer`: multiple CPU writes to one shared buffer in the same command buffer collapse to the last write at execute time (Metal kernels run in encode order, but CPU writes don't participate in that ordering when the same scratch buffer is reused). The per-call buffer is retained until command-buffer completion via `addCompletedHandler`, because Metal does not strongly retain buffers bound to encoders.

Changes:

* ds4_metal.m: new `ds4_metal_half_bits_to_float` and `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against the gguf-py `dequantize` reference); replace the Q8_0 branches in `ds4_metal_encode_compressor_score_with_ape` and `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant into per-call private buffers retained via `addCompletedHandler`.
* metal/dsv4_kv.metal: add a Q8_0 branch to `kernel_dsv4_compressor_store_one`.
* metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but no longer reached from the compressor paths (its registration in ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:

* make ds4-server clean.
* Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e. no DS4_METAL_MPP_F16_DISABLE workaround): 21-token prompt -> coherent generation ("An LLM, or Large Language Model, is a type of artificial intelligence"). Previously this prompt generated "An LLM, or large language" followed by <BOS> token spam.
* Pre-harmonized variant: still works byte-for-byte the same as before this change; no F16/F32 path regressions.
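For reference, the block_q8_0 walk described above can be sketched in plain C. This is an illustrative standalone version of what the named helpers have to do (the repo's exact signatures may differ): each 34-byte block is an f16 scale, stored as a uint16_t, followed by 32 int8 quants, and each output value is scale * quant. Little-endian storage is assumed, matching GGUF and Apple Silicon.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define Q8_0_BLOCK_BYTES 34  /* 2-byte f16 scale + 32 int8 quants */
#define Q8_0_BLOCK_SIZE  32

/* Decode IEEE binary16 bits to float without relying on _Float16 support. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t e    = (h >> 10) & 0x1Fu;
    uint32_t m    = h & 0x3FFu;
    uint32_t bits;
    if (e == 0) {
        if (m == 0) {
            bits = sign;                        /* signed zero */
        } else {
            e = 113;                            /* normalize subnormal */
            while ((m & 0x400u) == 0) { m <<= 1; e--; }
            m &= 0x3FFu;
            bits = sign | (e << 23) | (m << 13);
        }
    } else if (e == 31) {
        bits = sign | 0x7F800000u | (m << 13); /* inf / NaN */
    } else {
        bits = sign | ((e + 112u) << 23) | (m << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize n_rows x n_cols of Q8_0 data (n_cols % 32 == 0) into f32. */
static void cpu_dequant_q8_0_rows(const uint8_t *src, float *dst,
                                  size_t n_rows, size_t n_cols) {
    size_t blocks_per_row = n_cols / Q8_0_BLOCK_SIZE;
    for (size_t r = 0; r < n_rows; r++) {
        const uint8_t *row = src + r * blocks_per_row * Q8_0_BLOCK_BYTES;
        for (size_t b = 0; b < blocks_per_row; b++) {
            const uint8_t *blk = row + b * Q8_0_BLOCK_BYTES;
            uint16_t sbits;
            memcpy(&sbits, blk, 2);            /* little-endian f16 scale */
            float d = half_bits_to_float(sbits);
            const int8_t *qs = (const int8_t *)(blk + 2);
            for (int i = 0; i < Q8_0_BLOCK_SIZE; i++)
                dst[r * n_cols + b * (size_t)Q8_0_BLOCK_SIZE + i] = d * (float)qs[i];
        }
    }
}
```

The same per-block walk is what the Metal-side Q8_0 branch in `kernel_dsv4_compressor_store_one` performs on the GPU; keeping a byte-exact CPU version makes it easy to verify either path against the gguf-py reference.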
Force-pushed from 68547f8 to b703636
…8_0 fixes) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Forward-compat heads-up while you're rebasing: there's a latent compressor-APE interaction between this PR and Q8_0 ape ingestion (e.g. for stock-recipe GGUFs like cyberneurova's) on M5 Max. The bug only manifests when both your MPP work and the Q8_0 ape branches are present, so there's nothing to fix in this PR in isolation. Just flagging it in case #60 (the stock-recipe loader PR) lands — the fix at
Thanks for flagging this. Currently I'm facing logprob drift in this PR compared to the standard Metal kernels and CPU; it's not big, but I'm trying to lower it as much as possible. I'll then rebase onto main and start testing with a coding harness.

Summary
Env toggles exercised: `DS4_METAL_MPP_DISABLE=1`, `DS4_METAL_MOE_SUM6_DISABLE=1`, staged routed MoE envs.

Benchmarks
Prompt source: README.md; command shape: `./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768`; 3 repeats per target.

8192-token routed MoE stage profile, current M5 default:
Disabling routed MoE MPP on the same profile drops prefill from 442.33 tok/s to 359.17 tok/s and raises gate/up/down to about 18.4-18.7 ms per layer.
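For reproduction, the command shape and env toggles referenced above combine as follows (the prompt path is a placeholder; flags and variable names are as quoted in this description):

```shell
# Default M5 profile (MPP routed-MoE path enabled):
./ds4 --prompt-file prompt.txt -n 1 --nothink --ctx 32768

# Same profile with the MPP / staged routed MoE paths disabled for A/B:
DS4_METAL_MPP_DISABLE=1 DS4_METAL_MOE_SUM6_DISABLE=1 \
  ./ds4 --prompt-file prompt.txt -n 1 --nothink --ctx 32768
```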
Validation
* make ds4 ds4_test
* ./ds4_test --metal-kernels
* ./ds4_test --long-context
* ./ds4_test --logprob-vectors

Notes

* `up` boundaries with `gate=13` were tested but failed long-context; gate/up stay paired at layer 13.
* `DS4_METAL_MOE_MID_F32=1` looked slightly faster in a noisy local check, but the result is too small to promote without a broader clean sweep.