
Add Metal 4 M5 prefill optimizations #15

Draft
ivanfioravanti wants to merge 2 commits into antirez:main from ivanfioravanti:codex/metal4-m5-scaffold

Conversation

@ivanfioravanti

Summary

  • enable M5-class Metal 4 MPP prefill paths for Q8_0 dense matmuls, attention-output low projection, and staged routed MoE projections
  • promote correctness-gated routed MoE boundaries: down from layer 2, gate/up from layer 13
  • add a fused six-expert routed MoE sum kernel for the common top-k=6 prefill shape
  • keep experimental probes and escape hatches for ablation (DS4_METAL_MPP_DISABLE=1, DS4_METAL_MOE_SUM6_DISABLE=1, staged routed MoE envs)
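The fused six-expert sum above is, mathematically, a weighted accumulation of six expert outputs per token. A minimal CPU reference in C of the reduction the kernel fuses (names and signature are illustrative, not the actual ds4/Metal code):

```c
#include <assert.h>
#include <stddef.h>

#define N_TOP_K 6  /* experts per token in the common top-k=6 prefill shape */

/* Reference math for the fused sum kernel: one output row per token,
 * out[i] = sum over the 6 routed experts of w[e] * expert_out[e][i].
 * The PR fuses this into a single Metal kernel; this CPU version only
 * illustrates the reduction it replaces. */
static void moe_sum6_row(const float *const expert_out[N_TOP_K],
                         const float w[N_TOP_K],
                         float *out, size_t dim) {
    for (size_t i = 0; i < dim; i++) {
        float acc = 0.0f;
        for (int e = 0; e < N_TOP_K; e++)
            acc += w[e] * expert_out[e][i];
        out[i] = acc;
    }
}
```

Fusing this into one kernel avoids six separate scaled-add passes over the output row during prefill.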

Benchmarks

Prompt source: README.md; command shape: ./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768; 3 repeats per target.

| target tokens | standard non-M5 avg tok/s | M5 no sum6 avg tok/s | M5 default avg tok/s | M5 speedup |
| ---: | ---: | ---: | ---: | ---: |
| 512 | 236.97 | 390.29 | 389.47 | 1.64x |
| 2048 | 324.90 | 480.07 | 478.36 | 1.47x |
| 4096 | 289.97 | 454.41 | 455.52 | 1.57x |
| 8192 | 287.60 | 440.43 | 442.69 | 1.54x |
| 16384 | 284.26 | 427.62 | 427.78 | 1.50x |

8192-token routed MoE stage profile, current M5 default:

| stage | total ms | mean per layer ms | share |
| --- | ---: | ---: | ---: |
| up | 2443.497 | 11.365 | 35.1% |
| gate | 2422.942 | 11.269 | 34.8% |
| down | 1727.114 | 8.033 | 24.8% |
| activation_weight | 142.693 | 0.664 | 2.1% |
| sum | 137.482 | 0.639 | 2.0% |
| map | 68.601 | 0.319 | 1.0% |
| gate_up | 12.389 | 0.058 | 0.2% |

Disabling routed MoE MPP on the same profile drops prefill from 442.33 tok/s to 359.17 tok/s and raises gate/up/down to about 18.4-18.7 ms per layer.

Validation

  • make ds4 ds4_test
  • ./ds4_test --metal-kernels
  • ./ds4_test --long-context
  • ./ds4_test --logprob-vectors

Notes

  • Earlier up boundaries with gate=13 were tested but failed long-context; gate/up stay paired at layer 13.
  • A paired MPP gate+up matmul prototype compiled and ran but was slower overall, so it was not included.
  • DS4_METAL_MOE_MID_F32=1 looked slightly faster in a noisy local check, but the result is too small to promote without a broader clean sweep.

@ivanfioravanti ivanfioravanti marked this pull request as ready for review May 8, 2026 19:28
@antirez
Owner

antirez commented May 9, 2026

Very significant speedup. I'm in the middle of a large refactoring so I can't merge right now, and there's the problem that I lack the hardware to maintain it later, but I'll see what I can do :) Thanks.

@ivanfioravanti
Author

100% with you, we need to sort this out and let you get an M5 😎

@ivanfioravanti
Author

This should solve #14

@antirez
Owner

antirez commented May 9, 2026

@ivanfioravanti potential idea to get this merged: we keep an m5-metal4 branch active, and you try to keep it rebased, if you like the idea. And we document it.

@ottaviofogliata

ottaviofogliata commented May 9, 2026

@ivanfioravanti, just jumping in :) In a couple of days I'll switch to an M5 Max 128GB. If it could be useful, I'd be more than happy to help you maintain the branch.

@ivanfioravanti
Author

ivanfioravanti commented May 9, 2026

Oh yes @ottaviofogliata join the club! I'm trying to squeeze even more juice with the various optimizations suggested in the Metal Performance Primitives (MPP) Programming Guide without luck so far.

@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 980ba1a to 68547f8 on May 9, 2026 22:31
@ivanfioravanti
Author

I squeezed a little more juice. Tomorrow I'll test with pi mono for some coding sessions, and I'll also collect server-side stats instead of client-side.

[chart: comparison_speed_chart]

@ivanfioravanti ivanfioravanti marked this pull request as draft May 10, 2026 07:21
@ivanfioravanti
Author

Logits are slightly different from the ones produced with --quality. Converting to draft while I investigate: greedy decoding matches exactly, but the distribution behind it differs.

audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
This is a personal fork that combines two open upstream PRs (the
support-q8_0-token-embd loader PR I sent to antirez/ds4, and
ivanfioravanti's PR antirez#15 for M5 Metal 4 prefill optimizations) so I can
run unmodified stock-recipe DeepSeek-V4-Flash GGUFs on M5 hardware
before either PR lands upstream.

The README explains:

  * What the two combined PRs do and why they're combined here.
  * Verified test matrix on M5 Max (antirez recipe, cyberneurova stock
    recipe, cyberneurova pre-harmonized).
  * The known MPP F16 + cyberneurova interaction (workaround:
    DS4_METAL_MPP_F16_DISABLE=1) and why it's separate from the loader
    PR's scope.
  * Build / run instructions for both recipes.
  * Acknowledgements to antirez, ivanfioravanti, ggml/llama.cpp, and
    the cyberneurova research project.
  * Pointer back to the upstream README for the original design and
    server/CLI docs (no duplication).
@ucjonathan

ucjonathan commented May 10, 2026

If someone at Apple knew what was good for their hardware sales, they would have an M5 Studio 256GB and an M5 MacBook Pro 128GB on @antirez's desk Monday afternoon. Unfortunately I don't know anyone at Apple, but hopefully there are some developers at Apple watching this project who will wake up to this opportunity.

@ivanfioravanti
Author

@ucjonathan I was going to propose the same thing!

audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32
router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the
compressor APE path silently produces wrong output and prefill emits
garbage tokens (typically <BOS> spam after a few coherent tokens).

The prefill is correct; the bug is in two compressor APE consumers that
were updated to accept Q8_0 ape_type but couldn't read Q8_0 byte layout
correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd
   for the prefill APE byte-strided dequant): produces silently wrong
   output on M5 Max for the compressor APE shapes (4 rows x 1024 cols).
   Replaced with a CPU-side dequant into a per-call private MTLBuffer.
   The CPU dequant matches gguf-py reference byte-for-byte (verified
   with a standalone numeric check); the Metal kernel did not.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in
   metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into
   the F32 else branch and read garbage.  Add a Q8_0 branch that walks
   block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte
   block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for
each compressor invocation, not the shared g_compressor_store_ape_buffer:
multiple CPU writes to one shared buffer in the same command buffer
collapse to the last write at execute time (Metal kernels run in encode
order, but CPU writes don't participate in that ordering when the same
scratch is reused).  The per-call buffer is retained until cb completion
via addCompletedHandler because Metal does not strongly retain buffers
bound to encoders.

Changes:

  * ds4_metal.m: new `ds4_metal_half_bits_to_float` and
    `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against gguf-py
    `dequantize` reference); replace Q8_0 branches in
    `ds4_metal_encode_compressor_score_with_ape` and
    `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant
    into per-call private buffers retained via addCompletedHandler.
  * metal/dsv4_kv.metal: add a Q8_0 branch to
    `kernel_dsv4_compressor_store_one`.
  * metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but
    no longer reached from the compressor paths (its registration in
    ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:

  * make ds4-server clean.
  * Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e.
    no DS4_METAL_MPP_F16_DISABLE workaround):
    21-token prompt -> coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Previously this prompt generated "An LLM, or large language" then
    <BOS> token spam.
  * Pre-harmonized variant: still works byte-for-byte the same as
    before this change, no F16/F32 path regressions.
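For reference, the block_q8_0 layout the commit describes (a uint16 half-precision scale followed by 32 int8 quants, 34 bytes per block) can be dequantized on the CPU along these lines. This is a hedged sketch under those layout assumptions; the helper names mirror, but are not, the actual ds4_metal.m symbols:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK8_0 32              /* quantized values per block */
#define Q8_0_BLOCK_BYTES 34   /* 2-byte fp16 scale + 32 int8 quants */

/* Convert IEEE-754 half bits to float without a compiler fp16 extension. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t f;
    if (exp == 0x1F) {                        /* inf / NaN */
        f = sign | 0x7F800000u | (mant << 13);
    } else if (exp == 0) {
        if (mant == 0) {                      /* signed zero */
            f = sign;
        } else {                              /* subnormal half: renormalize */
            exp = 127 - 15 + 1;
            while (!(mant & 0x400)) { mant <<= 1; exp--; }
            f = sign | (exp << 23) | ((mant & 0x3FF) << 13);
        }
    } else {                                  /* normal */
        f = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float out;
    memcpy(&out, &f, sizeof out);
    return out;
}

/* Walk the packed byte layout directly (memcpy avoids alignment traps),
 * producing x[i] = scale * qs[i] for each of the 32 quants per block. */
static void dequant_q8_0(const uint8_t *src, float *dst, int n_blocks) {
    for (int b = 0; b < n_blocks; b++) {
        const uint8_t *p = src + (size_t)b * Q8_0_BLOCK_BYTES;
        uint16_t d_bits;
        memcpy(&d_bits, p, sizeof d_bits);
        float scale = half_bits_to_float(d_bits);
        const int8_t *qs = (const int8_t *)(p + 2);
        for (int i = 0; i < QK8_0; i++)
            dst[b * QK8_0 + i] = scale * (float)qs[i];
    }
}
```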
@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 68547f8 to b703636 on May 10, 2026 14:54
audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
…8_0 fixes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@audreyt

audreyt commented May 10, 2026

Forward-compat heads-up while you're rebasing: there's a latent compressor-APE interaction between this PR and Q8_0 ape ingestion (e.g. for stock-recipe GGUFs like cyberneurova's). On M5 Max:

  1. kernel_dsv4_compressor_store_one (in metal/dsv4_kv.metal, upstream) only handles F16/F32 ape — Q8_0 falls through to the F32 branch and reads garbage.
  2. A naive Q8_0 GPU dequant kernel for the prefill APE byte-strided path produces silently wrong output on M5 Max for compressor APE shapes (4 rows × 1024 cols). CPU-side dequant into a per-call private MTLBuffer is what works — a shared scratch loses writes when the same buffer is reused inside one command buffer (Metal kernels run in encode order; CPU writes don't participate in that ordering when the scratch is reused).

The bug only manifests when both your MPP work and Q8_0 ape branches are present, so nothing to fix in this PR in isolation. Just flagging in case #60 (the stock-recipe loader PR) lands — the fix at audreyt/ds4 m5-support-q8_0-token-embd (commit 79b08bb) would need to land alongside it. Happy to coordinate.

@ivanfioravanti
Author

Thanks for flagging this. Currently I'm facing logprob drift in this PR compared to the standard Metal kernels and the CPU path; it's not big, but I'm trying to lower it as much as possible. I'll then rebase onto main and start testing with a coding harness.
This model rocks, and so does this engine!
