
Add Metal 4 M5 prefill optimizations #15

Draft
ivanfioravanti wants to merge 2 commits into antirez:main from ivanfioravanti:codex/metal4-m5-scaffold

Conversation

@ivanfioravanti

Summary

  • enable M5-class Metal 4 MPP prefill paths for Q8_0 dense matmuls, attention-output low projection, and staged routed MoE projections
  • promote correctness-gated routed MoE boundaries: down from layer 2, gate/up from layer 13
  • add a fused six-expert routed MoE sum kernel for the common top-k=6 prefill shape
  • keep experimental probes and escape hatches for ablation (DS4_METAL_MPP_DISABLE=1, DS4_METAL_MOE_SUM6_DISABLE=1, staged routed MoE envs)
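The fused six-expert sum above is, mathematically, a weighted accumulation of six expert outputs per token. A minimal CPU reference in C of the reduction the kernel fuses (names and signature are illustrative, not the actual ds4/Metal code):

```c
#include <assert.h>
#include <stddef.h>

#define N_TOP_K 6  /* experts per token in the common top-k=6 prefill shape */

/* Reference math for the fused sum kernel: one output row per token,
 * out[i] = sum over the 6 routed experts of w[e] * expert_out[e][i].
 * The PR fuses this into a single Metal kernel; this CPU version only
 * illustrates the reduction it replaces. */
static void moe_sum6_row(const float *const expert_out[N_TOP_K],
                         const float w[N_TOP_K],
                         float *out, size_t dim) {
    for (size_t i = 0; i < dim; i++) {
        float acc = 0.0f;
        for (int e = 0; e < N_TOP_K; e++)
            acc += w[e] * expert_out[e][i];
        out[i] = acc;
    }
}
```

Fusing this into one kernel avoids six separate scaled-add passes over the output row during prefill.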

Benchmarks

Prompt source: README.md; command shape: ./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768; 3 repeats per target.

| target tokens | standard non-M5 avg tok/s | M5 no sum6 avg tok/s | M5 default avg tok/s | M5 speedup |
| ---: | ---: | ---: | ---: | ---: |
| 512 | 236.97 | 390.29 | 389.47 | 1.64x |
| 2048 | 324.90 | 480.07 | 478.36 | 1.47x |
| 4096 | 289.97 | 454.41 | 455.52 | 1.57x |
| 8192 | 287.60 | 440.43 | 442.69 | 1.54x |
| 16384 | 284.26 | 427.62 | 427.78 | 1.50x |

8192-token routed MoE stage profile, current M5 default:

| stage | total ms | mean per layer ms | share |
| --- | ---: | ---: | ---: |
| up | 2443.497 | 11.365 | 35.1% |
| gate | 2422.942 | 11.269 | 34.8% |
| down | 1727.114 | 8.033 | 24.8% |
| activation_weight | 142.693 | 0.664 | 2.1% |
| sum | 137.482 | 0.639 | 2.0% |
| map | 68.601 | 0.319 | 1.0% |
| gate_up | 12.389 | 0.058 | 0.2% |

Disabling routed MoE MPP on the same profile drops prefill from 442.33 tok/s to 359.17 tok/s and raises gate/up/down to about 18.4-18.7 ms per layer.

Validation

  • make ds4 ds4_test
  • ./ds4_test --metal-kernels
  • ./ds4_test --long-context
  • ./ds4_test --logprob-vectors

Notes

  • Earlier up boundaries with gate=13 were tested but failed long-context; gate/up stay paired at layer 13.
  • A paired MPP gate+up matmul prototype compiled and ran but was slower overall, so it was not included.
  • DS4_METAL_MOE_MID_F32=1 looked slightly faster in a noisy local check, but the result is too small to promote without a broader clean sweep.

@ivanfioravanti ivanfioravanti marked this pull request as ready for review May 8, 2026 19:28
@antirez
Owner

antirez commented May 9, 2026

Very significant speedup. I'm in the middle of a large refactoring so I can't merge right now, and there's the problem that I lack the hardware to maintain it later, but I'll see what I can do :) Thanks.

@ivanfioravanti
Author

100% with you, we need to sort this out and let you get an M5 😎

@ivanfioravanti
Author

This should solve #14

@antirez
Owner

antirez commented May 9, 2026

@ivanfioravanti potential idea to get this merged: we keep an m5-metal4 branch active, and you try to keep it rebased, if you like the idea. And we document it.

@ottaviofogliata

ottaviofogliata commented May 9, 2026

@ivanfioravanti, just jumping in :) In a couple of days I'll switch to an M5 Max 128GB. If it could be useful, I'd be more than happy to help you maintain the branch.

@ivanfioravanti
Author

ivanfioravanti commented May 9, 2026

Oh yes @ottaviofogliata join the club! I'm trying to squeeze even more juice with the various optimizations suggested in the Metal Performance Primitives (MPP) Programming Guide without luck so far.

@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 980ba1a to 68547f8 on May 9, 2026 22:31
@ivanfioravanti
Author

I squeezed a little more juice. Tomorrow I'll test with pi mono for some coding sessions, and I'll also collect server-side stats instead of client-side.

[chart: comparison_speed_chart]

@ivanfioravanti ivanfioravanti marked this pull request as draft May 10, 2026 07:21
@ivanfioravanti
Author

Logits are slightly different from the ones produced with --quality. Converting to draft while I investigate: greedy decoding matches exactly, but the distribution behind it differs.

audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
This is a personal fork that combines two open upstream PRs (the
support-q8_0-token-embd loader PR I sent to antirez/ds4, and
ivanfioravanti's PR antirez#15 for M5 Metal 4 prefill optimizations) so I can
run unmodified stock-recipe DeepSeek-V4-Flash GGUFs on M5 hardware
before either PR lands upstream.

The README explains:

  * What the two combined PRs do and why they're combined here.
  * Verified test matrix on M5 Max (antirez recipe, cyberneurova stock
    recipe, cyberneurova pre-harmonized).
  * The known MPP F16 + cyberneurova interaction (workaround:
    DS4_METAL_MPP_F16_DISABLE=1) and why it's separate from the loader
    PR's scope.
  * Build / run instructions for both recipes.
  * Acknowledgements to antirez, ivanfioravanti, ggml/llama.cpp, and
    the cyberneurova research project.
  * Pointer back to the upstream README for the original design and
    server/CLI docs (no duplication).
@ucjonathan

ucjonathan commented May 10, 2026

If someone at Apple knew what was good for their hardware sales, they would have an M5 Studio 256GB and an M5 MacBook Pro 128GB on @antirez's desk Monday afternoon. Unfortunately I don't know anyone at Apple, but hopefully there are some developers at Apple watching this project who will wake up to this opportunity.

@ivanfioravanti
Author

@ucjonathan I was going to propose the same thing!

audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32
router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the
compressor APE path silently produces wrong output and prefill emits
garbage tokens (typically <BOS> spam after a few coherent tokens).

The prefill is correct; the bug is in two compressor APE consumers that
were updated to accept Q8_0 ape_type but couldn't read Q8_0 byte layout
correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd
   for the prefill APE byte-strided dequant): produces silently wrong
   output on M5 Max for the compressor APE shapes (4 rows x 1024 cols).
   Replaced with a CPU-side dequant into a per-call private MTLBuffer.
   The CPU dequant matches gguf-py reference byte-for-byte (verified
   with a standalone numeric check); the Metal kernel did not.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in
   metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into
   the F32 else branch and read garbage.  Add a Q8_0 branch that walks
   block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte
   block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for
each compressor invocation, not the shared g_compressor_store_ape_buffer:
multiple CPU writes to one shared buffer in the same command buffer
collapse to the last write at execute time (Metal kernels run in encode
order, but CPU writes don't participate in that ordering when the same
scratch is reused).  The per-call buffer is retained until cb completion
via addCompletedHandler because Metal does not strongly retain buffers
bound to encoders.

Changes:

  * ds4_metal.m: new `ds4_metal_half_bits_to_float` and
    `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against gguf-py
    `dequantize` reference); replace Q8_0 branches in
    `ds4_metal_encode_compressor_score_with_ape` and
    `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant
    into per-call private buffers retained via addCompletedHandler.
  * metal/dsv4_kv.metal: add a Q8_0 branch to
    `kernel_dsv4_compressor_store_one`.
  * metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but
    no longer reached from the compressor paths (its registration in
    ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:

  * make ds4-server clean.
  * Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e.
    no DS4_METAL_MPP_F16_DISABLE workaround):
    21-token prompt -> coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Previously this prompt generated "An LLM, or large language" then
    <BOS> token spam.
  * Pre-harmonized variant: still works byte-for-byte the same as
    before this change, no F16/F32 path regressions.
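For reference, the block_q8_0 layout the commit describes (a uint16 half-precision scale followed by 32 int8 quants, 34 bytes per block) can be dequantized on the CPU along these lines. This is a hedged sketch under those layout assumptions; the helper names mirror, but are not, the actual ds4_metal.m symbols:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK8_0 32              /* quantized values per block */
#define Q8_0_BLOCK_BYTES 34   /* 2-byte fp16 scale + 32 int8 quants */

/* Convert IEEE-754 half bits to float without a compiler fp16 extension. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t f;
    if (exp == 0x1F) {                        /* inf / NaN */
        f = sign | 0x7F800000u | (mant << 13);
    } else if (exp == 0) {
        if (mant == 0) {                      /* signed zero */
            f = sign;
        } else {                              /* subnormal half: renormalize */
            exp = 127 - 15 + 1;
            while (!(mant & 0x400)) { mant <<= 1; exp--; }
            f = sign | (exp << 23) | ((mant & 0x3FF) << 13);
        }
    } else {                                  /* normal */
        f = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float out;
    memcpy(&out, &f, sizeof out);
    return out;
}

/* Walk the packed byte layout directly (memcpy avoids alignment traps),
 * producing x[i] = scale * qs[i] for each of the 32 quants per block. */
static void dequant_q8_0(const uint8_t *src, float *dst, int n_blocks) {
    for (int b = 0; b < n_blocks; b++) {
        const uint8_t *p = src + (size_t)b * Q8_0_BLOCK_BYTES;
        uint16_t d_bits;
        memcpy(&d_bits, p, sizeof d_bits);
        float scale = half_bits_to_float(d_bits);
        const int8_t *qs = (const int8_t *)(p + 2);
        for (int i = 0; i < QK8_0; i++)
            dst[b * QK8_0 + i] = scale * (float)qs[i];
    }
}
```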
@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 68547f8 to b703636 on May 10, 2026 14:54
audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
…8_0 fixes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@audreyt

audreyt commented May 10, 2026

Forward-compat heads-up while you're rebasing: there's a latent compressor-APE interaction between this PR and Q8_0 ape ingestion (e.g. for stock-recipe GGUFs like cyberneurova's). On M5 Max:

  1. kernel_dsv4_compressor_store_one (in metal/dsv4_kv.metal, upstream) only handles F16/F32 ape — Q8_0 falls through to the F32 branch and reads garbage.
  2. A naive Q8_0 GPU dequant kernel for the prefill APE byte-strided path produces silently wrong output on M5 Max for compressor APE shapes (4 rows × 1024 cols). CPU-side dequant into a per-call private MTLBuffer is what works — a shared scratch loses writes when the same buffer is reused inside one command buffer (Metal kernels run in encode order; CPU writes don't participate in that ordering when the scratch is reused).

The bug only manifests when both your MPP work and Q8_0 ape branches are present, so nothing to fix in this PR in isolation. Just flagging in case #60 (the stock-recipe loader PR) lands — the fix at audreyt/ds4 m5-support-q8_0-token-embd (commit 79b08bb) would need to land alongside it. Happy to coordinate.

@ivanfioravanti
Author

Thanks for flagging this. Currently I'm facing logprob drift in this PR compared to the standard Metal kernels and the CPU path; it's not big, but I'm trying to lower it as much as possible. I'll then rebase onto main and start testing with a coding harness.
This model rocks, and so does this engine!
