feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal #60
audreyt wants to merge 6 commits into antirez:main
Conversation
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main these fail loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.
This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.
Validators (ds4.c):
* New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
and is applied to every projection that flows through a
type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
  indexer_compressor_{ape,gate,kv}, ffn_gate_inp (a sketch of the
  helper follows this list).
* token_embd keeps its own inline F16/Q8_0 check because its CPU
embed kernel doesn't go through matvec_any.
* Two compressor decode-time guards (attn_compressor and
indexer_compressor pair-projection paths) relaxed from "F16 only"
to "F16 or Q8_0, paired type must match".
CPU paths (ds4.c):
* Refactor embed_token_f16 into an embed_token dispatcher; add
  embed_token_q8_0 (block-wise dequant of block_q8_0; sketched after
  this list).
* Replace the remaining direct matvec_f16 / matvec_f16_serial
callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
matvec_any dispatcher; add matvec_any_serial for the HC pre/post
path.
* Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the
pair fuses with the existing F16-pair kernel when both tensors are
F16, otherwise dispatches to two single matmuls). All 22 hardcoded
ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c (HC mix,
attn/indexer compressors, indexer projections, output head, router)
converted to use these wrappers.
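For the embed path, the Q8_0 dequant walks 34-byte block_q8_0 blocks per
row. A minimal sketch, assuming my own layout constants and signatures (only
the helper names come from this PR; the f16-bit conversion is sketched later
in this thread):

```c
#include <stdint.h>
#include <string.h>

#define QK8_0            32   /* elements per block_q8_0 block             */
#define BLOCK_Q8_0_BYTES 34   /* 2-byte f16 scale + 32 int8 quantized vals */

float half_bits_to_float(uint16_t h);  /* f16 bit pattern -> float */

/* Dequantize one embedding row (the token's row) from block_q8_0 weights.
 * Assumes n_embd is a multiple of 32, which holds for these models. */
static void embed_token_q8_0(const uint8_t *weight, int64_t token,
                             int n_embd, float *dst) {
    const int nblk = n_embd / QK8_0;
    const uint8_t *row = weight + (size_t)token * nblk * BLOCK_Q8_0_BYTES;
    for (int b = 0; b < nblk; b++) {
        const uint8_t *blk = row + (size_t)b * BLOCK_Q8_0_BYTES;
        uint16_t sbits;
        memcpy(&sbits, blk, sizeof sbits);           /* per-block scale */
        const float d = half_bits_to_float(sbits);
        const int8_t *q = (const int8_t *)(blk + 2); /* 32 int8 quants  */
        for (int i = 0; i < QK8_0; i++)
            dst[b * QK8_0 + i] = d * (float)q[i];
    }
}
```

The embed_token dispatcher then only has to switch on the GGUF type code
(1 = F16, 8 = Q8_0) and call the matching row kernel.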
Metal kernels:
* metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
  dequantizes its source block on the fly; a C model of the per-thread
  math follows this list).
* metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
the multi-token F32 weight matmul that the F32 router path needs in
prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
* metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
the compressor APE byte-strided reader).
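The real kernels are Metal source; as a rough C model of the per-thread
arithmetic in kernel_get_rows_q8_0 (reusing the block constants and f16
helper from the embed sketch above, with hypothetical parameter names), each
thread owns one (row, col) output element and dequantizes exactly one value
from its source block:

```c
/* One (row, col) output element of get_rows: pick the source row via the
 * token id, locate the 34-byte block that holds `col`, dequantize one value. */
static float get_rows_q8_0_element(const uint8_t *weight,  /* block_q8_0 data */
                                   const int32_t *row_ids, /* token ids       */
                                   int n_cols, int row, int col) {
    const int nblk = n_cols / QK8_0;
    const uint8_t *blk = weight
        + (size_t)row_ids[row] * nblk * BLOCK_Q8_0_BYTES  /* source row   */
        + (size_t)(col / QK8_0) * BLOCK_Q8_0_BYTES;       /* source block */
    uint16_t sbits;
    memcpy(&sbits, blk, sizeof sbits);
    const int8_t q = ((const int8_t *)(blk + 2))[col % QK8_0];
    return half_bits_to_float(sbits) * (float)q;
}
```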
Metal wiring (ds4_metal.m):
* Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
init; clear them at cleanup.
* Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
ds4_metal_encode_get_rows helper take a new weight_type parameter
(GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
weights->token_embd->type unchanged. ds4_metal_embed_row_layout
picks the right per-row stride and pipeline.
* ds4_metal_matmul_f32_tensor extended with a multi-token branch
that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
n_tok = 1 path unchanged.
* ds4_metal_encode_compressor_score_with_ape and the equivalent loop
in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
(ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
accounts for the block_q8_0 layout.
* Six ape_type validators relaxed to also accept 8 (Q8_0).
* Six ape_bytes calculations centralized through a new
ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
  correct stride for F16/F32/Q8_0 (sketched after this list).
* metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
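The helper name and arguments come from this commit; the body below is an
assumed sketch of the stride math. The point is that Q8_0 packs 32 elements
into 34-byte blocks, so byte counts stop being a flat multiply:

```c
#include <stddef.h>

/* Bytes occupied by n_elems contiguous elements of a given GGUF type code.
 * For Q8_0, n_elems is a whole number of 32-element blocks in these tensors. */
static size_t ds4_metal_ape_bytes(int ape_type, size_t n_elems) {
    switch (ape_type) {
    case 0:  return n_elems * 4;          /* F32                         */
    case 1:  return n_elems * 2;          /* F16                         */
    case 8:  return (n_elems / 32) * 34;  /* Q8_0: f16 scale + 32 int8   */
    default: return 0;                    /* rejected by the validators  */
    }
}
```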
Tested on macOS / M-series / Metal:
* make ds4-server clean (no new warnings).
* Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
decode through to coherent generation ("PASS" returned for the
"reply with the single word PASS" prompt).
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
F16, ffn_gate_inp F16): still works byte-for-byte the same as
before this change, no F16 path regressions.
Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.
1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
byte-strided dequant) compiles cleanly and follows the same
block_q8_0 indexing pattern used by other working Q8_0 kernels in
dense.metal, but emits silently wrong values for the actual ape
shapes (4 rows x 1024 cols of block_q8_0). Confirmed by isolating
the kernel: a CPU-side dequant of the same byte region matches
gguf-py's `dequantize` reference byte-for-byte, while the Metal
kernel's output is wrong.
2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
fell through to F32 for everything else, so Q8_0 ape was reading
garbage at decode time.
Fix:
* Replace the prefill APE Q8_0 path in
`ds4_metal_encode_compressor_score_with_ape` and
`ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
  via two new helpers (`ds4_metal_half_bits_to_float` and
  `ds4_metal_cpu_dequant_q8_0_rows`; the bit-level half conversion is
  sketched after this list) into a *per-call* private
MTLBuffer. A per-call buffer is required because multiple CPU writes
to the previously-shared `g_compressor_store_ape_buffer` within one
command buffer collapse to the last write at execute time (Metal
kernels run in encode order, but CPU writes don't participate in that
ordering when the same scratch is reused). The per-call buffer is
retained until cb completion via `addCompletedHandler` because Metal
does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
inline.
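For reference, a self-contained sketch of the f16-bits-to-float conversion a
helper like `ds4_metal_half_bits_to_float` needs; the name is from this
commit, the body is my assumption of a standard implementation. block_q8_0
stores its per-block scale as raw IEEE-754 half bits, so the CPU dequant has
to widen them manually; the row dequant then just loops this over 34-byte
blocks, as in the embed sketch earlier in this thread:

```c
#include <stdint.h>
#include <string.h>

static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    int      exp  = (h >> 10) & 0x1F;   /* 5-bit exponent, bias 15 */
    uint32_t man  = h & 0x3FFu;         /* 10-bit mantissa         */
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                                  /* signed zero    */
        } else {                                          /* subnormal half */
            exp = 1;
            while (!(man & 0x400u)) { man <<= 1; exp--; } /* renormalize    */
            man &= 0x3FFu;
            bits = sign | (uint32_t)(exp + 112) << 23 | man << 13;
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | man << 13;            /* inf / NaN      */
    } else {
        bits = sign | (uint32_t)(exp + 112) << 23 | man << 13; /* rebias    */
    }
    float f;
    memcpy(&f, &bits, sizeof f);        /* bit-cast, no value conversion */
    return f;
}
```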
The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.
Tested on macOS / M-series / Metal:
* make ds4-server clean (one pre-existing -Wpointer-sign warning from
the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
21-token prompt -> coherent generation
("An LLM, or Large Language Model, is a type of artificial intelligence").
Previously this prompt generated a few coherent tokens then <BOS>
token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
F16): still works byte-for-byte the same as before this fix; no F16
/ F32 path regressions.
I am trying your PR with the cyberneurova Q2_K model. A simple "Hello" works:

```
$ ./ds4 --ctx 100000 -m ./gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf -p "Hello"
ds4: context buffers 2842.64 MiB (ctx=100000, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=25002)
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.490 ms, residency requested in 477.369 ms, warmup 3.854 ms (mapped 94228.38 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: Metal backend initialized for graph diagnostics
We need to respond to the user's greeting. The user said "Hello". As a helpful assistant, I should respond politely and ask how I can assist.
Hello! How can I help you today?
ds4: prefill: 41.37 t/s, generation: 38.94 t/s
```

But a longer prompt produces an error like this, after about 2 lines of thinking tokens output: […]

FWIW I think this is the smoking gun: […]
The decode-time indexer code at `metal_graph_encode_decode_layer` (ds4.c:9082-9095) still has two F16-only validators on `indexer_attn_q_b` and `indexer_proj` that I missed in the initial loader pass. These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e. once the compressor has accumulated more rows than the decode-time top-k. For short generations the path isn't reached; for ~400+ token generations on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes with finish_reason="error" / "Metal decode failed".

The downstream calls already use `metal_graph_matmul_plain_tensor` (which dispatches to `ds4_metal_matmul_q8_0_tensor` for Q8_0). The loader-time validator at line 2211-2212 already uses `tensor_expect_dispatch_layout`, which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story" prompt that generates ~800 tokens hits the validator after ~400 tokens and the request errors out. After this fix, the same prompt streams 800+ tokens cleanly.
Fixed in c2144e5!
Great! Many thanks for the quick fix! I can confirm that long prompts now work with the above-mentioned model file and no longer produce an error. There is a small, likely cosmetic, warning while compiling:

```
$ make clean && make
rm -f ds4 ds4-server ds4_native ds4_server_test ds4_test *.o
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4_cli.o ds4_cli.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o linenoise.o linenoise.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4.o ds4.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -fobjc-arc -c -o ds4_metal.o ds4_metal.m
ds4_metal.m:8801:12: warning: unused function 'ds4_metal_encode_cpy_q8_0_f32_1d' [-Wunused-function]
 8801 | static int ds4_metal_encode_cpy_q8_0_f32_1d(
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -o ds4 ds4_cli.o linenoise.o ds4.o ds4_metal.o -lm -pthread -framework Foundation -framework Metal
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4_server.o ds4_server.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o rax.o rax.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -o ds4-server ds4_server.o rax.o ds4.o ds4_metal.o -lm -pthread -framework Foundation -framework Metal
```

Also this PR currently is 4 commits behind `main`. Update: I checked this PR rebased against current `main`.
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb (switched to CPU-side dequant to avoid an encode-time race on the shared compressor scratch buffer), leaving the function unused and tripping -Wunused-function on stock Make builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixed and synced from main. Ready for review from @antirez.
What this changes

DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter without
per-tensor type overrides ship most of the small projections at Q8_0 (and the
routed-expert router at F32) where the antirez recipe keeps them at F16.
Examples: the cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF
models. On stock `ds4` main these fail at the first F16-strict validator
(`token_embd`, then `output_hc_fn`, then `hc_attn_fn`, …), and even after the
validators are relaxed, several Metal kernel paths read weight bytes directly
via offset arithmetic that hard-codes F16/F32 strides and produce silently
wrong output for Q8_0.

This PR makes the embed / HC / compressor / indexer / router validators and
the corresponding Metal kernel paths polymorphic, so the same GGUF loads and
runs on Metal end-to-end on audreyt/pi-ds4.
Validators (`ds4.c`)

- New `tensor_expect_dispatch_layout` helper accepts F16, F32, or Q8_0 and is
  applied to every projection that flows through a type-dispatching
  matvec/matmul: `output_hc_fn`, `hc_attn_fn`, `hc_ffn_fn`,
  `attn_compressor_{ape,gate,kv}`, `indexer.{attn_q_b,proj}`,
  `indexer_compressor_{ape,gate,kv}`, `ffn_gate_inp`.
- `token_embd` keeps its own inline F16/Q8_0 check because its CPU embed
  kernel doesn't go through `matvec_any`.
- Two compressor decode-time guards (`attn_compressor` and
  `indexer_compressor` pair-projection paths) relaxed from "F16 only" to
  "F16 or Q8_0, paired type must match".
CPU paths (`ds4.c`)

- Refactor `embed_token_f16` into an `embed_token` dispatcher; add
  `embed_token_q8_0` (block-wise dequant of `block_q8_0`).
- Replace the remaining direct `matvec_f16` / `matvec_f16_serial` callers
  (HC fn, `output_hc_fn`, `ffn_gate_inp`) with the existing `matvec_any`
  dispatcher; add `matvec_any_serial` for the HC pre/post path.
- Polymorphic Metal-side dispatch helpers `metal_graph_matmul_plain_tensor`
  and `metal_graph_matmul_pair_plain_tensor` (extended for Q8_0; the pair
  fuses with the existing F16-pair kernel when both tensors are F16,
  otherwise dispatches to two single matmuls; see the sketch after this
  list). All 22 hardcoded `ds4_metal_matmul_f16{,_pair}_tensor` call sites in
  `ds4.c` (HC mix, attn/indexer compressors, indexer projections, output
  head, router) converted to use these wrappers.
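The pair wrapper's dispatch rule, in sketch form. The wrapper and kernel names
are from this PR; the stand-in types, signatures, and return conventions below
are illustrative assumptions:

```c
/* Stand-ins for illustration only; the real wrappers take ds4's own types. */
typedef struct { int type; } tensor_t;  /* GGUF type code: 1 = F16         */
typedef struct graph_s graph_t;         /* opaque encode context           */
typedef struct buf_s   buf_t;           /* opaque device buffer            */

int ds4_metal_matmul_f16_pair_tensor(graph_t *g, const tensor_t *a,
    const tensor_t *b, const buf_t *x, buf_t *oa, buf_t *ob, int n_tok);
int metal_graph_matmul_plain_tensor(graph_t *g, const tensor_t *w,
    const buf_t *x, buf_t *out, int n_tok);

/* Fuse only when both weights are F16; otherwise fall back to two single
 * type-dispatched matmuls so Q8_0/F32 (and mixed) pairs work unchanged. */
int metal_graph_matmul_pair_plain_tensor(graph_t *g,
    const tensor_t *a, const tensor_t *b,
    const buf_t *x, buf_t *oa, buf_t *ob, int n_tok) {
    if (a->type == 1 && b->type == 1)   /* F16 + F16: keep the fused kernel */
        return ds4_metal_matmul_f16_pair_tensor(g, a, b, x, oa, ob, n_tok);
    if (metal_graph_matmul_plain_tensor(g, a, x, oa, n_tok) != 0) return -1;
    return metal_graph_matmul_plain_tensor(g, b, x, ob, n_tok);
}
```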
Metal kernels

- `metal/get_rows.metal`: `kernel_get_rows_q8_0` (one float per thread,
  dequantizes its source block on the fly).
- `metal/dense.metal`: `kernel_mul_mm_f32_f32` template instantiation for the
  multi-token F32 weight matmul that the F32 router path needs in prefill
  (mirrors the existing F16/Q8_0 `mul_mm_t` instantiations).
- `metal/dsv4_kv.metal`: a Q8_0 branch added to
  `kernel_dsv4_compressor_store_one`. Without this, the decode-time
  single-row compressor store treats Q8_0 ape as F32 and reads garbage.
Metal wiring (`ds4_metal.m`)

- Register `g_get_rows_q8_0_pipeline` at init; clear at cleanup.
- Both `ds4_metal_embed_{token,tokens}_hc_tensor` and the shared
  `ds4_metal_encode_get_rows` helper take a new `weight_type` parameter
  (GGUF type code: 1=F16, 8=Q8_0). 8 callers in `ds4.c` forward
  `weights->token_embd->type` unchanged. `ds4_metal_embed_row_layout` picks
  the right per-row stride and pipeline.
- `ds4_metal_matmul_f32_tensor` extended with a multi-token branch that
  dispatches to `kernel_mul_mm_f32_f32` (`n_tok > 1`); existing `n_tok = 1`
  path unchanged.
- `ds4_metal_encode_compressor_score_with_ape` and the equivalent loop in
  `ds4_metal_compressor_store_batch_tensor`: for Q8_0 ape, dequantize on the
  CPU into a per-call private `MTLBuffer` and feed that into the existing
  `add_f32_1d`. Two new helpers (`ds4_metal_half_bits_to_float`,
  `ds4_metal_cpu_dequant_q8_0_rows`) implement the conversion; the CPU
  dequant matches gguf-py's `dequantize` reference byte-for-byte (verified
  in a standalone numeric check). A per-call buffer is required because
  multiple CPU writes to the previously-shared
  `g_compressor_store_ape_buffer` within one command buffer collapse to the
  last write at execute time (Metal kernels run in encode order, but CPU
  writes don't participate in that ordering when the same scratch is
  reused). The per-call buffer is retained until command-buffer completion
  via `addCompletedHandler` because Metal does not strongly retain buffers
  bound to encoders.
- Six `ape_type` validators relaxed to also accept `8` (Q8_0).
- Six `ape_bytes` calculations centralized through a new
  `ds4_metal_ape_bytes(ape_type, n_elems)` helper that returns the correct
  stride for F16 / F32 / Q8_0.
- `metal_graph_matmul_plain_tensor` extended with a Q8_0 branch.

Why CPU dequant for Q8_0 ape (and not a Metal kernel)
I first wrote a `kernel_cpy_q8_0_f32` Metal kernel using the same
`block_q8_0` indexing pattern that the working dense Q8_0 matvec/matmul
kernels in `metal/dense.metal` use. It compiled cleanly but produced
silently wrong values for the actual compressor APE shapes (4 rows × 1024
cols of `block_q8_0`). I confirmed this with a side-by-side numeric check
against gguf-py's `dequantize` reference: my CPU dequant matches
byte-for-byte; the Metal kernel does not. I left `kernel_cpy_q8_0_f32` in
`metal/cpy.metal` (its registration in `ds4_metal.m` is harmless) so that a
future debug session can pick it up; the compressor paths use the CPU
dequant as the active route.
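A cheap way to gain confidence in the bit-level half conversion without
reaching for Python, as a hypothetical stand-alone harness (this is not the
gguf-py check described above; it exhaustively compares the
`half_bits_to_float` sketch from earlier in this thread against clang's
native `_Float16` widening, available on Apple arm64; compile it without
`-ffast-math` so the NaN comparisons behave):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

float half_bits_to_float(uint16_t h);  /* the sketch from earlier */

int main(void) {
    for (uint32_t u = 0; u < 0x10000u; u++) {
        uint16_t bits = (uint16_t)u;
        _Float16 h;
        memcpy(&h, &bits, sizeof h);
        float want = (float)h;               /* compiler's conversion */
        float got  = half_bits_to_float(bits);
        int both_nan = (want != want) && (got != got);
        if (!both_nan && want != got) {
            fprintf(stderr, "mismatch at 0x%04x: %a vs %a\n", u, want, got);
            return 1;
        }
    }
    puts("all 65536 half bit patterns match");
    return 0;
}
```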
What this PR does not cover

- The CPU MoE expert path (`ds4.c:5198, 5291`) still hardcodes
  `if (gate_w->type != IQ2_XXS) ds4_die(...)`. That path is reference/debug
  per `AGENT.md` and the production Metal flow doesn't touch it. If
  something forces CPU fallback (Metal disabled, MTP without Metal, certain
  trace modes) on a stock-recipe Q8_0 GGUF you'll see "expected IQ2_XXS
  expert tensors" and need a Q8_0 dispatch added there too. Out of scope
  for this PR; the production Metal flow is fine.
- No conversion or harmonizer step is added: this is a loader/dispatcher
  change so existing GGUFs that happen to use the stock recipe become
  loadable.
Test matrix (macOS / M5/ Metal)
make ds4-serverclean (one pre-existing-Wpointer-signwarning fromthe unrelated MoE path, not introduced by this PR).
21-token prompt → coherent generation
("An LLM, or Large Language Model, is a type of artificial intelligence").
Without the compressor APE fix, this prompt generated a few coherent
tokens then
<BOS>token spam.token_embd/ HC / compressor / indexer allF16): still works byte-for-byte the same as before, no F16/F32 path
regressions.
make ds4-serverbuild clean across both branches.