
feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal#60

Open
audreyt wants to merge 6 commits into antirez:main from
audreyt:support-q8_0-token-embd

Conversation


@audreyt audreyt commented May 10, 2026

What this changes

DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter without
per-tensor type overrides ship most of the small projections at Q8_0 (and the
routed-expert router at F32) where the antirez recipe keeps them at F16.

Examples: the cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF
models. On stock ds4 main these fail to load at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, …), and even
after the validators are relaxed, several Metal kernel paths read weight bytes
directly via offset arithmetic that hard-codes F16/F32 strides and produce
silently wrong output for Q8_0.

This PR makes the embed / HC / compressor / indexer / router validators and
the corresponding Metal kernel paths polymorphic, so the same GGUF loads and
runs on Metal end-to-end on audreyt/pi-ds4.

Validators (ds4.c)

  • New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0 and is
    applied to every projection that flows through a type-dispatching
    matvec/matmul: output_hc_fn, hc_attn_fn, hc_ffn_fn,
    attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
    indexer_compressor_{ape,gate,kv}, ffn_gate_inp.
  • token_embd keeps its own inline F16/Q8_0 check because its CPU embed
    kernel doesn't go through matvec_any.
  • The two compressor decode-time guards (attn_compressor and
    indexer_compressor pair-projection paths) relaxed from "F16 only" to
    "F16 or Q8_0, paired type must match".

CPU paths (ds4.c)

  • Refactor embed_token_f16 into an embed_token dispatcher; add
    embed_token_q8_0 (block-wise dequant of block_q8_0).
  • Replace the remaining direct matvec_f16 / matvec_f16_serial callers
    (HC fn, output_hc_fn, ffn_gate_inp) with the existing matvec_any
    dispatcher; add matvec_any_serial for the HC pre/post path.
  • Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
    and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the pair
    fuses with the existing F16-pair kernel when both tensors are F16,
    otherwise dispatches to two single matmuls). All 22 hardcoded
    ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c (HC mix,
    attn/indexer compressors, indexer projections, output head, router)
    converted to use these wrappers.
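For reference, a block-wise Q8_0 dequant in the spirit of embed_token_q8_0 can be sketched as below. The 34-byte block layout (2-byte fp16 scale followed by 32 int8 quants) is as described elsewhere in this PR; the helper names and the software fp16 conversion are illustrative assumptions, not the actual ds4.c implementation.

```c
#include <stdint.h>
#include <string.h>

#define QK8_0 32  /* elements per block_q8_0 */

/* Convert IEEE binary16 bits to float in plain C (no intrinsics). */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) { bits = sign; }          /* signed zero */
        else {                                   /* subnormal: normalize */
            exp = 127 - 15 + 1;
            while (!(mant & 0x400)) { mant <<= 1; exp--; }
            mant &= 0x3ff;
            bits = sign | (exp << 23) | (mant << 13);
        }
    } else if (exp == 0x1f) {
        bits = sign | 0x7f800000u | (mant << 13); /* inf / NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float f; memcpy(&f, &bits, sizeof f); return f;
}

/* Dequantize n elements of block_q8_0 data: each 34-byte block is a
   2-byte fp16 scale followed by 32 int8 quants; value = scale * quant. */
static void dequant_q8_0_row(const uint8_t *src, float *dst, int n) {
    for (int b = 0; b < n / QK8_0; b++) {
        const uint8_t *blk = src + (size_t)b * (2 + QK8_0);
        uint16_t scale_bits; memcpy(&scale_bits, blk, 2);
        float d = half_bits_to_float(scale_bits);
        const int8_t *q = (const int8_t *)(blk + 2);
        for (int i = 0; i < QK8_0; i++) dst[b * QK8_0 + i] = d * q[i];
    }
}
```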

Metal kernels

  • metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
    dequantizes its source block on the fly).
  • metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for the
    multi-token F32 weight matmul that the F32 router path needs in prefill
    (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
  • metal/dsv4_kv.metal: a Q8_0 branch added to
    kernel_dsv4_compressor_store_one. Without this, the decode-time
    single-row compressor store treats Q8_0 ape as F32 and reads garbage.

Metal wiring (ds4_metal.m)

  • Register g_get_rows_q8_0_pipeline at init; clear at cleanup.
  • Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
    ds4_metal_encode_get_rows helper take a new weight_type parameter
    (GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
    weights->token_embd->type unchanged. ds4_metal_embed_row_layout picks
    the right per-row stride and pipeline.
  • ds4_metal_matmul_f32_tensor extended with a multi-token branch that
    dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing n_tok = 1
    path unchanged.
  • ds4_metal_encode_compressor_score_with_ape and the equivalent loop in
    ds4_metal_compressor_store_batch_tensor: for Q8_0 ape, dequantize on the
    CPU into a per-call private MTLBuffer and feed that into the existing
    add_f32_1d. Two new helpers (ds4_metal_half_bits_to_float,
    ds4_metal_cpu_dequant_q8_0_rows) implement the conversion; the CPU
    dequant matches gguf-py's dequantize reference byte-for-byte (verified
    in a standalone numeric check). A per-call buffer is required because
    multiple CPU writes to the previously-shared
    g_compressor_store_ape_buffer within one command buffer collapse to the
    last write at execute time (Metal kernels run in encode order, but CPU
    writes don't participate in that ordering when the same scratch is
    reused). The per-call buffer is retained until cb completion via
    addCompletedHandler because Metal does not strongly retain buffers
    bound to encoders.
  • Six ape_type validators relaxed to also accept 8 (Q8_0).
  • Six ape_bytes calculations centralized through a new
    ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the correct
    stride for F16 / F32 / Q8_0.
  • metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
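The stride arithmetic behind such an ape_bytes helper is small enough to show inline. A minimal sketch, assuming the signature described above and that a partial trailing Q8_0 block rounds up to a whole 34-byte block (the real ds4_metal.m code may handle edge cases differently):

```c
#include <stddef.h>

enum { T_F32 = 0, T_F16 = 1, T_Q8_0 = 8 };  /* GGUF type codes */

/* Byte size of n_elems elements for the three layouts this PR handles.
   Q8_0 packs 32 elements into 34 bytes (2-byte fp16 scale + 32 int8
   quants); rounding up covers any partial trailing block. */
static size_t ape_bytes(int ape_type, size_t n_elems) {
    switch (ape_type) {
    case T_F16:  return n_elems * 2;
    case T_F32:  return n_elems * 4;
    case T_Q8_0: return ((n_elems + 31) / 32) * 34;
    default:     return 0;  /* caller should have validated the type */
    }
}
```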

Why CPU dequant for Q8_0 ape (and not a Metal kernel)

I first wrote a kernel_cpy_q8_0_f32 Metal kernel using the same
block_q8_0 indexing pattern that the working dense Q8_0 matvec/matmul
kernels in metal/dense.metal use. It compiled cleanly but produced
silently wrong values for the actual compressor APE shapes (4 rows × 1024
cols of block_q8_0). I confirmed this by a side-by-side numeric check
against gguf-py's dequantize reference: my CPU dequant matches
byte-for-byte; the Metal kernel does not. I left kernel_cpy_q8_0_f32 in
metal/cpy.metal (its registration in ds4_metal.m is harmless) so that a
future debug session can pick it up; the compressor paths use the CPU
dequant as the active route.

What this PR does not cover

  • The CPU MoE path (ds4.c:5198, 5291) still hardcodes
    if (gate_w->type != IQ2_XXS) ds4_die(...). That path is reference/debug
    per AGENT.md and the production Metal flow doesn't touch it. If
    something forces CPU fallback (Metal disabled, MTP without Metal, certain
    trace modes) on a stock-recipe Q8_0 GGUF you'll see "expected IQ2_XXS
    expert tensors" and need a Q8_0 dispatch added there too. Out of scope
    for this PR; the production Metal flow is fine.
  • No quantization changes, no recipe changes, no new GGUF formats. This is
    a loader/dispatcher change so existing GGUFs that happen to use the
    stock recipe become loadable.

Test matrix (macOS / M5 / Metal)

  • make ds4-server clean (one pre-existing -Wpointer-sign warning from
    the unrelated MoE path, not introduced by this PR).
  • Cyberneurova Q2_K GGUF entirely unmodified, default flags:
    21-token prompt → coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Without the compressor APE fix, this prompt generated a few coherent
    tokens then <BOS> token spam.
  • Pre-harmonized variant (token_embd / HC / compressor / indexer all
    F16): still works byte-for-byte the same as before, no F16/F32 path
    regressions.
  • make ds4-server build clean across both branches.

audreyt added 2 commits May 10, 2026 06:22
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main, loading fails loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.

This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.

Validators (ds4.c):

  * New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
    and is applied to every projection that flows through a
    type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
    hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
    indexer_compressor_{ape,gate,kv}, ffn_gate_inp.
  * token_embd keeps its own inline F16/Q8_0 check because its CPU
    embed kernel doesn't go through matvec_any.
  * Two compressor decode-time guards (attn_compressor and
    indexer_compressor pair-projection paths) relaxed from "F16 only"
    to "F16 or Q8_0, paired type must match".

CPU paths (ds4.c):

  * Refactor embed_token_f16 into an embed_token dispatcher; add
    embed_token_q8_0 (block-wise dequant of block_q8_0).
  * Replace the remaining direct matvec_f16 / matvec_f16_serial
    callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
    matvec_any dispatcher; add matvec_any_serial for the HC pre/post
    path.
  * Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
    and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the
    pair fuses with the existing F16-pair kernel when both tensors are
    F16, otherwise dispatches to two single matmuls). All 22 hardcoded
    ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c (HC mix,
    attn/indexer compressors, indexer projections, output head, router)
    converted to use these wrappers.

Metal kernels:

  * metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
    dequantizes its source block on the fly).
  * metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
    the multi-token F32 weight matmul that the F32 router path needs in
    prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
  * metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
    the compressor APE byte-strided reader).

Metal wiring (ds4_metal.m):

  * Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
    init; clear them at cleanup.
  * Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
    ds4_metal_encode_get_rows helper take a new weight_type parameter
    (GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
    weights->token_embd->type unchanged. ds4_metal_embed_row_layout
    picks the right per-row stride and pipeline.
  * ds4_metal_matmul_f32_tensor extended with a multi-token branch
    that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
    n_tok = 1 path unchanged.
  * ds4_metal_encode_compressor_score_with_ape and the equivalent loop
    in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
    (ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
    accounts for the block_q8_0 layout.
  * Six ape_type validators relaxed to also accept 8 (Q8_0).
  * Six ape_bytes calculations centralized through a new
    ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
    correct stride for F16/F32/Q8_0.
  * metal_graph_matmul_plain_tensor extended with a Q8_0 branch.

Tested on macOS / M-series / Metal:

  * make ds4-server clean (no new warnings).
  * Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
    decode through to coherent generation ("PASS" returned for the
    "reply with the single word PASS" prompt).
  * Pre-harmonized variant (token_embd / hc / compressor / indexer all
    F16, ffn_gate_inp F16): still works byte-for-byte the same as
    before this change, no F16 path regressions.

Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.

1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
   byte-strided dequant) compiles cleanly and follows the same
   block_q8_0 indexing pattern used by other working Q8_0 kernels in
   dense.metal, but emits silently wrong values for the actual ape
   shapes (4 rows x 1024 cols of block_q8_0).  Confirmed by isolating
   the kernel: a CPU-side dequant of the same byte region matches
   gguf-py's `dequantize` reference byte-for-byte, while the Metal
   kernel's output is wrong.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
   in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
   fell through to F32 for everything else, so Q8_0 ape was reading
   garbage at decode time.

Fix:

* Replace the prefill APE Q8_0 path in
  `ds4_metal_encode_compressor_score_with_ape` and
  `ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
  via two new helpers (`ds4_metal_half_bits_to_float` and
  `ds4_metal_cpu_dequant_q8_0_rows`) into a *per-call* private
  MTLBuffer.  A per-call buffer is required because multiple CPU writes
  to the previously-shared `g_compressor_store_ape_buffer` within one
  command buffer collapse to the last write at execute time (Metal
  kernels run in encode order, but CPU writes don't participate in that
  ordering when the same scratch is reused).  The per-call buffer is
  retained until cb completion via `addCompletedHandler` because Metal
  does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
  block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
  inline.
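The block_q8_0 walk that branch performs reduces to simple offset arithmetic. A host-side C sketch of the addressing (illustrative helper names; the Metal kernel computes the same offsets in shader code):

```c
#include <stddef.h>

/* Offset arithmetic for block_q8_0: element i lives in block i/32;
   each 34-byte block is a 2-byte fp16 scale followed by 32 int8 quants. */
static size_t q8_0_scale_offset(int i) {
    return (size_t)(i / 32) * 34;
}
static size_t q8_0_quant_offset(int i) {
    return (size_t)(i / 32) * 34 + 2 + (size_t)(i % 32);
}
```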

The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.

Tested on macOS / M-series / Metal:

* make ds4-server clean (one pre-existing -Wpointer-sign warning from
  the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
  21-token prompt -> coherent generation
  ("An LLM, or Large Language Model, is a type of artificial intelligence").
  Previously this prompt generated a few coherent tokens then <BOS>
  token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
  F16): still works byte-for-byte the same as before this fix; no F16
  / F32 path regressions.
@audreyt audreyt marked this pull request as draft May 10, 2026 12:36

fry69 commented May 10, 2026

I am trying your PR with cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf

A simple "Hello" works:

$ ./ds4 --ctx 100000 -m ./gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf -p "Hello"
ds4: context buffers 2842.64 MiB (ctx=100000, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=25002)
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.490 ms, residency requested in 477.369 ms, warmup 3.854 ms (mapped 94228.38 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: Metal backend initialized for graph diagnostics
We need to respond to the user's greeting. The user said "Hello". As a helpful assistant, I should respond politely and ask how I can assist.
Hello! How can I help you today?
ds4: prefill: 41.37 t/s, generation: 38.94 t/s

But a longer prompt produces an error like this, after about 2 lines of thinking tokens output:

$ ./ds4 --ctx 100000 -m ./gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf --prompt-file ./prompt.md
ds4: context buffers 2842.64 MiB (ctx=100000, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=25002)
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.456 ms, residency requested in 467.611 ms, warmup 4.040 ms (mapped 94228.38 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: Metal backend initialized for graph diagnostics
[about 100 thinking tokens skipped]ds4: Metal graph indexer q projection expects F16 weights
ds4: decode failed: Metal decode failed

FWIW I think this is the smoking gun:

ds4: Metal graph indexer q projection expects F16 weights

@audreyt audreyt marked this pull request as ready for review May 10, 2026 18:13
The decode-time indexer code at metal_graph_encode_decode_layer (ds4.c:9082-9095)
still has two F16-only validators on indexer_attn_q_b and indexer_proj that I
missed in the initial loader pass.

These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e.
once the compressor has accumulated more rows than the decode-time top-k.
For short generations the path isn't reached; for ~400+ token generations
on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes
with finish_reason="error" / "Metal decode failed".

The downstream calls already use metal_graph_matmul_plain_tensor (which
dispatches to ds4_metal_matmul_q8_0_tensor for Q8_0). The loader-time
validator at line 2211-2212 already uses tensor_expect_dispatch_layout,
which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story"
prompt that generates ~800 tokens hits the validator after ~400 tokens
and the request errors out. After this fix, the same prompt streams 800+
tokens cleanly.

audreyt commented May 10, 2026

ds4: Metal graph indexer q projection expects F16 weights

Fixed in c2144e5!

@audreyt audreyt changed the title feat(loader): support stock-recipe (Q8_0/F32) GGUFs end-to-end on Metal feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal May 10, 2026

fry69 commented May 10, 2026

Great! Many thanks for the quick fix!

I can confirm that long prompts now work with the above mentioned model file and no longer produce an error.

There is a small likely cosmetic warning while compiling:

$ make clean && make
rm -f ds4 ds4-server ds4_native ds4_server_test ds4_test *.o
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4_cli.o ds4_cli.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o linenoise.o linenoise.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4.o ds4.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -fobjc-arc -c -o ds4_metal.o ds4_metal.m
ds4_metal.m:8801:12: warning: unused function 'ds4_metal_encode_cpy_q8_0_f32_1d' [-Wunused-function]
 8801 | static int ds4_metal_encode_cpy_q8_0_f32_1d(
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -o ds4 ds4_cli.o linenoise.o ds4.o ds4_metal.o -lm -pthread -framework Foundation -framework Metal
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4_server.o ds4_server.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o rax.o rax.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -o ds4-server ds4_server.o rax.o ds4.o ds4_metal.o -lm -pthread -framework Foundation -framework Metal

Also, this PR is currently 4 commits behind main. From the looks of it nothing should conflict; it can be merged cleanly.

Update: I checked this PR rebased against current main (22ca6ab) and it also works flawlessly with said model.

audreyt and others added 2 commits May 10, 2026 14:44
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb
(switched to CPU-side dequant to avoid an encode-time race on the shared
compressor scratch buffer), leaving the function unused and tripping
-Wunused-function on stock Make builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

audreyt commented May 10, 2026

There is a small likely cosmetic warning while compiling:

Fixed and synced from main. Ready for review from @antirez.

