feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal #60
audreyt wants to merge 6 commits into antirez:main
Conversation
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main these fail loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.
This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.
Validators (ds4.c):
* New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
and is applied to every projection that flows through a
type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
  indexer_compressor_{ape,gate,kv}, ffn_gate_inp (a sketch of the
  helper follows this list).
* token_embd keeps its own inline F16/Q8_0 check because its CPU
embed kernel doesn't go through matvec_any.
* Two compressor decode-time guards (attn_compressor and
indexer_compressor pair-projection paths) relaxed from "F16 only"
to "F16 or Q8_0, paired type must match".
CPU paths (ds4.c):
* Refactor embed_token_f16 into an embed_token dispatcher; add
  embed_token_q8_0 (block-wise dequant of block_q8_0; sketched after
  this list).
* Replace the remaining direct matvec_f16 / matvec_f16_serial
callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
matvec_any dispatcher; add matvec_any_serial for the HC pre/post
path.
* Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the
pair fuses with the existing F16-pair kernel when both tensors are
F16, otherwise dispatches to two single matmuls). All 22 hardcoded
ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c (HC mix,
attn/indexer compressors, indexer projections, output head, router)
converted to use these wrappers.
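For the embed path, the Q8_0 dequant walks 34-byte block_q8_0 blocks per
row. A minimal sketch, assuming my own layout constants and signatures (only
the helper names come from this PR; the f16-bit conversion is sketched later
in this thread):

```c
#include <stdint.h>
#include <string.h>

#define QK8_0            32   /* elements per block_q8_0 block             */
#define BLOCK_Q8_0_BYTES 34   /* 2-byte f16 scale + 32 int8 quantized vals */

float half_bits_to_float(uint16_t h);  /* f16 bit pattern -> float */

/* Dequantize one embedding row (the token's row) from block_q8_0 weights.
 * Assumes n_embd is a multiple of 32, which holds for these models. */
static void embed_token_q8_0(const uint8_t *weight, int64_t token,
                             int n_embd, float *dst) {
    const int nblk = n_embd / QK8_0;
    const uint8_t *row = weight + (size_t)token * nblk * BLOCK_Q8_0_BYTES;
    for (int b = 0; b < nblk; b++) {
        const uint8_t *blk = row + (size_t)b * BLOCK_Q8_0_BYTES;
        uint16_t sbits;
        memcpy(&sbits, blk, sizeof sbits);           /* per-block scale */
        const float d = half_bits_to_float(sbits);
        const int8_t *q = (const int8_t *)(blk + 2); /* 32 int8 quants  */
        for (int i = 0; i < QK8_0; i++)
            dst[b * QK8_0 + i] = d * (float)q[i];
    }
}
```

The embed_token dispatcher then only has to switch on the GGUF type code
(1 = F16, 8 = Q8_0) and call the matching row kernel.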
Metal kernels:
* metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
  dequantizes its source block on the fly; a C model of the per-thread
  math follows this list).
* metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
the multi-token F32 weight matmul that the F32 router path needs in
prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
* metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
the compressor APE byte-strided reader).
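The real kernels are Metal source; as a rough C model of the per-thread
arithmetic in kernel_get_rows_q8_0 (reusing the block constants and f16
helper from the embed sketch above, with hypothetical parameter names), each
thread owns one (row, col) output element and dequantizes exactly one value
from its source block:

```c
/* One (row, col) output element of get_rows: pick the source row via the
 * token id, locate the 34-byte block that holds `col`, dequantize one value. */
static float get_rows_q8_0_element(const uint8_t *weight,  /* block_q8_0 data */
                                   const int32_t *row_ids, /* token ids       */
                                   int n_cols, int row, int col) {
    const int nblk = n_cols / QK8_0;
    const uint8_t *blk = weight
        + (size_t)row_ids[row] * nblk * BLOCK_Q8_0_BYTES  /* source row   */
        + (size_t)(col / QK8_0) * BLOCK_Q8_0_BYTES;       /* source block */
    uint16_t sbits;
    memcpy(&sbits, blk, sizeof sbits);
    const int8_t q = ((const int8_t *)(blk + 2))[col % QK8_0];
    return half_bits_to_float(sbits) * (float)q;
}
```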
Metal wiring (ds4_metal.m):
* Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
init; clear them at cleanup.
* Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
ds4_metal_encode_get_rows helper take a new weight_type parameter
(GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
weights->token_embd->type unchanged. ds4_metal_embed_row_layout
picks the right per-row stride and pipeline.
* ds4_metal_matmul_f32_tensor extended with a multi-token branch
that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
n_tok = 1 path unchanged.
* ds4_metal_encode_compressor_score_with_ape and the equivalent loop
in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
(ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
accounts for the block_q8_0 layout.
* Six ape_type validators relaxed to also accept 8 (Q8_0).
* Six ape_bytes calculations centralized through a new
ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
  correct stride for F16/F32/Q8_0 (sketched after this list).
* metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
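The helper name and arguments come from this commit; the body below is an
assumed sketch of the stride math. The point is that Q8_0 packs 32 elements
into 34-byte blocks, so byte counts stop being a flat multiply:

```c
#include <stddef.h>

/* Bytes occupied by n_elems contiguous elements of a given GGUF type code.
 * For Q8_0, n_elems is a whole number of 32-element blocks in these tensors. */
static size_t ds4_metal_ape_bytes(int ape_type, size_t n_elems) {
    switch (ape_type) {
    case 0:  return n_elems * 4;          /* F32                         */
    case 1:  return n_elems * 2;          /* F16                         */
    case 8:  return (n_elems / 32) * 34;  /* Q8_0: f16 scale + 32 int8   */
    default: return 0;                    /* rejected by the validators  */
    }
}
```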
Tested on macOS / M-series / Metal:
* make ds4-server clean (no new warnings).
* Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
decode through to coherent generation ("PASS" returned for the
"reply with the single word PASS" prompt).
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
F16, ffn_gate_inp F16): still works byte-for-byte the same as
before this change, no F16 path regressions.
Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.
1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
byte-strided dequant) compiles cleanly and follows the same
block_q8_0 indexing pattern used by other working Q8_0 kernels in
dense.metal, but emits silently wrong values for the actual ape
shapes (4 rows x 1024 cols of block_q8_0). Confirmed by isolating
the kernel: a CPU-side dequant of the same byte region matches
gguf-py's `dequantize` reference byte-for-byte, while the Metal
kernel's output is wrong.
2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
fell through to F32 for everything else, so Q8_0 ape was reading
garbage at decode time.
Fix:
* Replace the prefill APE Q8_0 path in
`ds4_metal_encode_compressor_score_with_ape` and
`ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
  via two new helpers (`ds4_metal_half_bits_to_float` and
  `ds4_metal_cpu_dequant_q8_0_rows`; the bit-level half conversion is
  sketched after this list) into a *per-call* private
MTLBuffer. A per-call buffer is required because multiple CPU writes
to the previously-shared `g_compressor_store_ape_buffer` within one
command buffer collapse to the last write at execute time (Metal
kernels run in encode order, but CPU writes don't participate in that
ordering when the same scratch is reused). The per-call buffer is
retained until cb completion via `addCompletedHandler` because Metal
does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
inline.
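For reference, a self-contained sketch of the f16-bits-to-float conversion a
helper like `ds4_metal_half_bits_to_float` needs; the name is from this
commit, the body is my assumption of a standard implementation. block_q8_0
stores its per-block scale as raw IEEE-754 half bits, so the CPU dequant has
to widen them manually; the row dequant then just loops this over 34-byte
blocks, as in the embed sketch earlier in this thread:

```c
#include <stdint.h>
#include <string.h>

static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    int      exp  = (h >> 10) & 0x1F;   /* 5-bit exponent, bias 15 */
    uint32_t man  = h & 0x3FFu;         /* 10-bit mantissa         */
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                                  /* signed zero    */
        } else {                                          /* subnormal half */
            exp = 1;
            while (!(man & 0x400u)) { man <<= 1; exp--; } /* renormalize    */
            man &= 0x3FFu;
            bits = sign | (uint32_t)(exp + 112) << 23 | man << 13;
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | man << 13;            /* inf / NaN      */
    } else {
        bits = sign | (uint32_t)(exp + 112) << 23 | man << 13; /* rebias    */
    }
    float f;
    memcpy(&f, &bits, sizeof f);        /* bit-cast, no value conversion */
    return f;
}
```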
The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.
Tested on macOS / M-series / Metal:
* make ds4-server clean (one pre-existing -Wpointer-sign warning from
the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
21-token prompt -> coherent generation
("An LLM, or Large Language Model, is a type of artificial intelligence").
Previously this prompt generated a few coherent tokens then <BOS>
token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
F16): still works byte-for-byte the same as before this fix; no F16
/ F32 path regressions.
I am trying your PR with the cyberneurova Q2_K model. A simple "Hello" works:

```
$ ./ds4 --ctx 100000 -m ./gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf -p "Hello"
ds4: context buffers 2842.64 MiB (ctx=100000, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=25002)
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.490 ms, residency requested in 477.369 ms, warmup 3.854 ms (mapped 94228.38 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: Metal backend initialized for graph diagnostics
We need to respond to the user's greeting. The user said "Hello". As a helpful assistant, I should respond politely and ask how I can assist.
Hello! How can I help you today?
ds4: prefill: 41.37 t/s, generation: 38.94 t/s
```

But a longer prompt produces an error like this, after about 2 lines of thinking tokens output: […]

FWIW I think this is the smoking gun: […]
The decode-time indexer code at `metal_graph_encode_decode_layer` (ds4.c:9082-9095) still has two F16-only validators on `indexer_attn_q_b` and `indexer_proj` that I missed in the initial loader pass. These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e. once the compressor has accumulated more rows than the decode-time top-k. For short generations the path isn't reached; for ~400+ token generations on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes with finish_reason="error" / "Metal decode failed".

The downstream calls already use `metal_graph_matmul_plain_tensor` (which dispatches to `ds4_metal_matmul_q8_0_tensor` for Q8_0). The loader-time validator at line 2211-2212 already uses `tensor_expect_dispatch_layout`, which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story" prompt that generates ~800 tokens hits the validator after ~400 tokens and the request errors out. After this fix, the same prompt streams 800+ tokens cleanly.
Fixed in c2144e5!
Great! Many thanks for the quick fix! I can confirm that long prompts now work with the above-mentioned model file and no longer produce an error. There is a small, likely cosmetic, warning while compiling:

```
$ make clean && make
rm -f ds4 ds4-server ds4_native ds4_server_test ds4_test *.o
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4_cli.o ds4_cli.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o linenoise.o linenoise.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4.o ds4.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -fobjc-arc -c -o ds4_metal.o ds4_metal.m
ds4_metal.m:8801:12: warning: unused function 'ds4_metal_encode_cpy_q8_0_f32_1d' [-Wunused-function]
 8801 | static int ds4_metal_encode_cpy_q8_0_f32_1d(
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -o ds4 ds4_cli.o linenoise.o ds4.o ds4_metal.o -lm -pthread -framework Foundation -framework Metal
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o ds4_server.o ds4_server.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -c -o rax.o rax.c
cc -O3 -ffast-math -mcpu=native -Wall -Wextra -std=c99 -o ds4-server ds4_server.o rax.o ds4.o ds4_metal.o -lm -pthread -framework Foundation -framework Metal
```

Also this PR currently is 4 commits behind `main`. Update: I checked this PR rebased against current `main`.
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb (switched to CPU-side dequant to avoid an encode-time race on the shared compressor scratch buffer), leaving the function unused and tripping -Wunused-function on stock Make builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixed and synced from main. Ready for review from @antirez.
What this changes

DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter without
per-tensor type overrides ship most of the small projections at Q8_0 (and the
routed-expert router at F32) where the antirez recipe keeps them at F16.
Examples: the cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF
models. On stock `ds4` main these fail at the first F16-strict validator
(`token_embd`, then `output_hc_fn`, then `hc_attn_fn`, …), and even after the
validators are relaxed, several Metal kernel paths read weight bytes directly
via offset arithmetic that hard-codes F16/F32 strides and produce silently
wrong output for Q8_0.

This PR makes the embed / HC / compressor / indexer / router validators and
the corresponding Metal kernel paths polymorphic, so the same GGUF loads and
runs on Metal end-to-end on audreyt/pi-ds4.
Validators (`ds4.c`)

- New `tensor_expect_dispatch_layout` helper accepts F16, F32, or Q8_0 and is
  applied to every projection that flows through a type-dispatching
  matvec/matmul: `output_hc_fn`, `hc_attn_fn`, `hc_ffn_fn`,
  `attn_compressor_{ape,gate,kv}`, `indexer.{attn_q_b,proj}`,
  `indexer_compressor_{ape,gate,kv}`, `ffn_gate_inp`.
- `token_embd` keeps its own inline F16/Q8_0 check because its CPU embed
  kernel doesn't go through `matvec_any`.
- Two compressor decode-time guards (`attn_compressor` and
  `indexer_compressor` pair-projection paths) relaxed from "F16 only" to
  "F16 or Q8_0, paired type must match".
CPU paths (`ds4.c`)

- Refactor `embed_token_f16` into an `embed_token` dispatcher; add
  `embed_token_q8_0` (block-wise dequant of `block_q8_0`).
- Replace the remaining direct `matvec_f16` / `matvec_f16_serial` callers
  (HC fn, `output_hc_fn`, `ffn_gate_inp`) with the existing `matvec_any`
  dispatcher; add `matvec_any_serial` for the HC pre/post path.
- Polymorphic Metal-side dispatch helpers `metal_graph_matmul_plain_tensor`
  and `metal_graph_matmul_pair_plain_tensor` (extended for Q8_0; the pair
  fuses with the existing F16-pair kernel when both tensors are F16,
  otherwise dispatches to two single matmuls; see the sketch after this
  list). All 22 hardcoded `ds4_metal_matmul_f16{,_pair}_tensor` call sites in
  `ds4.c` (HC mix, attn/indexer compressors, indexer projections, output
  head, router) converted to use these wrappers.
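The pair wrapper's dispatch rule, in sketch form. The wrapper and kernel names
are from this PR; the stand-in types, signatures, and return conventions below
are illustrative assumptions:

```c
/* Stand-ins for illustration only; the real wrappers take ds4's own types. */
typedef struct { int type; } tensor_t;  /* GGUF type code: 1 = F16         */
typedef struct graph_s graph_t;         /* opaque encode context           */
typedef struct buf_s   buf_t;           /* opaque device buffer            */

int ds4_metal_matmul_f16_pair_tensor(graph_t *g, const tensor_t *a,
    const tensor_t *b, const buf_t *x, buf_t *oa, buf_t *ob, int n_tok);
int metal_graph_matmul_plain_tensor(graph_t *g, const tensor_t *w,
    const buf_t *x, buf_t *out, int n_tok);

/* Fuse only when both weights are F16; otherwise fall back to two single
 * type-dispatched matmuls so Q8_0/F32 (and mixed) pairs work unchanged. */
int metal_graph_matmul_pair_plain_tensor(graph_t *g,
    const tensor_t *a, const tensor_t *b,
    const buf_t *x, buf_t *oa, buf_t *ob, int n_tok) {
    if (a->type == 1 && b->type == 1)   /* F16 + F16: keep the fused kernel */
        return ds4_metal_matmul_f16_pair_tensor(g, a, b, x, oa, ob, n_tok);
    if (metal_graph_matmul_plain_tensor(g, a, x, oa, n_tok) != 0) return -1;
    return metal_graph_matmul_plain_tensor(g, b, x, ob, n_tok);
}
```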
Metal kernels

- `metal/get_rows.metal`: `kernel_get_rows_q8_0` (one float per thread,
  dequantizes its source block on the fly).
- `metal/dense.metal`: `kernel_mul_mm_f32_f32` template instantiation for the
  multi-token F32 weight matmul that the F32 router path needs in prefill
  (mirrors the existing F16/Q8_0 `mul_mm_t` instantiations).
- `metal/dsv4_kv.metal`: a Q8_0 branch added to
  `kernel_dsv4_compressor_store_one`. Without this, the decode-time
  single-row compressor store treats Q8_0 ape as F32 and reads garbage.
Metal wiring (`ds4_metal.m`)

- Register `g_get_rows_q8_0_pipeline` at init; clear at cleanup.
- Both `ds4_metal_embed_{token,tokens}_hc_tensor` and the shared
  `ds4_metal_encode_get_rows` helper take a new `weight_type` parameter
  (GGUF type code: 1=F16, 8=Q8_0). 8 callers in `ds4.c` forward
  `weights->token_embd->type` unchanged. `ds4_metal_embed_row_layout` picks
  the right per-row stride and pipeline.
- `ds4_metal_matmul_f32_tensor` extended with a multi-token branch that
  dispatches to `kernel_mul_mm_f32_f32` (`n_tok > 1`); existing `n_tok = 1`
  path unchanged.
- `ds4_metal_encode_compressor_score_with_ape` and the equivalent loop in
  `ds4_metal_compressor_store_batch_tensor`: for Q8_0 ape, dequantize on the
  CPU into a per-call private `MTLBuffer` and feed that into the existing
  `add_f32_1d`. Two new helpers (`ds4_metal_half_bits_to_float`,
  `ds4_metal_cpu_dequant_q8_0_rows`) implement the conversion; the CPU
  dequant matches gguf-py's `dequantize` reference byte-for-byte (verified
  in a standalone numeric check). A per-call buffer is required because
  multiple CPU writes to the previously-shared
  `g_compressor_store_ape_buffer` within one command buffer collapse to the
  last write at execute time (Metal kernels run in encode order, but CPU
  writes don't participate in that ordering when the same scratch is
  reused). The per-call buffer is retained until command-buffer completion
  via `addCompletedHandler` because Metal does not strongly retain buffers
  bound to encoders.
- Six `ape_type` validators relaxed to also accept `8` (Q8_0).
- Six `ape_bytes` calculations centralized through a new
  `ds4_metal_ape_bytes(ape_type, n_elems)` helper that returns the correct
  stride for F16 / F32 / Q8_0.
- `metal_graph_matmul_plain_tensor` extended with a Q8_0 branch.

Why CPU dequant for Q8_0 ape (and not a Metal kernel)
I first wrote a `kernel_cpy_q8_0_f32` Metal kernel using the same
`block_q8_0` indexing pattern that the working dense Q8_0 matvec/matmul
kernels in `metal/dense.metal` use. It compiled cleanly but produced
silently wrong values for the actual compressor APE shapes (4 rows × 1024
cols of `block_q8_0`). I confirmed this with a side-by-side numeric check
against gguf-py's `dequantize` reference: my CPU dequant matches
byte-for-byte; the Metal kernel does not. I left `kernel_cpy_q8_0_f32` in
`metal/cpy.metal` (its registration in `ds4_metal.m` is harmless) so that a
future debug session can pick it up; the compressor paths use the CPU
dequant as the active route.
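A cheap way to gain confidence in the bit-level half conversion without
reaching for Python, as a hypothetical stand-alone harness (this is not the
gguf-py check described above; it exhaustively compares the
`half_bits_to_float` sketch from earlier in this thread against clang's
native `_Float16` widening, available on Apple arm64; compile it without
`-ffast-math` so the NaN comparisons behave):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

float half_bits_to_float(uint16_t h);  /* the sketch from earlier */

int main(void) {
    for (uint32_t u = 0; u < 0x10000u; u++) {
        uint16_t bits = (uint16_t)u;
        _Float16 h;
        memcpy(&h, &bits, sizeof h);
        float want = (float)h;               /* compiler's conversion */
        float got  = half_bits_to_float(bits);
        int both_nan = (want != want) && (got != got);
        if (!both_nan && want != got) {
            fprintf(stderr, "mismatch at 0x%04x: %a vs %a\n", u, want, got);
            return 1;
        }
    }
    puts("all 65536 half bit patterns match");
    return 0;
}
```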
What this PR does not cover

- The CPU MoE expert path (`ds4.c:5198, 5291`) still hardcodes
  `if (gate_w->type != IQ2_XXS) ds4_die(...)`. That path is reference/debug
  per `AGENT.md` and the production Metal flow doesn't touch it. If
  something forces CPU fallback (Metal disabled, MTP without Metal, certain
  trace modes) on a stock-recipe Q8_0 GGUF you'll see "expected IQ2_XXS
  expert tensors" and need a Q8_0 dispatch added there too. Out of scope
  for this PR; the production Metal flow is fine.
- No conversion or harmonizer step is added: this is a loader/dispatcher
  change so existing GGUFs that happen to use the stock recipe become
  loadable.
Test matrix (macOS / M5/ Metal)
make ds4-serverclean (one pre-existing-Wpointer-signwarning fromthe unrelated MoE path, not introduced by this PR).
21-token prompt → coherent generation
("An LLM, or Large Language Model, is a type of artificial intelligence").
Without the compressor APE fix, this prompt generated a few coherent
tokens then
<BOS>token spam.token_embd/ HC / compressor / indexer allF16): still works byte-for-byte the same as before, no F16/F32 path
regressions.
make ds4-serverbuild clean across both branches.