
kv-cache: use -t threads for IQ4 packing from ggml code#22928

Open
shikaku2 wants to merge 1 commit into ggml-org:master from shikaku2:kv-cache-iq4-threaded-packing

Conversation

@shikaku2

kv-cache: use -t threads for IQ4 packing from ggml code

This PR hooks up threaded iq4_nl packing in ggml's CPU copy-to-quant path, respecting the -t THREADS argument. This makes iq4_nl packing fast enough that it becomes a good KV-cache quantization candidate for large contexts.
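The general shape of the change is splitting row-wise packing work across a thread pool. A minimal, self-contained sketch of that pattern is below; the names (`pack_row`, `pack_rows_threaded`) and the toy per-row quantizer are illustrative, not the actual ggml API, which packs rows into fixed-size iq4_nl blocks instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Toy stand-in for a per-row quantizer (real code would emit iq4_nl blocks).
static void pack_row(const float * src, uint8_t * dst, int n_per_row) {
    for (int i = 0; i < n_per_row; ++i) {
        dst[i] = (uint8_t)(src[i]);  // placeholder for real block packing
    }
}

// Rows pack independently, so split them into contiguous chunks,
// one chunk per worker thread, then join.
static void pack_rows_threaded(const float * src, uint8_t * dst,
                               int nrows, int n_per_row, int nthreads) {
    std::vector<std::thread> workers;
    const int rows_per_thread = (nrows + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        const int r0 = t * rows_per_thread;
        const int r1 = std::min(nrows, r0 + rows_per_thread);
        if (r0 >= r1) break;
        workers.emplace_back([=] {
            for (int r = r0; r < r1; ++r) {
                pack_row(src + (size_t)r * n_per_row,
                         dst + (size_t)r * n_per_row, n_per_row);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```

Because each row writes to a disjoint output range, no synchronization beyond the final join is needed, which is why the scaling in the tables below is close to linear at low thread counts.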

Comparison Table:

| Type   | Threads | Input MiB | Packed MiB | Seconds  | Input MiB/s | RMSE K     | RMSE V      |
|--------|--------:|----------:|-----------:|---------:|------------:|-----------:|------------:|
| f32    | 1       | 723.812   | 1447.620   | 1.11001  | 652.075     | 0          | 0           |
| f32    | 16      | 723.812   | 1447.620   | 1.10453  | 655.313     | 0          | 0           |
| f16    | 1       | 723.812   | 723.812    | 1.74912  | 413.815     | 0          | 0           |
| f16    | 16      | 723.812   | 723.812    | 1.74621  | 414.504     | 0          | 0           |
| bf16   | 1       | 723.812   | 723.812    | 1.04079  | 695.446     | 0.00427049 | 0.000530655 |
| bf16   | 16      | 723.812   | 723.812    | 1.05612  | 685.349     | 0.00427049 | 0.000530655 |
| q8_0   | 1       | 723.812   | 384.525    | 2.07237  | 349.268     | 0.0184124  | 0.00188816  |
| q8_0   | 16      | 723.812   | 384.525    | 1.09429  | 661.445     | 0.0184124  | 0.00188816  |
| q5_0   | 1       | 723.812   | 248.811    | 1.57535  | 459.462     | 0.146387   | 0.0150381   |
| q5_0   | 16      | 723.812   | 248.811    | 1.18220  | 612.259     | 0.146387   | 0.0150381   |
| q5_1   | 1       | 723.812   | 271.430    | 1.50332  | 481.476     | 0.113917   | 0.0126468   |
| q5_1   | 16      | 723.812   | 271.430    | 1.18375  | 611.458     | 0.113917   | 0.0126468   |
| q4_0   | 1       | 723.812   | 203.572    | 1.15734  | 625.410     | 0.293315   | 0.0301958   |
| q4_0   | 16      | 723.812   | 203.572    | 1.00654  | 719.109     | 0.293315   | 0.0301958   |
| q4_1   | 1       | 723.812   | 226.191    | 1.10527  | 654.874     | 0.235459   | 0.0261354   |
| q4_1   | 16      | 723.812   | 226.191    | 0.998586 | 724.838     | 0.235459   | 0.0261354   |
| iq4_nl | 1       | 723.812   | 203.572    | 59.5679  | 12.151      | 0.237420   | 0.0257450   |
| iq4_nl | 16      | 723.812   | 203.572    | 6.10200  | 118.619     | 0.237420   | 0.0257450   |
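For trying this out, llama.cpp exposes KV-cache type flags on the CLI; a usage sketch (flag names may vary between versions, and quantizing the V cache typically requires flash attention to be enabled):

```shell
# Quantize both K and V caches to iq4_nl, packing with 16 threads.
# -ctk/-ctv select the cache types; -fa enables flash attention.
./llama-cli -m model.gguf -t 16 -ctk iq4_nl -ctv iq4_nl -fa -p "Hello"
```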

Speed with increasing thread count (tested on a Ryzen 9 5950X):

| Type   | Threads | Input MiB | Packed MiB | Seconds  | Input MiB/s | RMSE K   | RMSE V    |
|--------|--------:|----------:|-----------:|---------:|------------:|---------:|----------:|
| iq4_nl | 1       | 77.3438   | 21.7529    | 6.47471  | 11.9455     | 0.223493 | 0.0254825 |
| iq4_nl | 2       | 77.3438   | 21.7529    | 3.50913  | 22.0407     | 0.223493 | 0.0254825 |
| iq4_nl | 3       | 77.3438   | 21.7529    | 2.43187  | 31.8043     | 0.223493 | 0.0254825 |
| iq4_nl | 4       | 77.3438   | 21.7529    | 1.91244  | 40.4425     | 0.223493 | 0.0254825 |
| iq4_nl | 5       | 77.3438   | 21.7529    | 1.61934  | 47.7625     | 0.223493 | 0.0254825 |
| iq4_nl | 6       | 77.3438   | 21.7529    | 1.38227  | 55.9543     | 0.223493 | 0.0254825 |
| iq4_nl | 7       | 77.3438   | 21.7529    | 1.23703  | 62.5236     | 0.223493 | 0.0254825 |
| iq4_nl | 8       | 77.3438   | 21.7529    | 1.09624  | 70.5535     | 0.223493 | 0.0254825 |
| iq4_nl | 9       | 77.3438   | 21.7529    | 1.02711  | 75.302      | 0.223493 | 0.0254825 |
| iq4_nl | 10      | 77.3438   | 21.7529    | 0.920712 | 84.0043     | 0.223493 | 0.0254825 |
| iq4_nl | 11      | 77.3438   | 21.7529    | 0.873178 | 88.5773     | 0.223493 | 0.0254825 |
| iq4_nl | 12      | 77.3438   | 21.7529    | 0.80838  | 95.6775     | 0.223493 | 0.0254825 |
| iq4_nl | 13      | 77.3438   | 21.7529    | 0.794275 | 97.3765     | 0.223493 | 0.0254825 |
| iq4_nl | 14      | 77.3438   | 21.7529    | 0.70401  | 109.862     | 0.223493 | 0.0254825 |
| iq4_nl | 15      | 77.3438   | 21.7529    | 0.68224  | 113.367     | 0.223493 | 0.0254825 |
| iq4_nl | 16      | 77.3438   | 21.7529    | 0.670024 | 115.434     | 0.223493 | 0.0254825 |
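From the timings above, 16 threads give roughly a 9.7x speedup over a single thread, i.e. about 60% parallel efficiency. Small helper functions for that arithmetic, computed directly from the table's values:

```cpp
// Speedup and parallel efficiency from measured serial/parallel timings.
// Values plugged in below come from the scaling table above:
//   t(1 thread) = 6.47471 s, t(16 threads) = 0.670024 s.
static double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

static double efficiency(double t_serial, double t_parallel, int nthreads) {
    return speedup(t_serial, t_parallel) / nthreads;
}
```

The sub-linear scaling past ~12 threads is consistent with the packing becoming memory-bandwidth-bound rather than compute-bound, though the PR does not profile this directly.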

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for the benchmark script and for verifying that the functions are hooked up correctly.

@shikaku2 shikaku2 requested a review from ggerganov as a code owner May 11, 2026 01:06
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 11, 2026