
kv-cache: use -t threads for IQ4 packing from ggml code#22928

Open
shikaku2 wants to merge 1 commit into ggml-org:master from shikaku2:kv-cache-iq4-threaded-packing

Conversation

@shikaku2

kv-cache: use -t threads for IQ4 packing from ggml code

This PR hooks up threaded iq4_nl packing in ggml's CPU copy-to-quant path, respecting the -t THREADS argument. This makes iq4_nl packing fast enough that it becomes a good KV-cache quantization candidate for large contexts.
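The general shape of the change is splitting row-wise packing work across a thread pool. A minimal, self-contained sketch of that pattern is below; the names (`pack_row`, `pack_rows_threaded`) and the toy per-row quantizer are illustrative, not the actual ggml API, which packs rows into fixed-size iq4_nl blocks instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Toy stand-in for a per-row quantizer (real code would emit iq4_nl blocks).
static void pack_row(const float * src, uint8_t * dst, int n_per_row) {
    for (int i = 0; i < n_per_row; ++i) {
        dst[i] = (uint8_t)(src[i]);  // placeholder for real block packing
    }
}

// Rows pack independently, so split them into contiguous chunks,
// one chunk per worker thread, then join.
static void pack_rows_threaded(const float * src, uint8_t * dst,
                               int nrows, int n_per_row, int nthreads) {
    std::vector<std::thread> workers;
    const int rows_per_thread = (nrows + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        const int r0 = t * rows_per_thread;
        const int r1 = std::min(nrows, r0 + rows_per_thread);
        if (r0 >= r1) break;
        workers.emplace_back([=] {
            for (int r = r0; r < r1; ++r) {
                pack_row(src + (size_t)r * n_per_row,
                         dst + (size_t)r * n_per_row, n_per_row);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```

Because each row writes to a disjoint output range, no synchronization beyond the final join is needed, which is why the scaling in the tables below is close to linear at low thread counts.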

Comparison Table:

| Type   | Threads | Input MiB | Packed MiB | Seconds  | Input MiB/s | RMSE K     | RMSE V      |
|--------|--------:|----------:|-----------:|---------:|------------:|-----------:|------------:|
| f32    | 1       | 723.812   | 1447.620   | 1.11001  | 652.075     | 0          | 0           |
| f32    | 16      | 723.812   | 1447.620   | 1.10453  | 655.313     | 0          | 0           |
| f16    | 1       | 723.812   | 723.812    | 1.74912  | 413.815     | 0          | 0           |
| f16    | 16      | 723.812   | 723.812    | 1.74621  | 414.504     | 0          | 0           |
| bf16   | 1       | 723.812   | 723.812    | 1.04079  | 695.446     | 0.00427049 | 0.000530655 |
| bf16   | 16      | 723.812   | 723.812    | 1.05612  | 685.349     | 0.00427049 | 0.000530655 |
| q8_0   | 1       | 723.812   | 384.525    | 2.07237  | 349.268     | 0.0184124  | 0.00188816  |
| q8_0   | 16      | 723.812   | 384.525    | 1.09429  | 661.445     | 0.0184124  | 0.00188816  |
| q5_0   | 1       | 723.812   | 248.811    | 1.57535  | 459.462     | 0.146387   | 0.0150381   |
| q5_0   | 16      | 723.812   | 248.811    | 1.18220  | 612.259     | 0.146387   | 0.0150381   |
| q5_1   | 1       | 723.812   | 271.430    | 1.50332  | 481.476     | 0.113917   | 0.0126468   |
| q5_1   | 16      | 723.812   | 271.430    | 1.18375  | 611.458     | 0.113917   | 0.0126468   |
| q4_0   | 1       | 723.812   | 203.572    | 1.15734  | 625.410     | 0.293315   | 0.0301958   |
| q4_0   | 16      | 723.812   | 203.572    | 1.00654  | 719.109     | 0.293315   | 0.0301958   |
| q4_1   | 1       | 723.812   | 226.191    | 1.10527  | 654.874     | 0.235459   | 0.0261354   |
| q4_1   | 16      | 723.812   | 226.191    | 0.998586 | 724.838     | 0.235459   | 0.0261354   |
| iq4_nl | 1       | 723.812   | 203.572    | 59.5679  | 12.151      | 0.237420   | 0.0257450   |
| iq4_nl | 16      | 723.812   | 203.572    | 6.10200  | 118.619     | 0.237420   | 0.0257450   |
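For trying this out, llama.cpp exposes KV-cache type flags on the CLI; a usage sketch (flag names may vary between versions, and quantizing the V cache typically requires flash attention to be enabled):

```shell
# Quantize both K and V caches to iq4_nl, packing with 16 threads.
# -ctk/-ctv select the cache types; -fa enables flash attention.
./llama-cli -m model.gguf -t 16 -ctk iq4_nl -ctv iq4_nl -fa -p "Hello"
```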

Speed with increasing thread count (tested on a Ryzen 9 5950X):

| Type   | Threads | Input MiB | Packed MiB | Seconds  | Input MiB/s | RMSE K   | RMSE V    |
|--------|--------:|----------:|-----------:|---------:|------------:|---------:|----------:|
| iq4_nl | 1       | 77.3438   | 21.7529    | 6.47471  | 11.9455     | 0.223493 | 0.0254825 |
| iq4_nl | 2       | 77.3438   | 21.7529    | 3.50913  | 22.0407     | 0.223493 | 0.0254825 |
| iq4_nl | 3       | 77.3438   | 21.7529    | 2.43187  | 31.8043     | 0.223493 | 0.0254825 |
| iq4_nl | 4       | 77.3438   | 21.7529    | 1.91244  | 40.4425     | 0.223493 | 0.0254825 |
| iq4_nl | 5       | 77.3438   | 21.7529    | 1.61934  | 47.7625     | 0.223493 | 0.0254825 |
| iq4_nl | 6       | 77.3438   | 21.7529    | 1.38227  | 55.9543     | 0.223493 | 0.0254825 |
| iq4_nl | 7       | 77.3438   | 21.7529    | 1.23703  | 62.5236     | 0.223493 | 0.0254825 |
| iq4_nl | 8       | 77.3438   | 21.7529    | 1.09624  | 70.5535     | 0.223493 | 0.0254825 |
| iq4_nl | 9       | 77.3438   | 21.7529    | 1.02711  | 75.302      | 0.223493 | 0.0254825 |
| iq4_nl | 10      | 77.3438   | 21.7529    | 0.920712 | 84.0043     | 0.223493 | 0.0254825 |
| iq4_nl | 11      | 77.3438   | 21.7529    | 0.873178 | 88.5773     | 0.223493 | 0.0254825 |
| iq4_nl | 12      | 77.3438   | 21.7529    | 0.80838  | 95.6775     | 0.223493 | 0.0254825 |
| iq4_nl | 13      | 77.3438   | 21.7529    | 0.794275 | 97.3765     | 0.223493 | 0.0254825 |
| iq4_nl | 14      | 77.3438   | 21.7529    | 0.70401  | 109.862     | 0.223493 | 0.0254825 |
| iq4_nl | 15      | 77.3438   | 21.7529    | 0.68224  | 113.367     | 0.223493 | 0.0254825 |
| iq4_nl | 16      | 77.3438   | 21.7529    | 0.670024 | 115.434     | 0.223493 | 0.0254825 |
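From the timings above, 16 threads give roughly a 9.7x speedup over a single thread, i.e. about 60% parallel efficiency. Small helper functions for that arithmetic, computed directly from the table's values:

```cpp
// Speedup and parallel efficiency from measured serial/parallel timings.
// Values plugged in below come from the scaling table above:
//   t(1 thread) = 6.47471 s, t(16 threads) = 0.670024 s.
static double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

static double efficiency(double t_serial, double t_parallel, int nthreads) {
    return speedup(t_serial, t_parallel) / nthreads;
}
```

The sub-linear scaling past ~12 threads is consistent with the packing becoming memory-bandwidth-bound rather than compute-bound, though the PR does not profile this directly.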

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for the benchmark script and for verifying that the functions are hooked up correctly.

@shikaku2 shikaku2 requested a review from ggerganov as a code owner May 11, 2026 01:06
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 11, 2026