Low-bit KV-cache compression experiments with honest metadata accounting.
In a local GPT-2 CPU/Python fake-quant benchmark, MegaQuant's best current point gives:
| What you care about | Result | Compared with |
|---|---|---|
| Modeled KV payload size | 19.6% of FP16 | 5.11x compression / 80.4% saving vs FP16 |
| Attention-output quality | +11.0% higher | vs local RotorQuant-3b baseline (0.942270 vs 0.848665) |
| Memory cost vs RotorQuant-3b | +4.35% more | 3.130399 vs 3.000000 effective bits/dim |
Main method: `affine_seven_level_3bit_g64_meta4`
- 3.130399 effective bits/dim
- 0.942270 attention-output cosine
Need lower memory? The 2-bit Hadamard variant uses 24.8% less modeled memory than local RotorQuant-3b (2.255399 vs 3.000000 bits/dim) while landing in the same attention-output-cosine range in this benchmark (0.851023 vs 0.848665).
This repository is a research proof-of-concept, not a production inference engine.
The numbers above are:
- from a narrow GPT-2 KV-cache quality benchmark,
- CPU/Python fake-quant results,
- based on modeled `effective_bits_per_dim` including declared metadata,
- comparisons against local Python baseline implementations.
They are not claims about CUDA kernels, real VRAM, decode throughput, or general superiority across LLMs.
effective_bits_per_dim is a theoretical accounting model for quantized code bits plus declared metadata bits. It is not measured packed tensor storage, kernel layout overhead, or actual GPU VRAM usage.
For example, affine_seven_level_3bit_g64_meta4 uses 3 code bits plus simulated int4 scale/zero metadata amortized over group size 64, giving 3.130399 effective bits/dim in the public conservative accounting used for this benchmark.
Small implementation overheads such as padding, headers, estimator state, and runtime bookkeeping are not measured here.
For simulated meta4/meta8 methods, the public tables add a small conservative term for shared metadata-range parameters. This is still a modeled storage budget, not a packed-kernel measurement.
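For concreteness, here is a minimal sketch of that accounting model. The function name and the `shared_overhead` argument are illustrative, not the repository's actual API:

```python
def effective_bits_per_dim(code_bits: int, scale_bits: int, zero_bits: int,
                           group_size: int, shared_overhead: float = 0.0) -> float:
    """Modeled storage budget: per-value code bits plus scale/zero metadata
    amortized over the group, plus any declared conservative shared term."""
    return code_bits + (scale_bits + zero_bits) / group_size + shared_overhead

# affine_seven_level_3bit_g64_meta4: 3 code bits + int4 scale/zero over groups
# of 64 gives 3 + 8/64 = 3.125; the published 3.130399 additionally includes
# the small conservative shared metadata-range term described above.
print(effective_bits_per_dim(3, 4, 4, 64))  # 3.125
```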
Benchmark setup:
- model: GPT-2 (local)
- text: local SQuAD corpus
- tokens: 384
- layers: 4
- heads: 4
- query positions: 96, 191, 287, 383
- cases: 64
- runtime: CPU/Python fake quantization
- primary metric: attention-output cosine
Attention-output cosine and score cosine are proxy quality metrics. They are not perplexity, downstream generation quality, latency, or VRAM measurements.
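As a sketch of what the primary metric measures (single head, single query, no masking; illustrative code, not the benchmark harness):

```python
import numpy as np

def attention_output(q, K, V):
    # Softmax-weighted value mixture for one query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def attn_out_cosine(q, K_ref, V_ref, K_q, V_q):
    # Cosine between attention outputs from reference vs fake-quantized KV.
    a = attention_output(q, K_ref, V_ref)
    b = attention_output(q, K_q, V_q)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```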
This is a selected frontier table, not the full sweep. See results/_combo_full.csv, results/_combo_full_accounting_corrected.csv, and results/_combo_final_table.md for more rows. _combo_full.csv has been realigned to the same conservative accounting model; _combo_full_accounting_corrected.csv is retained as an explicit audit artifact.
| Class | Method | Effective bits/dim | Attn-out cosine | Score cosine | Notes |
|---|---|---|---|---|---|
| External 2-bit best | IsoQuant-2b | 2.000000 | 0.735624 | 0.971382 | best among locally reproduced external 2-bit baselines by attn-out |
| MegaQuant near-2.16-bit offline proxy | sink1_channelwise_four_level_2bit | 2.161955 | 0.835087 | 0.985141 | offline channelwise proxy; corrected sink-token/channel-scale accounting |
| MegaQuant 2.26-bit | hadamard_affine_four_level_2bit_g64_meta8 | 2.255399 | 0.851023 | 0.987475 | low-bit Pareto point; slightly higher attn-out than local RotorQuant-3b in this benchmark |
| External 3-bit best | RotorQuant-3b | 3.000000 | 0.848665 | 0.988019 | best among locally reproduced external 3-bit baselines by attn-out |
| MegaQuant strict 3-bit | hadamard_affine_four_level_2bit_g32 | 3.000000 | 0.859203 | 0.990216 | best strict 3.0 effective-bit MegaQuant in this benchmark |
| MegaQuant main point | affine_seven_level_3bit_g64_meta4 | 3.130399 | 0.942270 | 0.997257 | best observed quality/compression tradeoff in this benchmark; includes conservative metadata-range overhead |
| MegaQuant high-quality ref | affine_seven_level_3bit_g32 | 4.000000 | 0.948879 | 0.997991 | high-quality reference |
Near-2.16-bit offline proxy (`sink1_channelwise_four_level_2bit`). It quantizes channels with a four-level 2-bit codebook and keeps the first (sink) token in FP16. Because channelwise scales are estimated over the evaluated prefix, this point is reported as an offline proxy rather than a proven streamable KV-cache format (sketched below).
effective_bits_per_dim = 2.161955
attn_out_cos_mean = 0.835087
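A minimal sketch of the two ingredients, sink-token protection plus per-channel four-level fake quantization. Helper names are illustrative, and the real implementation differs in details such as scale estimation:

```python
import numpy as np

def channelwise_four_level(x: np.ndarray) -> np.ndarray:
    # Per-channel asymmetric 2-bit (four-level) fake quantization over tokens;
    # scales come from the evaluated prefix, hence "offline proxy".
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes * scale + lo

def quantize_with_sink(kv: np.ndarray, n_sink: int = 1) -> np.ndarray:
    # Keep the first n_sink (attention-sink) token rows in full precision.
    out = kv.astype(np.float32).copy()
    out[n_sink:] = channelwise_four_level(kv[n_sink:])
    return out
```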
Best 2.26-bit Pareto point in this benchmark (`hadamard_affine_four_level_2bit_g64_meta8`). It applies fixed Hadamard/RHT-style preconditioning, then asymmetric 2-bit affine quantization with group size 64 and int8 scale/zero metadata (preconditioning sketched below).
effective_bits_per_dim = 2.255399
attn_out_cos_mean = 0.851023
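A sketch of the preconditioning step (Sylvester-construction Hadamard, head dimension a power of two; illustrative, not the repository's exact transform):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction; requires n to be a power of two (64 for GPT-2 heads).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rht_precondition(x: np.ndarray) -> np.ndarray:
    # Rotate each head-dim vector by the normalized Hadamard matrix before
    # groupwise affine quantization; the inverse rotation is the transpose.
    n = x.shape[-1]
    H = hadamard(n) / np.sqrt(n)
    return x @ H
```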
Best strict 3.0 effective-bit method in this benchmark (`hadamard_affine_four_level_2bit_g32`).
effective_bits_per_dim = 3.000000
attn_out_cos_mean = 0.859203
Main observed winner in this benchmark (`affine_seven_level_3bit_g64_meta4`). It uses asymmetric 3-bit affine quantization with group size 64 and simulated int4 scale/zero metadata (sketched below).
effective_bits_per_dim = 3.130399
attn_out_cos_mean = 0.942270
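A sketch of the groupwise seven-level affine quantizer. Float scale/zero are used here for brevity; the repository additionally simulates int4 metadata:

```python
import numpy as np

def affine_quant_group(x: np.ndarray, levels: int = 7):
    # Asymmetric affine quantization of one group of values: seven uniform
    # levels spanning [min, max] still fit in a 3-bit code.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels - 1).astype(np.uint8)
    return codes, scale, lo

def affine_dequant_group(codes, scale, lo):
    # Reconstruct fake-quantized values for quality evaluation.
    return codes.astype(np.float32) * scale + lo
```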
High-quality reference (`affine_seven_level_3bit_g32`).
effective_bits_per_dim = 4.000000
attn_out_cos_mean = 0.948879
Install dependencies:

```bash
python -m pip install -r requirements.txt
```

Place SQuAD files in the repository root as described in DATA.md.
Run the MegaQuant-only benchmark from the repository root:

```bash
python scripts/benchmark_real_model.py \
  --skip-external \
  --model gpt2 \
  --local-only \
  --rotorquant-path none \
  --output-csv results/_combo_full.csv \
  --output-md results/_combo_full.md
```

The selected final comparison table and related artifacts are available at:
- results/_combo_final_table.md
- results/_combo_full_accounting_corrected.csv
- results/publishable_comparison.md
- results/efficient_frontier_report.md
Documentation:
- methods/efficient_frontier.md - method cards for the selected frontier methods.
- results/efficient_frontier_report.md - research-style report and selected method rationale.
- results/publishable_comparison.md - compact comparison table for presentation.
- results/_combo_final_table.md - selected full comparison table.
- results/research_ideas_next_round.md - next-step idea bank.
- CHANGELOG.md - change history.
The methods are inspired by common quantization and compression ideas such as groupwise affine quantization, metadata-aware accounting, Hadamard/RHT-style preconditioning, attention sink protection, and prior KV-cache work such as KIVI and KVQuant. Comparisons to RotorQuant/TurboQuant/IsoQuant/PlanarQuant in this repo are local benchmark comparisons only.
This project currently demonstrates a CPU/Python fake-quant quality frontier.
Not yet proven:
- packed 2/3-bit storage,
- CUDA kernels,
- throughput/latency,
- real VRAM savings,
- long-context evaluation,
- LLaMA/Qwen/Mistral-family models,
- perplexity or downstream generation quality,
- exact reproduction of official external baseline kernels.
Conservative claim:
In this CPU-only GPT-2 fake-quant benchmark, under a theoretical effective-bit accounting model, selected MegaQuant methods show a stronger observed attention-output-cosine/compression frontier than the tested local Python baseline implementations.
This repository is public as a research PoC. It is not a production inference engine.
Suggested GitHub topics after public release:
llm kv-cache quantization compression low-bit inference research