Low-bit KV-cache compression experiments with honest metadata accounting.
In a local GPT-2 CPU/Python fake-quant benchmark, MegaQuant's best current point gives:
| What you care about | Result | Compared with |
|---|---|---|
| Modeled KV payload size | 19.6% of FP16 | 5.11x compression / 80.4% saving vs FP16 |
| Attention-output quality | +11.0% higher | vs local RotorQuant-3b baseline (0.942270 vs 0.848665) |
| Memory cost vs RotorQuant-3b | +4.35% more | 3.130399 vs 3.000000 effective bits/dim |
Main method: `affine_seven_level_3bit_g64_meta4`
- 3.130399 effective bits/dim
- 0.942270 attention-output cosine
Need lower memory? The 2-bit Hadamard variant uses 24.8% less modeled memory than local RotorQuant-3b (2.255399 vs 3.000000 bits/dim) while landing in the same attention-output-cosine range in this benchmark (0.851023 vs 0.848665).
This repository is a research proof-of-concept, not a production inference engine.
The numbers above are:
- from a narrow GPT-2 KV-cache quality benchmark,
- CPU/Python fake-quant results,
- based on modeled `effective_bits_per_dim` including declared metadata,
- comparisons against local Python baseline implementations.
They are not claims about CUDA kernels, real VRAM, decode throughput, or general superiority across LLMs.
effective_bits_per_dim is a theoretical accounting model for quantized code bits plus declared metadata bits. It is not measured packed tensor storage, kernel layout overhead, or actual GPU VRAM usage.
For example, affine_seven_level_3bit_g64_meta4 uses 3 code bits plus simulated int4 scale/zero metadata amortized over group size 64, giving 3.130399 effective bits/dim in the public conservative accounting used for this benchmark.
Small implementation overheads such as padding, headers, estimator state, and runtime bookkeeping are not measured here.
For simulated meta4/meta8 methods, the public tables add a small conservative term for shared metadata-range parameters. This is still a modeled storage budget, not a packed-kernel measurement.
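For concreteness, here is a minimal sketch of that accounting model. The function name and the `shared_overhead` argument are illustrative, not the repository's actual API:

```python
def effective_bits_per_dim(code_bits: int, scale_bits: int, zero_bits: int,
                           group_size: int, shared_overhead: float = 0.0) -> float:
    """Modeled storage budget: per-value code bits plus scale/zero metadata
    amortized over the group, plus any declared conservative shared term."""
    return code_bits + (scale_bits + zero_bits) / group_size + shared_overhead

# affine_seven_level_3bit_g64_meta4: 3 code bits + int4 scale/zero over groups
# of 64 gives 3 + 8/64 = 3.125; the published 3.130399 additionally includes
# the small conservative shared metadata-range term described above.
print(effective_bits_per_dim(3, 4, 4, 64))  # 3.125
```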
Benchmark setup:
- model: GPT-2 (local)
- text: local SQuAD corpus
- tokens: 384
- layers: 4
- heads: 4
- query positions: 96, 191, 287, 383
- cases: 64
- runtime: CPU/Python fake quantization
- primary metric: attention-output cosine
Attention-output cosine and score cosine are proxy quality metrics. They are not perplexity, downstream generation quality, latency, or VRAM measurements.
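As a sketch of what the primary metric measures (single head, single query, no masking; illustrative code, not the benchmark harness):

```python
import numpy as np

def attention_output(q, K, V):
    # Softmax-weighted value mixture for one query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def attn_out_cosine(q, K_ref, V_ref, K_q, V_q):
    # Cosine between attention outputs from reference vs fake-quantized KV.
    a = attention_output(q, K_ref, V_ref)
    b = attention_output(q, K_q, V_q)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```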
This is a selected frontier table, not the full sweep. See results/_combo_full.csv, results/_combo_full_accounting_corrected.csv, and results/_combo_final_table.md for more rows. _combo_full.csv has been realigned to the same conservative accounting model; _combo_full_accounting_corrected.csv is retained as an explicit audit artifact.
| Class | Method | Effective bits/dim | Attn-out cosine | Score cosine | Notes |
|---|---|---|---|---|---|
| External 2-bit best | IsoQuant-2b | 2.000000 | 0.735624 | 0.971382 | best among locally reproduced external 2-bit baselines by attn-out |
| MegaQuant near-2.16-bit offline proxy | sink1_channelwise_four_level_2bit | 2.161955 | 0.835087 | 0.985141 | offline channelwise proxy; corrected sink-token/channel-scale accounting |
| MegaQuant 2.26-bit | hadamard_affine_four_level_2bit_g64_meta8 | 2.255399 | 0.851023 | 0.987475 | low-bit Pareto point; slightly higher attn-out than local RotorQuant-3b in this benchmark |
| External 3-bit best | RotorQuant-3b | 3.000000 | 0.848665 | 0.988019 | best among locally reproduced external 3-bit baselines by attn-out |
| MegaQuant strict 3-bit | hadamard_affine_four_level_2bit_g32 | 3.000000 | 0.859203 | 0.990216 | best strict 3.0 effective-bit MegaQuant in this benchmark |
| MegaQuant main point | affine_seven_level_3bit_g64_meta4 | 3.130399 | 0.942270 | 0.997257 | best observed quality/compression tradeoff in this benchmark; includes conservative metadata-range overhead |
| MegaQuant high-quality ref | affine_seven_level_3bit_g32 | 4.000000 | 0.948879 | 0.997991 | high-quality reference |
Near-2.16-bit offline proxy (`sink1_channelwise_four_level_2bit`). It quantizes channels with a four-level 2-bit codebook and keeps the first (sink) token in FP16. Because channelwise scales are estimated over the evaluated prefix, this point is reported as an offline proxy rather than a proven streamable KV-cache format (sketched below).
effective_bits_per_dim = 2.161955
attn_out_cos_mean = 0.835087
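A minimal sketch of the two ingredients, sink-token protection plus per-channel four-level fake quantization. Helper names are illustrative, and the real implementation differs in details such as scale estimation:

```python
import numpy as np

def channelwise_four_level(x: np.ndarray) -> np.ndarray:
    # Per-channel asymmetric 2-bit (four-level) fake quantization over tokens;
    # scales come from the evaluated prefix, hence "offline proxy".
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes * scale + lo

def quantize_with_sink(kv: np.ndarray, n_sink: int = 1) -> np.ndarray:
    # Keep the first n_sink (attention-sink) token rows in full precision.
    out = kv.astype(np.float32).copy()
    out[n_sink:] = channelwise_four_level(kv[n_sink:])
    return out
```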
Best 2.26-bit Pareto point in this benchmark (`hadamard_affine_four_level_2bit_g64_meta8`). It applies fixed Hadamard/RHT-style preconditioning, then asymmetric 2-bit affine quantization with group size 64 and int8 scale/zero metadata (preconditioning sketched below).
effective_bits_per_dim = 2.255399
attn_out_cos_mean = 0.851023
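A sketch of the preconditioning step (Sylvester-construction Hadamard, head dimension a power of two; illustrative, not the repository's exact transform):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction; requires n to be a power of two (64 for GPT-2 heads).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rht_precondition(x: np.ndarray) -> np.ndarray:
    # Rotate each head-dim vector by the normalized Hadamard matrix before
    # groupwise affine quantization; the inverse rotation is the transpose.
    n = x.shape[-1]
    H = hadamard(n) / np.sqrt(n)
    return x @ H
```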
Best strict 3.0 effective-bit method in this benchmark (`hadamard_affine_four_level_2bit_g32`).
effective_bits_per_dim = 3.000000
attn_out_cos_mean = 0.859203
Main observed winner in this benchmark (`affine_seven_level_3bit_g64_meta4`). It uses asymmetric 3-bit affine quantization with group size 64 and simulated int4 scale/zero metadata (sketched below).
effective_bits_per_dim = 3.130399
attn_out_cos_mean = 0.942270
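A sketch of the groupwise seven-level affine quantizer. Float scale/zero are used here for brevity; the repository additionally simulates int4 metadata:

```python
import numpy as np

def affine_quant_group(x: np.ndarray, levels: int = 7):
    # Asymmetric affine quantization of one group of values: seven uniform
    # levels spanning [min, max] still fit in a 3-bit code.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels - 1).astype(np.uint8)
    return codes, scale, lo

def affine_dequant_group(codes, scale, lo):
    # Reconstruct fake-quantized values for quality evaluation.
    return codes.astype(np.float32) * scale + lo
```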
High-quality reference (`affine_seven_level_3bit_g32`).
effective_bits_per_dim = 4.000000
attn_out_cos_mean = 0.948879
Install dependencies:

```bash
python -m pip install -r requirements.txt
```

Place SQuAD files in the repository root as described in DATA.md.
Run the MegaQuant-only benchmark from the repository root:

```bash
python scripts/benchmark_real_model.py \
  --skip-external \
  --model gpt2 \
  --local-only \
  --rotorquant-path none \
  --output-csv results/_combo_full.csv \
  --output-md results/_combo_full.md
```

The selected final comparison table and related artifacts are available at:
- results/_combo_final_table.md
- results/_combo_full_accounting_corrected.csv
- results/publishable_comparison.md
- results/efficient_frontier_report.md
Documentation:
- methods/efficient_frontier.md - method cards for the selected frontier methods.
- results/efficient_frontier_report.md - research-style report and selected method rationale.
- results/publishable_comparison.md - compact comparison table for presentation.
- results/_combo_final_table.md - selected full comparison table.
- results/research_ideas_next_round.md - next-step idea bank.
- CHANGELOG.md - change history.
The methods are inspired by common quantization and compression ideas such as groupwise affine quantization, metadata-aware accounting, Hadamard/RHT-style preconditioning, attention sink protection, and prior KV-cache work such as KIVI and KVQuant. Comparisons to RotorQuant/TurboQuant/IsoQuant/PlanarQuant in this repo are local benchmark comparisons only.
This project currently demonstrates a CPU/Python fake-quant quality frontier.
Not yet proven:
- packed 2/3-bit storage,
- CUDA kernels,
- throughput/latency,
- real VRAM savings,
- long-context evaluation,
- LLaMA/Qwen/Mistral-family models,
- perplexity or downstream generation quality,
- exact reproduction of official external baseline kernels.
Conservative claim:
In this CPU-only GPT-2 fake-quant benchmark, under a theoretical effective-bit accounting model, selected MegaQuant methods show a stronger observed attention-output-cosine/compression frontier than the tested local Python baseline implementations.
This repository is public as a research PoC. It is not a production inference engine.
Suggested GitHub topics after public release:
llm kv-cache quantization compression low-bit inference research