
MegaQuant KV-Cache

Research PoC · CPU/Python · Metadata-aware

Low-bit KV-cache compression experiments with honest metadata accounting.

At a glance

In a local GPT-2 CPU/Python fake-quant benchmark, MegaQuant's current best point achieves:

| What you care about | Result | Compared with |
| --- | --- | --- |
| Modeled KV payload size | 19.6% of FP16 | 5.11x compression / 80.4% saving vs FP16 |
| Attention-output quality | +11.0% higher | local RotorQuant-3b baseline (0.942270 vs 0.848665) |
| Memory cost vs RotorQuant-3b | +4.35% more | 3.130399 vs 3.000000 effective bits/dim |
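
The headline figures follow directly from the bits/dim and cosine numbers in the table; a quick sanity check in plain Python (no repo code required):

```python
fp16_bits, megaquant, rotorquant = 16.0, 3.130399, 3.000000

print(megaquant / fp16_bits)       # 0.1956 -> 19.6% of FP16, i.e. 80.4% saving
print(fp16_bits / megaquant)       # 5.111  -> 5.11x compression
print(0.942270 / 0.848665 - 1)     # 0.1103 -> +11.0% attention-output cosine
print(megaquant / rotorquant - 1)  # 0.0435 -> +4.35% more effective bits/dim
```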

Main method:

affine_seven_level_3bit_g64_meta4
3.130399 effective bits/dim
0.942270 attention-output cosine

Need lower memory? The 2-bit Hadamard variant uses 24.8% less modeled memory than local RotorQuant-3b (2.255399 vs 3.000000 bits/dim) while landing in the same attention-output-cosine range in this benchmark (0.851023 vs 0.848665).

Scope

This repository is a research proof-of-concept, not a production inference engine.

The numbers above are:

  • from a narrow GPT-2 KV-cache quality benchmark,
  • CPU/Python fake-quant results,
  • based on modeled effective_bits_per_dim including declared metadata,
  • comparisons against local Python baseline implementations.

They are not claims about CUDA kernels, real VRAM, decode throughput, or general superiority across LLMs.

Related repository

RAG/vector-index companion project:

What effective_bits_per_dim means

effective_bits_per_dim is a theoretical accounting model for quantized code bits plus declared metadata bits. It is not measured packed tensor storage, kernel layout overhead, or actual GPU VRAM usage.

For example, affine_seven_level_3bit_g64_meta4 uses 3 code bits plus simulated int4 scale/zero metadata amortized over group size 64, giving 3.130399 effective bits/dim in the public conservative accounting used for this benchmark.

Small implementation overheads such as padding, headers, estimator state, and runtime bookkeeping are not measured here.

For simulated meta4/meta8 methods, the public tables add a small conservative term for shared metadata-range parameters. This is still a modeled storage budget, not a packed-kernel measurement.
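
Concretely, the accounting model reduces to a one-line formula. A minimal sketch (the exact conservative shared-range term used in the public tables comes from the repo's accounting code and is not reproduced here):

```python
def effective_bits_per_dim(code_bits, meta_bits_per_group, group_size, shared_term=0.0):
    """Code bits per element, plus per-group scale/zero metadata amortized over the
    group, plus an optional conservative shared metadata-range term."""
    return code_bits + meta_bits_per_group / group_size + shared_term

# affine_seven_level_3bit_g64_meta4: 3 code bits + int4 scale + int4 zero per group of 64
print(effective_bits_per_dim(3, 4 + 4, 64))  # 3.125; the published 3.130399 adds ~0.0054
# hadamard_affine_four_level_2bit_g64_meta8: 2 code bits + int8 scale/zero per group of 64
print(effective_bits_per_dim(2, 8 + 8, 64))  # 2.250; the published 2.255399 adds the same term
```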

Benchmark setup

  • model: GPT-2 (local)
  • text: local SQuAD corpus
  • tokens: 384
  • layers: 4
  • heads: 4
  • query positions: 96, 191, 287, 383
  • cases: 64
  • runtime: CPU/Python fake quantization
  • primary metric: attention-output cosine

Attention-output cosine and score cosine are proxy quality metrics. They are not perplexity, downstream generation quality, latency, or VRAM measurements.
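
For concreteness, the proxy is a plain cosine between the attention outputs computed with FP16 KV and with quantized KV; a minimal sketch (the repository's exact evaluation and aggregation may differ):

```python
import numpy as np

def attn_out_cosine(out_ref: np.ndarray, out_quant: np.ndarray) -> float:
    """Cosine similarity between attention outputs under FP16 and quantized KV."""
    a = out_ref.ravel().astype(np.float64)
    b = out_quant.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```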

Recommended frontier table

This is a selected frontier table, not the full sweep. See results/_combo_full.csv, results/_combo_full_accounting_corrected.csv, and results/_combo_final_table.md for more rows. _combo_full.csv has been realigned to the same conservative accounting model; _combo_full_accounting_corrected.csv is retained as an explicit audit artifact.

| Class | Method | Effective bits/dim | Attn-out cosine | Score cosine | Notes |
| --- | --- | --- | --- | --- | --- |
| External 2-bit best | IsoQuant-2b | 2.000000 | 0.735624 | 0.971382 | best among locally reproduced external 2-bit baselines by attn-out |
| MegaQuant near-2.16-bit offline proxy | sink1_channelwise_four_level_2bit | 2.161955 | 0.835087 | 0.985141 | offline channelwise proxy; corrected sink-token/channel-scale accounting |
| MegaQuant 2.26-bit | hadamard_affine_four_level_2bit_g64_meta8 | 2.255399 | 0.851023 | 0.987475 | low-bit Pareto point; slightly higher attn-out than local RotorQuant-3b in this benchmark |
| External 3-bit best | RotorQuant-3b | 3.000000 | 0.848665 | 0.988019 | best among locally reproduced external 3-bit baselines by attn-out |
| MegaQuant strict 3-bit | hadamard_affine_four_level_2bit_g32 | 3.000000 | 0.859203 | 0.990216 | best strict 3.0 effective-bit MegaQuant in this benchmark |
| MegaQuant main point | affine_seven_level_3bit_g64_meta4 | 3.130399 | 0.942270 | 0.997257 | best observed quality/compression tradeoff in this benchmark; includes conservative metadata-range overhead |
| MegaQuant high-quality ref | affine_seven_level_3bit_g32 | 4.000000 | 0.948879 | 0.997991 | high-quality reference |

Selected methods

sink1_channelwise_four_level_2bit

Near-2.16-bit offline proxy. It quantizes channels with a four-level 2-bit codebook and keeps the first/sink token in FP16. Because channelwise scales are estimated over the evaluated prefix, this point is reported as an offline proxy rather than a proven streamable KV-cache format.

effective_bits_per_dim = 2.161955
attn_out_cos_mean      = 0.835087
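
A minimal sketch of the idea, assuming a symmetric four-level codebook and an absolute-max channel scale (neither is confirmed as the repo's exact choice):

```python
import numpy as np

def sink1_channelwise_2bit(kv: np.ndarray) -> np.ndarray:
    """kv: (tokens, dims). Token 0 (the attention sink) is kept in FP16."""
    sink, rest = kv[:1], kv[1:]
    # channelwise scale estimated over the evaluated prefix (hence "offline proxy")
    scale = np.abs(rest).max(axis=0, keepdims=True) + 1e-12
    levels = np.array([-0.75, -0.25, 0.25, 0.75])  # assumed 4-level (2-bit) codebook
    codes = np.abs(rest[..., None] / scale[..., None] - levels).argmin(axis=-1)
    return np.concatenate([sink, levels[codes] * scale], axis=0)
```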

hadamard_affine_four_level_2bit_g64_meta8

Best 2.26-bit Pareto point in this benchmark. It applies fixed Hadamard/RHT-style preconditioning, then asymmetric 2-bit affine quantization with group size 64 and int8 scale/zero metadata.

effective_bits_per_dim = 2.255399
attn_out_cos_mean      = 0.851023
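
A sketch under stated assumptions: the per-head dimension is a power of two (GPT-2 uses 64, so scipy's hadamard applies), and the int8 scale/zero metadata is kept as float here for clarity:

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_affine_2bit(x: np.ndarray, g: int = 64) -> np.ndarray:
    """Fixed Hadamard preconditioning, then asymmetric 2-bit affine quant per group of g."""
    n, d = x.shape
    H = hadamard(d).astype(np.float64) / np.sqrt(d)  # orthonormal; d must be a power of two
    y = (x @ H).reshape(-1, g)
    lo = y.min(axis=1, keepdims=True)                        # per-group zero point
    scale = (y.max(axis=1, keepdims=True) - lo) / 3 + 1e-12  # 4 levels -> codes 0..3
    q = np.clip(np.round((y - lo) / scale), 0, 3)            # the stored 2-bit codes
    return ((q * scale + lo).reshape(n, d)) @ H.T            # dequantize, invert transform
```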

hadamard_affine_four_level_2bit_g32

Best strict 3.0 effective-bit method in this benchmark.

effective_bits_per_dim = 3.000000
attn_out_cos_mean      = 0.859203

affine_seven_level_3bit_g64_meta4

Main observed winner in this benchmark. It uses asymmetric 3-bit affine quantization with group size 64 and simulated int4 scale/zero metadata.

effective_bits_per_dim = 3.130399
attn_out_cos_mean      = 0.942270
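
The same groupwise affine pattern with seven levels instead of four; a sketch that omits the simulated int4 scale/zero storage and the conservative shared-range term:

```python
import numpy as np

def affine_seven_level_3bit(x: np.ndarray, g: int = 64) -> np.ndarray:
    """Asymmetric affine quantization with 7 levels per group; codes 0..6 fit in 3 bits."""
    groups = x.reshape(-1, g)
    lo = groups.min(axis=1, keepdims=True)
    scale = (groups.max(axis=1, keepdims=True) - lo) / 6 + 1e-12  # 7 levels -> codes 0..6
    q = np.clip(np.round((groups - lo) / scale), 0, 6)
    return (q * scale + lo).reshape(x.shape)
```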

affine_seven_level_3bit_g32

High-quality reference.

effective_bits_per_dim = 4.000000
attn_out_cos_mean      = 0.948879

Reproduce

Install dependencies:

python -m pip install -r requirements.txt

Place SQuAD files in the repository root as described in DATA.md.

Run the MegaQuant-only benchmark from the repository root:

python scripts/benchmark_real_model.py \
  --skip-external \
  --model gpt2 \
  --local-only \
  --rotorquant-path none \
  --output-csv results/_combo_full.csv \
  --output-md results/_combo_full.md

The selected final comparison table and related reports are available at:

results/_combo_final_table.md
results/_combo_full_accounting_corrected.csv
results/publishable_comparison.md
results/efficient_frontier_report.md

Project files

  • methods/efficient_frontier.md - method cards for the selected frontier methods.
  • results/efficient_frontier_report.md - research-style report and selected method rationale.
  • results/publishable_comparison.md - compact comparison table for presentation.
  • results/_combo_final_table.md - selected full comparison table.
  • results/research_ideas_next_round.md - next-step idea bank.

Changelog

  • CHANGELOG.md

Related prior-work topics

The methods are inspired by common quantization and compression ideas such as groupwise affine quantization, metadata-aware accounting, Hadamard/RHT-style preconditioning, attention sink protection, and prior KV-cache work such as KIVI and KVQuant. Comparisons to RotorQuant/TurboQuant/IsoQuant/PlanarQuant in this repo are local benchmark comparisons only.

Honest limitations

This project currently demonstrates a CPU/Python fake-quant quality frontier.

Not yet proven:

  • packed 2/3-bit storage,
  • CUDA kernels,
  • throughput/latency,
  • real VRAM savings,
  • long-context evaluation,
  • LLaMA/Qwen/Mistral-family models,
  • perplexity or downstream generation quality,
  • exact reproduction of official external baseline kernels.

Conservative claim:

In this CPU-only GPT-2 fake-quant benchmark, under a theoretical effective-bit accounting model, selected MegaQuant methods show a stronger observed attention-output-cosine/compression frontier than the tested local Python baseline implementations.


Repository positioning

This repository is public as a research PoC. It is not a production inference engine.

Suggested GitHub topics after public release:

llm kv-cache quantization compression low-bit inference research
