utkrisht3108/Reasoning-LLMs-Profiling-KV-Cache-Quantization-Overhead

Profiling KV Cache Quantization Overhead in Reasoning LLMs

A kernel-level systems study of what really happens when you turn on KV cache quantization for reasoning LLMs. We instrument decode-time inference of DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-7B on an A100, sweep across precisions (FP16, INT8/HQQ, INT4/quanto), output lengths (256–2048 tokens), and residual_length settings (32–512), and measure throughput, peak GPU memory, energy per token, per-CUDA-kernel time, and downstream accuracy on MATH and MMLU.

Course: 15-642 (CMU). Final report and poster live in report/.

Why this project

Modern reasoning models (DeepSeek-R1, etc.) generate long chains of thought, which means decode now runs for hundreds or thousands of tokens per request. The KV cache scales linearly with output length, so it gets pitched as the obvious place to compress: store keys and values in INT8 or INT4, save the bytes, and serve more concurrent users on the same GPU.

The catch is that quantized cache entries have to be dequantized before attention can use them, so every decode step pays an extra dequantization cost on a cache that grows the longer the model thinks. Most published work on KV cache compression reports aggregate throughput and accuracy and stops there. We wanted to see what's actually going on at the kernel level: where does the overhead live, does the memory-bandwidth saving show up in practice on a 7B model, and how does any of this interact with the long decode phase that reasoning models spend most of their time in.

What's in the repo

The src/ folder is the measurement harness. It wraps model.generate() in PyTorch Profiler with both CPU and CUDA activities recorded, samples GPU power through NVML in a background thread, and dumps everything to JSON. Kernel events are mapped into eight semantic categories (attention, FFN, dequantization, elementwise, normalization, memory, sampling, other) using regex pattern matching against the kernel names that PyTorch hands us. INT4/quanto exposes named dequantization kernels (QBitsDequantizer, quanto::unpack), so for that backend dequantization shows up as a clean line on the breakdown. INT8/HQQ does not expose a named kernel, so its dequantization cost is recovered as a differential against the FP16 baseline, mostly hiding inside elementwise ops like aten::copy_, aten::sub, and aten::mul.
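
As a rough illustration of the categorization step described above, here is a minimal sketch of a regex-based kernel classifier. It is not the contents of src/kernel_categories.py: the function names and most of the patterns are assumptions made for the example. Only the two quanto kernel names and the aten:: elementwise ops come from the text above.

```python
import re

# Ordered (category, pattern) pairs. The first match wins, so the specific
# dequantization names are checked before the broad elementwise patterns.
# Only QBitsDequantizer, quanto::unpack and the aten::copy_/sub/mul ops are
# taken from the README text; every other pattern is an illustrative guess.
CATEGORY_PATTERNS = [
    ("dequantization", re.compile(r"QBitsDequantizer|quanto::unpack|dequant", re.I)),
    ("attention", re.compile(r"flash|attention|softmax", re.I)),
    ("ffn", re.compile(r"gemm|cutlass|cublas|aten::(mm|linear|addmm)", re.I)),
    ("normalization", re.compile(r"layer_?norm|rms_?norm", re.I)),
    ("sampling", re.compile(r"argmax|topk|multinomial", re.I)),
    ("memory", re.compile(r"memcpy|memset", re.I)),
    ("elementwise", re.compile(r"elementwise|aten::(copy_|sub|mul|add)", re.I)),
]


def classify_kernel(name: str) -> str:
    """Return the first category whose pattern matches, else 'other'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(name):
            return category
    return "other"


def category_shares(events):
    """events: iterable of (kernel_name, cuda_time_us) -> {category: share of total time}."""
    totals = {}
    for name, cuda_time_us in events:
        category = classify_kernel(name)
        totals[category] = totals.get(category, 0.0) + cuda_time_us
    grand_total = sum(totals.values())
    return {c: t / grand_total for c, t in totals.items()} if grand_total else {}


def share_delta(quantized_shares, fp16_shares, category="elementwise"):
    """Differential estimate: how much larger a category's share is than in the FP16 baseline."""
    return quantized_shares.get(category, 0.0) - fp16_shares.get(category, 0.0)
```

The share_delta helper mirrors the differential idea used for INT8/HQQ: with no named dequantization kernel to match, the extra cost is read off as the growth of the elementwise share relative to FP16.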

The Modal entry points (run_*.py at the repo root) drive everything from a single command: run_profiling.py for the 32 systems profiles, run_accuracy.py for the MATH-100 / MMLU sweeps, run_validation.py for a 3-config smoke test, and run_int4_ablations.py for the small INT4 corruption diagnostic.

The raw artifacts in results/raw/ are one JSON per profiling configuration, with kernel-level traces, memory snapshots, and per-step power samples. results/accuracy/ is one JSON per accuracy run. results/processed/summary.csv is a 32-row CSV that aggregates the systems metrics, and results/figures/ is where analyze.py writes the publication figures used in the report and poster.

.
├── src/
│   ├── config.py                 # 34-config experiment matrix
│   ├── utils.py                  # model loading, GenerationConfig, EnergyMonitor
│   ├── profiler.py               # PyTorch Profiler wrapper, NVML power sampling
│   ├── kernel_categories.py      # 8-category kernel classifier
│   ├── memory_breakdown.py       # weights vs cache vs framework split
│   ├── evaluate_accuracy.py      # MATH-100 + MMLU eval, balanced \boxed{} parsing
│   └── analyze.py                # plots + summary.csv from raw JSON
├── run_profiling.py              # Modal entry: 32-config systems profile
├── run_accuracy.py               # Modal entry: MATH + MMLU
├── run_validation.py             # Modal entry: 3-config smoke test
├── run_int4_ablations.py         # INT4/quanto corruption diagnostic
├── scripts/
│   └── validate_accuracy_artifacts.py
├── results/
│   ├── raw/                      # one JSON per profiled config
│   ├── accuracy/                 # one JSON per accuracy run
│   ├── diagnostics/              # INT4 ablations
│   ├── processed/summary.csv     # aggregated systems metrics (32 rows)
│   └── figures/                  # PNGs used in report + poster
└── report/
    ├── Final_Report.pdf
    └── Poster.pdf

Headline results

The full numbers and their interpretation are in report/Final_Report.pdf; the very short version is below.

At our scale (7B model, batch size 1, single A100-40GB), KV cache quantization did not save peak memory. All twelve 2048-token configurations sit between 15.26 and 15.30 GB, because model weights and framework overhead dwarf the cache. The KV cache itself is only about 0.03 GB at this point, so compressing it from FP16 to INT4 saves bytes that disappear into rounding noise.

Quantization also did not buy a reliable throughput win. At 2048 tokens, DeepSeek-R1 INT4/quanto runs about 7.3% faster than FP16, but Qwen2.5-7B INT8/HQQ is 20.5% slower than FP16 and Qwen2.5-7B INT4/quanto is 13% slower. Same backends, different models, opposite signs.

Energy goes the wrong way. FP16 is the most energy-efficient precision in every configuration we measured. DeepSeek INT8 spends 42.4% more joules per generated token than FP16 (13.23 vs 9.29 J/tok), and the penalty does not amortize at longer sequences. The likely cause, visible in the kernel breakdown, is that dequantization adds CUDA work at every decode step (1.2% of CUDA time as a named kernel for INT4, plus a 3–4 percentage-point bump in elementwise time for INT8) without producing a matching reduction in DRAM traffic, since the cache is too small to begin with.
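
To make the energy-per-token metric concrete, here is a minimal sketch of how sampled power can be turned into joules per token and into the relative penalty quoted above. The function names are hypothetical and this is not the repo's EnergyMonitor. It assumes the power samples are already in watts (NVML reports milliwatts) and uses a trapezoidal integral, which may differ from the integration the harness actually performs.

```python
def energy_joules(timestamps_s, power_w):
    """Trapezoidal integral of sampled power (watts) over time (seconds)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_token(timestamps_s, power_w, n_tokens):
    """Energy spent during generation divided by the number of generated tokens."""
    return energy_joules(timestamps_s, power_w) / n_tokens


def relative_overhead(quantized, baseline):
    """Fractional increase of a quantized measurement over the FP16 baseline."""
    return (quantized - baseline) / baseline


# Check the DeepSeek INT8 vs FP16 figure quoted in the text (13.23 vs 9.29 J/tok).
print(f"{relative_overhead(13.23, 9.29):.1%}")  # prints 42.4%
```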

Accuracy is where it gets interesting. MMLU stayed flat across all precisions on both models (DeepSeek 38.0%, Qwen 59.5–60.5%), so a deployment evaluating only short multiple-choice answers would see no problem. MATH was a different story. INT8/HQQ preserved reasoning quality (DeepSeek 69% → 67%, Qwen 82% → 82%), but INT4/quanto collapsed to 0% on both models, with corrupted text appearing well before the answer parser fired. INT4 also scored 0/12 on a small short-context diagnostic, and the current HuggingFace QuantizedCache API does not let us configure group_size to debug it further. We treat this as a backend failure mode of optimum-quanto on these models rather than a general claim about INT4 cache quantization.

The residual_length parameter (default 128) turned out to be the most sensitive single knob. INT8/HQQ peaks at the default and degrades on either side; INT4/quanto peaks at the default and at N=32 with irregular behaviour in between. For systems-level deployment, the HuggingFace default is the best operating point we found.

How to reproduce

Everything below assumes Python 3.12 and an A100 (or an equivalent GPU that supports cp.async and the .m16n8k16 MMA shape). The runs were executed on Modal; the Modal entry points are at the repo root.

# install (matches the container we used: nvcr.io/nvidia/pytorch:25.02-py3)
pip install -r requirements.txt    # transformers, optimum-quanto, hqq, pynvml, datasets, modal, matplotlib

# critical: restrict TORCH_CUDA_ARCH_LIST to the A100 architecture only.
# without this, optimum-quanto tries to compile its kernels for older
# architectures (sm_75) that lack the features those kernels need.
export TORCH_CUDA_ARCH_LIST="8.0"

# 3-config smoke test (FP16, INT8, INT4 at 256 tokens)
modal run run_validation.py

# full systems profile (all 32 configurations; checkpoint-resumes if preempted)
modal run run_profiling.py

# accuracy sweep (MATH-100 + MMLU on all configs)
modal run run_accuracy.py

# INT4/quanto corruption diagnostic
modal run run_int4_ablations.py

# build summary.csv + all figures from results/raw/
python -m src.analyze

If you are running locally instead of on Modal, the src/ modules are independent of the Modal entry points and can be driven directly. src/profiler.py::profile_configuration runs two warmup generations and three measured generations per configuration and returns a JSON-ready dict.

Configuration matrix

The core matrix is 2 models × 3 precisions × 4 output lengths = 24 configurations, with INT8 going through the HQQ backend and INT4 going through the optimum-quanto backend (HQQ and quanto are the two backends that HuggingFace's cache_implementation="quantized" API supports). On top of that we run a residual_length sweep on DeepSeek-R1 at 2048 tokens for INT8 and INT4 across N ∈ {32, 64, 128, 256, 512}. The two N=128 entries share filenames with the corresponding core 2048-token configurations, so the logical design is 34 configurations and the on-disk artifact set is 32 distinct JSON files.
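
The 34-versus-32 count can be checked with a short enumeration. This is a sketch and not the contents of src/config.py: the precision labels and tuple layout are made up for the example, and the two intermediate output lengths (512 and 1024) are an assumption, since the text gives only the 256–2048 range and a count of four.

```python
from itertools import product

MODELS = ["DeepSeek-R1-Distill-Qwen-7B", "Qwen2.5-7B"]
PRECISIONS = ["fp16", "int8", "int4"]
OUTPUT_LENGTHS = [256, 512, 1024, 2048]  # 512 and 1024 are assumed values
DEFAULT_RESIDUAL_LENGTH = 128
SWEEP_RESIDUAL_LENGTHS = [32, 64, 128, 256, 512]

# Core matrix: every model/precision/length combination at the default residual_length.
core = [
    (model, precision, length, DEFAULT_RESIDUAL_LENGTH)
    for model, precision, length in product(MODELS, PRECISIONS, OUTPUT_LENGTHS)
]

# residual_length sweep: DeepSeek-R1 only, 2048 tokens, quantized precisions only.
sweep = [
    ("DeepSeek-R1-Distill-Qwen-7B", precision, 2048, n)
    for precision in ("int8", "int4")
    for n in SWEEP_RESIDUAL_LENGTHS
]

logical = core + sweep   # the design as written down
distinct = set(logical)  # the two N=128 sweep entries collapse onto core configs

print(len(core), len(sweep), len(logical), len(distinct))  # prints 24 10 34 32
```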

Every configuration runs twice for warmup (discarded) and three times for measurement, with greedy decoding (do_sample=False) and the same prompt for reproducibility. Cost on Modal A100-40GB came out around $60 for the whole study.
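
The warmup-then-measure protocol can be sketched as a small harness around any callable. This is an illustration only, not src/profiler.py::profile_configuration: the function name, the run_once callable, and the returned keys are hypothetical. On a GPU, run_once would also need to synchronize the device (for example with torch.cuda.synchronize()) before returning, so that the wall-clock timer covers the full generation.

```python
import time
from statistics import mean


def measure_throughput(run_once, n_warmup=2, n_measured=3):
    """Call run_once() n_warmup times (discarded), then time n_measured calls.

    run_once must return the number of tokens it generated.
    """
    for _ in range(n_warmup):
        run_once()  # warmup: results are thrown away

    tokens_per_s = []
    for _ in range(n_measured):
        start = time.perf_counter()
        n_tokens = run_once()
        elapsed = time.perf_counter() - start
        tokens_per_s.append(n_tokens / elapsed)

    return {"tokens_per_s": tokens_per_s, "mean_tokens_per_s": mean(tokens_per_s)}


if __name__ == "__main__":
    def fake_generate():
        time.sleep(0.01)  # stand-in for model.generate()
        return 256

    print(measure_throughput(fake_generate))
```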

Notes for anyone extending this

A few practical things we hit along the way that aren't in the report:

  • transformers 4.57.6 doesn't expose QuantizedCacheConfig as a class, so tutorials that import it fail on this version. Instead, build the cache config as a plain dict and pass it through GenerationConfig(cache_implementation="quantized", cache_config={"backend": "...", "nbits": ...}). The backend name has to be lowercase; "HQQ" raises Unknown quantization backend.
  • PyTorch 2.7 dropped cuda_time_total from profiler events; use getattr(evt, "self_cuda_time_total", getattr(evt, "self_device_time_total", 0)) if you want the code to keep working on older torch versions too.
  • PyTorch Profiler slows down generation by 5–8x when both CPU and CUDA activities are recorded. Treat the relative kernel proportions as the trustworthy signal and use unprofiled runs (or summary.csv) for absolute throughput numbers.
  • Modal workers do get preempted on long accuracy runs. run_accuracy.py ships with checkpoint-resume so re-running the same command picks up where it left off, and run_profiling.py skips already-completed configurations by default.
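
The first two notes can be sketched as a pair of small helpers. Both function names are hypothetical and not taken from the repo. The first builds the keyword arguments described above and rejects a non-lowercase backend name early; the second wraps the getattr fallback so it can be exercised without a GPU. The GenerationConfig call is left as a comment because it needs transformers installed.

```python
from types import SimpleNamespace

# The two backend names mentioned in this README, lowercase as the API expects.
SUPPORTED_BACKENDS = {"hqq", "quanto"}


def quantized_generation_kwargs(backend: str, nbits: int) -> dict:
    """Build kwargs for GenerationConfig with a plain-dict cache config."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"backend must be one of {sorted(SUPPORTED_BACKENDS)} (lowercase), got {backend!r}"
        )
    return {
        "cache_implementation": "quantized",
        "cache_config": {"backend": backend, "nbits": nbits},
    }


def device_time_us(evt):
    """Read self device time from a profiler event across PyTorch versions."""
    return getattr(evt, "self_cuda_time_total", getattr(evt, "self_device_time_total", 0))


# Intended use with transformers (not executed here):
#   from transformers import GenerationConfig
#   gen_config = GenerationConfig(**quantized_generation_kwargs("hqq", 8))

# Stand-in events that mimic the old and new profiler attribute names.
old_style = SimpleNamespace(self_cuda_time_total=120)
new_style = SimpleNamespace(self_device_time_total=95)
print(device_time_us(old_style), device_time_us(new_style))  # prints 120 95
```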

Citing / acknowledgements

If you build on this, the report PDF in report/ is the canonical reference for the numbers. The infrastructure stack we depended on is HuggingFace Transformers, optimum-quanto, HQQ, PyTorch Profiler, NVIDIA NVML, and Modal — see the references list in the report for the full set.
