psaesha/model-compression-resources

Model Compression Papers and Repositories

Generated from papers.txt and repos.txt · 21 papers · 1 repo · 2026-03-17. Add URLs to papers.txt or repos.txt and commit; the GitHub Action regenerates this page automatically.
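A minimal sketch of what the regeneration step might look like. This is an illustration only, not the repository's actual Action script: `render_section` is a hypothetical helper, and the metadata lookup (resolving each URL in papers.txt to a title, author list, and year) is assumed to happen upstream.

```python
# Hypothetical generator sketch: papers.txt is assumed to hold one URL per
# line; resolving URLs to (title, authors, year) metadata is stubbed out.
def render_section(heading, entries):
    """Render entries in the numbered-list style used by this README.

    entries: list of (title, authors, year) tuples.
    """
    lines = [f"## {heading}", ""]
    for i, (title, authors, year) in enumerate(entries, start=1):
        # Same "Title — Authors (Year)" separator the generated lists use.
        lines.append(f"{i}. {title} — {authors} ({year})")
    return "\n".join(lines)

print(render_section("Papers", [
    ("GPTQ: Accurate Post-Training Quantization for Generative "
     "Pre-trained Transformers",
     "Elias Frantar, Saleh Ashkboos, Torsten Hoefler et al.", 2022),
]))
```

On each push, the Action would run a script like this over both input files and overwrite README.md with the result.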

Quantization

Papers

  1. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Guangxuan Xiao, Ji Lin, Mickael Seznec et al. (2022)
  2. First-Order Error Matters: Accurate Compensation for Quantized Large Language Models — Xingyu Zheng, Haotong Qin, Yuye Li et al. (2025)
  3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference — Benoit Jacob, Skirmantas Kligys, Bo Chen et al. (2017)
  4. 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs — Jinheng Wang, Hansong Zhou, Ting Song et al. (2024)
  5. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Tim Dettmers, Mike Lewis, Younes Belkada et al. (2022)
  6. QLoRA: Efficient Finetuning of Quantized LLMs — Tim Dettmers, Artidoro Pagnoni, Ari Holtzman et al. (2023)
  7. Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs — Paul Jonas Kurz, Tobias Jan Wieczorek, Mohamed A. Abdelsalam et al. (2026)
  8. Float8@2bits: Entropy Coding Enables Data-Free Model Compression — Patrick Putzky, Martin Genzel, Mattes Mollenhauer et al. (2026)
  9. CoopQ: Cooperative Game Inspired Layerwise Mixed Precision Quantization for LLMs — Junchen Zhao, Ali Derakhshan, Jayden Kana Hyman et al. (2025)
  10. Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model — Jakub Prejzner (2026)
  11. CASP: Compression of Large Multimodal Models Based on Attention Sparsity — Mohsen Gholami, Mohammad Akbari, Kevin Cannons et al. (2025)
  12. Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models — Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov et al. (2025)
  13. PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression — Vladimir Malinovskii, Denis Mazur, Ivan Ilin et al. (2024)
  14. Extreme Compression of Large Language Models via Additive Quantization — Vage Egiazarian, Andrei Panferov, Denis Kuznedelev et al. (2024)
  15. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin, Jiaming Tang, Haotian Tang et al. (2023)
  16. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Elias Frantar, Saleh Ashkboos, Torsten Hoefler et al. (2022)
  17. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks — Albert Tseng, Jerry Chee, Qingyao Sun et al. (2024)
  18. Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression — Zhichao Xu, Ashim Gupta, Tao Li et al. (2024)
  19. ReALLM: A general framework for LLM compression and fine-tuning — Louis Leconte, Lisa Bedin, Van Minh Nguyen et al. (2024)

Repositories

Pruning

Papers

  1. LLM Pruning and Distillation in Practice: The Minitron Approach — Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi et al. (2024)
  2. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference — Qichen Fu, Minsik Cho, Thomas Merth et al. (2024)
