Generated from `papers.txt` and `repos.txt` · 21 papers · 1 repo · 2026-03-17. Add URLs to `papers.txt` or `repos.txt` and commit; the Action regenerates this list automatically.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Guangxuan Xiao, Ji Lin, Mickael Seznec et al. (2022)
- First-Order Error Matters: Accurate Compensation for Quantized Large Language Models — Xingyu Zheng, Haotong Qin, Yuye Li et al. (2025)
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference — Benoit Jacob, Skirmantas Kligys, Bo Chen et al. (2017)
- 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs — Jinheng Wang, Hansong Zhou, Ting Song et al. (2024)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Tim Dettmers, Mike Lewis, Younes Belkada et al. (2022)
- QLoRA: Efficient Finetuning of Quantized LLMs — Tim Dettmers, Artidoro Pagnoni, Ari Holtzman et al. (2023)
- Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs — Paul Jonas Kurz, Tobias Jan Wieczorek, Mohamed A. Abdelsalam et al. (2026)
- Float8@2bits: Entropy Coding Enables Data-Free Model Compression — Patrick Putzky, Martin Genzel, Mattes Mollenhauer et al. (2026)
- CoopQ: Cooperative Game Inspired Layerwise Mixed Precision Quantization for LLMs — Junchen Zhao, Ali Derakhshan, Jayden Kana Hyman et al. (2025)
- Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model — Jakub Prejzner (2026)
- CASP: Compression of Large Multimodal Models Based on Attention Sparsity — Mohsen Gholami, Mohammad Akbari, Kevin Cannons et al. (2025)
- Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models — Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov et al. (2025)
- PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression — Vladimir Malinovskii, Denis Mazur, Ivan Ilin et al. (2024)
- Extreme Compression of Large Language Models via Additive Quantization — Vage Egiazarian, Andrei Panferov, Denis Kuznedelev et al. (2024)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin, Jiaming Tang, Haotian Tang et al. (2023)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Elias Frantar, Saleh Ashkboos, Torsten Hoefler et al. (2022)
- QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks — Albert Tseng, Jerry Chee, Qingyao Sun et al. (2024)
- Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression — Zhichao Xu, Ashim Gupta, Tao Li et al. (2024)
- ReALLM: A general framework for LLM compression and fine-tuning — Louis Leconte, Lisa Bedin, Van Minh Nguyen et al. (2024)
- ggml-org/llama.cpp — LLM inference in C/C++
- LLM Pruning and Distillation in Practice: The Minitron Approach — Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi et al. (2024)
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference — Qichen Fu, Minsik Cho, Thomas Merth et al. (2024)