
Issue/253: feat: support offline int8 kv cache quantization #254

Open
qinyiqun wants to merge 3 commits into main from Issue/253

Conversation

qinyiqun (Contributor) commented Mar 4, 2026

Support offline int8 kv cache quantization for static kv cache

@qinyiqun qinyiqun requested review from a team and wooway777 March 4, 2026 07:20
@qinyiqun qinyiqun changed the title Issue/253: feat: support custom KV cache dtype for quantization Issue/253: feat: support offline int8 kv cache quantization Mar 18, 2026
std::optional<infinicore::Tensor> block_tables,
std::optional<infinicore::Tensor> slot_mapping) const;

infinicore::Tensor kv_cache_k_scale() const {
Collaborator: Are these two meant as getters for member variables that you define yourself and then use in place?

Contributor Author:
INFINICORE_NN_PARAMETER(kv_cache_k_scale);
This declares a scale parameter; the attention layer reads the parameter and stores it in kv_cache_k_scale_. The purpose of this function is to retrieve that weight tensor.
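A minimal sketch of the pattern being described (the real `INFINICORE_NN_PARAMETER` macro and `infinicore::Tensor` are framework types; here a plain `std::vector` stands in for the weight tensor, and the class and loader method are illustrative only):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in for infinicore::Tensor in this sketch.
using Tensor = std::vector<float>;

// Hypothetical attention layer: the framework's parameter loader would
// fill kv_cache_k_scale_ from the checkpoint; the getter only exposes it.
class AttentionLayer {
public:
    void load_kv_cache_k_scale(Tensor scale) { kv_cache_k_scale_ = std::move(scale); }

    // Getter pattern discussed above: return the stored weight tensor.
    const Tensor &kv_cache_k_scale() const { return kv_cache_k_scale_; }

private:
    Tensor kv_cache_k_scale_;
};
```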


infinicore::DataType kv_cache_dtype() const;
void set_kv_cache_dtype(infinicore::DataType dtype) const;
bool kv_cache_dtype_is_set() const { return kv_cache_dtype_set_; }
Collaborator: If the kv cache dtype may or may not be passed, I suggest using std::optional; StaticKVCache below could do the same. Then make sure only one of the two sides passes the dtype.

Contributor Author: fixed
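A sketch of the std::optional approach the reviewer suggests (`DataType` is simplified to a local enum standing in for `infinicore::DataType`, and `resolve_kv_cache_dtype` is a hypothetical helper enforcing that at most one side supplies the dtype):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

enum class DataType { F16, Int8 };  // simplified stand-in for infinicore::DataType

// Hypothetical helper: at most one of the two callers may pass a dtype;
// if neither does, fall back to a default (no quantization).
DataType resolve_kv_cache_dtype(std::optional<DataType> from_model,
                                std::optional<DataType> from_cache,
                                DataType fallback = DataType::F16) {
    if (from_model && from_cache) {
        throw std::invalid_argument("kv cache dtype set in both places");
    }
    if (from_model) return *from_model;
    if (from_cache) return *from_cache;
    return fallback;
}
```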

infinicore::quantization::QuantScheme get_quant_scheme() const;
std::shared_ptr<infinicore::nn::RoPE::ScalingConfig> get_rope_scaling() const;
void set_kv_quant_scheme(std::string kv_cache_dtype) {
if (kv_cache_dtype == "int8") {
Collaborator: If it doesn't enter the if, report an error or something.

Contributor Author: The default here is NONE, which means quantization is not used.

}
}
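One way the two comments above could be reconciled, as a sketch (`QuantScheme` and the accepted strings are assumptions; the default is None, meaning no quantization, and an unrecognized string raises instead of being silently ignored):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class QuantScheme { None, KvInt8 };  // simplified stand-in

// Sketch: an empty string keeps the default (no quantization), "int8"
// selects int8 kv cache quantization, and anything else is an error
// rather than silently falling through the if.
QuantScheme parse_kv_quant_scheme(const std::string &kv_cache_dtype) {
    if (kv_cache_dtype.empty()) return QuantScheme::None;
    if (kv_cache_dtype == "int8") return QuantScheme::KvInt8;
    throw std::invalid_argument("unsupported kv_cache_dtype: " + kv_cache_dtype);
}
```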

void set_kv_quant_scheme(std::string kv_cache_dtype) {
Collaborator: I feel this should be swapped with the set function in the model_config file: model_config handles parsing and dispatch, quant_config handles the concrete logic.

Contributor Author: I feel quant_config should fully control everything quantization-related; model_config should at most do a forward.
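A sketch of the division of responsibility the author describes (class names and members are illustrative, not the actual InfiniLM classes): quant_config owns all quantization state and logic, and model_config only forwards.

```cpp
#include <cassert>
#include <string>

// Hypothetical QuantConfig: owns all quantization-related state and logic.
class QuantConfig {
public:
    void set_kv_quant_scheme(const std::string &dtype) { use_int8_kv_ = (dtype == "int8"); }
    bool kv_cache_is_int8() const { return use_int8_kv_; }

private:
    bool use_int8_kv_ = false;
};

// Hypothetical ModelConfig: performs no quantization logic itself,
// it only forwards the setting to the quant config it holds.
class ModelConfig {
public:
    void set_kv_quant_scheme(const std::string &dtype) { quant_.set_kv_quant_scheme(dtype); }
    const QuantConfig &quant() const { return quant_; }

private:
    QuantConfig quant_;
};
```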

INFINICORE_NN_MODULE_INIT(k_norm, head_dim_, model_config_->get<double>("rms_norm_eps"), dtype, device);
}

switch (this->model_config_->get_kv_quant_scheme()) {
Collaborator: Can these changes be pulled out of the attention layer into a separate file? Once there are more algorithms this will be hard to maintain.

Contributor Author: fixed
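One possible shape for the extraction the reviewer asks for, as a sketch (all names hypothetical): the per-scheme quantization kernels live in their own translation unit behind a single dispatch function, so the attention layer only ever calls `quantize_kv` and never switches on the scheme itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

enum class QuantScheme { None, KvInt8 };  // simplified stand-in

// kv_quant.cpp (sketch): per-scheme kernels kept out of the attention layer.
// Quantize by dividing by the scale and rounding to the nearest int8.
static std::vector<int8_t> quantize_int8(const std::vector<float> &x, float scale) {
    std::vector<int8_t> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    }
    return out;
}

// Single dispatch entry point the attention layer would call.
std::vector<int8_t> quantize_kv(QuantScheme scheme, const std::vector<float> &x, float scale) {
    switch (scheme) {
    case QuantScheme::KvInt8:
        return quantize_int8(x, scale);
    case QuantScheme::None:
    default:
        return {};  // no quantization requested
    }
}
```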

def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0):
_infinilm.StaticKVCacheConfig.__init__(self, max_batch_size, max_cache_len)

def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0, kv_cache_dtype: str | None = None):
Collaborator (@PanZezhong1725, Mar 19, 2026): It would be best to additionally provide an interface that accepts the framework's dtype, and to provide the parse mapping in Python, so Python users can see what each string means by reading the Python file.

Contributor Author: (screenshot attached) In vLLM, strings are used everywhere before flash attention, so I think using strings is not a big problem as long as it is documented.

py::arg("max_batch_size") = 1,
py::arg("max_cache_len") = std::numeric_limits<infinicore::Size>::max())
.def(
py::init<infinicore::Size, infinicore::Size, std::string>(),
Collaborator: External interfaces should avoid string input and use the framework's dtype instead; put the parsing work outside.

Contributor Author: The Python layer has no corresponding parse functionality and does not parse model_config either. Since model_config is parsed in C++, this is parsed in C++ as well.
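A sketch of the string-to-dtype parsing the discussion settles on doing in C++ (`DataType` and the accepted strings are assumptions; the real binding would pass the parsed value into `StaticKVCacheConfig`). Parsing once at the binding boundary means the rest of the C++ code only ever sees the framework dtype:

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class DataType { F16, Int8 };  // simplified stand-in for infinicore::DataType

// Sketch: map the Python-facing string to a dtype at the binding boundary.
// An absent string means "no quantization"; an unknown string is an error.
std::optional<DataType> parse_kv_cache_dtype(const std::optional<std::string> &s) {
    if (!s) return std::nullopt;
    if (*s == "int8") return DataType::Int8;
    if (*s == "f16" || *s == "fp16") return DataType::F16;
    throw std::invalid_argument("unsupported kv_cache_dtype string: " + *s);
}
```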

…quant.cpp; (2)update kv_cache_dtype handling; (3)Update Python test scripts
