
Issue/253: feat: support offline int8 kv cache quantization #254

Open
qinyiqun wants to merge 3 commits into main from Issue/253

Conversation

qinyiqun (Contributor) commented Mar 4, 2026

Support offline int8 kv cache quantization for static kv cache

@qinyiqun qinyiqun requested review from a team and wooway777 March 4, 2026 07:20
@qinyiqun qinyiqun changed the title Issue/253: feat: support custom KV cache dtype for quantization Issue/253: feat: support offline int8 kv cache quantization Mar 18, 2026
std::optional<infinicore::Tensor> block_tables,
std::optional<infinicore::Tensor> slot_mapping) const;

infinicore::Tensor kv_cache_k_scale() const {
Collaborator: Are these two meant as getters for member variables that you define yourself and then use in place?

Contributor Author:
INFINICORE_NN_PARAMETER(kv_cache_k_scale);
This declares a scale parameter; the attention layer reads the parameter and stores it in kv_cache_k_scale_. The purpose of this function is to retrieve that weight tensor.
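A minimal sketch of the pattern being described (the real `INFINICORE_NN_PARAMETER` macro and `infinicore::Tensor` are framework types; here a plain `std::vector` stands in for the weight tensor, and the class and loader method are illustrative only):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in for infinicore::Tensor in this sketch.
using Tensor = std::vector<float>;

// Hypothetical attention layer: the framework's parameter loader would
// fill kv_cache_k_scale_ from the checkpoint; the getter only exposes it.
class AttentionLayer {
public:
    void load_kv_cache_k_scale(Tensor scale) { kv_cache_k_scale_ = std::move(scale); }

    // Getter pattern discussed above: return the stored weight tensor.
    const Tensor &kv_cache_k_scale() const { return kv_cache_k_scale_; }

private:
    Tensor kv_cache_k_scale_;
};
```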


infinicore::DataType kv_cache_dtype() const;
void set_kv_cache_dtype(infinicore::DataType dtype) const;
bool kv_cache_dtype_is_set() const { return kv_cache_dtype_set_; }
Collaborator: If the kv cache dtype may or may not be passed, I suggest using std::optional; StaticKVCache below could do the same. Then make sure only one of the two sides passes the dtype.

Contributor Author: fixed
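A sketch of the std::optional approach the reviewer suggests (`DataType` is simplified to a local enum standing in for `infinicore::DataType`, and `resolve_kv_cache_dtype` is a hypothetical helper enforcing that at most one side supplies the dtype):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

enum class DataType { F16, Int8 };  // simplified stand-in for infinicore::DataType

// Hypothetical helper: at most one of the two callers may pass a dtype;
// if neither does, fall back to a default (no quantization).
DataType resolve_kv_cache_dtype(std::optional<DataType> from_model,
                                std::optional<DataType> from_cache,
                                DataType fallback = DataType::F16) {
    if (from_model && from_cache) {
        throw std::invalid_argument("kv cache dtype set in both places");
    }
    if (from_model) return *from_model;
    if (from_cache) return *from_cache;
    return fallback;
}
```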

infinicore::quantization::QuantScheme get_quant_scheme() const;
std::shared_ptr<infinicore::nn::RoPE::ScalingConfig> get_rope_scaling() const;
void set_kv_quant_scheme(std::string kv_cache_dtype) {
if (kv_cache_dtype == "int8") {
Collaborator: If it doesn't enter the if, report an error or something.

Contributor Author: The default here is NONE, which means quantization is not used.

}
}
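One way the two comments above could be reconciled, as a sketch (`QuantScheme` and the accepted strings are assumptions; the default is None, meaning no quantization, and an unrecognized string raises instead of being silently ignored):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class QuantScheme { None, KvInt8 };  // simplified stand-in

// Sketch: an empty string keeps the default (no quantization), "int8"
// selects int8 kv cache quantization, and anything else is an error
// rather than silently falling through the if.
QuantScheme parse_kv_quant_scheme(const std::string &kv_cache_dtype) {
    if (kv_cache_dtype.empty()) return QuantScheme::None;
    if (kv_cache_dtype == "int8") return QuantScheme::KvInt8;
    throw std::invalid_argument("unsupported kv_cache_dtype: " + kv_cache_dtype);
}
```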

void set_kv_quant_scheme(std::string kv_cache_dtype) {
Collaborator: I feel this should be swapped with the set function in the model_config file: model_config handles parsing and dispatch, quant_config handles the concrete logic.

Contributor Author: I feel quant_config should fully control everything quantization-related; model_config should at most do a forward.
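A sketch of the division of responsibility the author describes (class names and members are illustrative, not the actual InfiniLM classes): quant_config owns all quantization state and logic, and model_config only forwards.

```cpp
#include <cassert>
#include <string>

// Hypothetical QuantConfig: owns all quantization-related state and logic.
class QuantConfig {
public:
    void set_kv_quant_scheme(const std::string &dtype) { use_int8_kv_ = (dtype == "int8"); }
    bool kv_cache_is_int8() const { return use_int8_kv_; }

private:
    bool use_int8_kv_ = false;
};

// Hypothetical ModelConfig: performs no quantization logic itself,
// it only forwards the setting to the quant config it holds.
class ModelConfig {
public:
    void set_kv_quant_scheme(const std::string &dtype) { quant_.set_kv_quant_scheme(dtype); }
    const QuantConfig &quant() const { return quant_; }

private:
    QuantConfig quant_;
};
```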

INFINICORE_NN_MODULE_INIT(k_norm, head_dim_, model_config_->get<double>("rms_norm_eps"), dtype, device);
}

switch (this->model_config_->get_kv_quant_scheme()) {
Collaborator: Can these changes be pulled out of the attention layer into a separate file? Once there are more algorithms this will be hard to maintain.

Contributor Author: fixed
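One possible shape for the extraction the reviewer asks for, as a sketch (all names hypothetical): the per-scheme quantization kernels live in their own translation unit behind a single dispatch function, so the attention layer only ever calls `quantize_kv` and never switches on the scheme itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

enum class QuantScheme { None, KvInt8 };  // simplified stand-in

// kv_quant.cpp (sketch): per-scheme kernels kept out of the attention layer.
// Quantize by dividing by the scale and rounding to the nearest int8.
static std::vector<int8_t> quantize_int8(const std::vector<float> &x, float scale) {
    std::vector<int8_t> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    }
    return out;
}

// Single dispatch entry point the attention layer would call.
std::vector<int8_t> quantize_kv(QuantScheme scheme, const std::vector<float> &x, float scale) {
    switch (scheme) {
    case QuantScheme::KvInt8:
        return quantize_int8(x, scale);
    case QuantScheme::None:
    default:
        return {};  // no quantization requested
    }
}
```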

def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0):
_infinilm.StaticKVCacheConfig.__init__(self, max_batch_size, max_cache_len)

def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0, kv_cache_dtype: str | None = None):
Collaborator (@PanZezhong1725, Mar 19, 2026): It would be best to additionally provide an interface that accepts the framework's dtype, and to provide the parse mapping in Python, so Python users can see what each string means by reading the Python file.

Contributor Author: (screenshot attached) In vLLM, strings are used everywhere before flash attention, so I think using strings is not a big problem as long as it is documented.

py::arg("max_batch_size") = 1,
py::arg("max_cache_len") = std::numeric_limits<infinicore::Size>::max())
.def(
py::init<infinicore::Size, infinicore::Size, std::string>(),
Collaborator: External interfaces should avoid string input and use the framework's dtype instead; put the parsing work outside.

Contributor Author: The Python layer has no corresponding parse functionality and does not parse model_config either. Since model_config is parsed in C++, this is parsed in C++ as well.
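A sketch of the string-to-dtype parsing the discussion settles on doing in C++ (`DataType` and the accepted strings are assumptions; the real binding would pass the parsed value into `StaticKVCacheConfig`). Parsing once at the binding boundary means the rest of the C++ code only ever sees the framework dtype:

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class DataType { F16, Int8 };  // simplified stand-in for infinicore::DataType

// Sketch: map the Python-facing string to a dtype at the binding boundary.
// An absent string means "no quantization"; an unknown string is an error.
std::optional<DataType> parse_kv_cache_dtype(const std::optional<std::string> &s) {
    if (!s) return std::nullopt;
    if (*s == "int8") return DataType::Int8;
    if (*s == "f16" || *s == "fp16") return DataType::F16;
    throw std::invalid_argument("unsupported kv_cache_dtype string: " + *s);
}
```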

…quant.cpp; (2)update kv_cache_dtype handling; (3)Update Python test scripts
