Conversation
```cpp
std::optional<infinicore::Tensor> block_tables,
std::optional<infinicore::Tensor> slot_mapping) const;
```

```cpp
infinicore::Tensor kv_cache_k_scale() const {
```
Are these two just getters for member variables that you define and then use in place yourself?
```cpp
INFINICORE_NN_PARAMETER(kv_cache_k_scale);
```

This declares a scale parameter; the attention layer reads the weight and stores it in `kv_cache_k_scale_`. The purpose of this function is to retrieve that weight tensor.
csrc/cache/kv_cache.hpp
```cpp
infinicore::DataType kv_cache_dtype() const;
void set_kv_cache_dtype(infinicore::DataType dtype) const;
bool kv_cache_dtype_is_set() const { return kv_cache_dtype_set_; }
```
If the kv cache dtype may or may not be passed, consider using `std::optional`; the `StaticKVCache` below can do the same. Also make sure only one of the two sides passes the dtype.
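A minimal sketch of the `std::optional` approach, using a stand-in `DataType` enum in place of `infinicore::DataType` (the names here are illustrative, not the project's actual API):

```cpp
#include <optional>

// Stand-in for infinicore::DataType; illustrative only.
enum class DataType { F16, BF16, INT8 };

class KVCacheConfig {
public:
    // An empty optional means "not set", so the separate
    // kv_cache_dtype_set_ flag is no longer needed.
    std::optional<DataType> kv_cache_dtype() const { return kv_cache_dtype_; }
    void set_kv_cache_dtype(DataType dtype) { kv_cache_dtype_ = dtype; }
    bool kv_cache_dtype_is_set() const { return kv_cache_dtype_.has_value(); }

private:
    std::optional<DataType> kv_cache_dtype_;
};
```

The optional also makes "only one side passes the dtype" checkable: the receiving side can assert that at most one of the two configs holds a value.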
```cpp
infinicore::quantization::QuantScheme get_quant_scheme() const;
std::shared_ptr<infinicore::nn::RoPE::ScalingConfig> get_rope_scaling() const;
void set_kv_quant_scheme(std::string kv_cache_dtype) {
    if (kv_cache_dtype == "int8") {
        // ...
    }
}
```

The default here should be NONE, meaning quantization is not used.
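A hedged sketch of the suggested default, with a stand-in `QuantScheme` enum in place of `infinicore::quantization::QuantScheme`:

```cpp
#include <string>

// Stand-in for infinicore::quantization::QuantScheme; illustrative only.
enum class QuantScheme { NONE, INT8 };

// Falling through to NONE means "no quantization" whenever
// kv_cache_dtype is empty or unrecognized, as the reviewer suggests.
QuantScheme parse_kv_quant_scheme(const std::string &kv_cache_dtype) {
    if (kv_cache_dtype == "int8") {
        return QuantScheme::INT8;
    }
    return QuantScheme::NONE;
}
```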
```cpp
void set_kv_quant_scheme(std::string kv_cache_dtype) {
```
I think this setter and the one in the model_config file should swap roles: model_config should handle parsing and dispatch, and quant_config should hold the concrete logic.
I think quant_config should fully own everything quantization-related; model_config should at most forward to it.
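One possible shape of that split, sketched with hypothetical class and method names (ModelConfig only parses and forwards; QuantConfig owns the quantization logic):

```cpp
#include <string>

// Stand-in enum; the real type is infinicore::quantization::QuantScheme.
enum class QuantScheme { NONE, INT8 };

class QuantConfig {
public:
    // All quantization-specific interpretation lives here.
    void set_kv_quant_scheme(const std::string &kv_cache_dtype) {
        scheme_ = (kv_cache_dtype == "int8") ? QuantScheme::INT8
                                             : QuantScheme::NONE;
    }
    QuantScheme kv_quant_scheme() const { return scheme_; }

private:
    QuantScheme scheme_ = QuantScheme::NONE;
};

class ModelConfig {
public:
    // ModelConfig just dispatches; it does not interpret the value itself.
    void set_kv_cache_dtype(const std::string &kv_cache_dtype) {
        quant_config_.set_kv_quant_scheme(kv_cache_dtype);
    }
    const QuantConfig &quant_config() const { return quant_config_; }

private:
    QuantConfig quant_config_;
};
```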
```cpp
INFINICORE_NN_MODULE_INIT(k_norm, head_dim_, model_config_->get<double>("rms_norm_eps"), dtype, device);
}
```
```cpp
switch (this->model_config_->get_kv_quant_scheme()) {
```
Can these changes be pulled out of the attention layer into a separate file? Once more algorithms are added, this will be hard to maintain.
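One way to do the extraction, sketched as a free function in its own translation unit. The file name and the int8 placeholder logic are hypothetical; a real int8 path would apply per-channel scales rather than simple truncation. The attention layer would then only call this function instead of owning the switch:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in enum; the real type is infinicore::quantization::QuantScheme.
enum class QuantScheme { NONE, INT8 };

// Hypothetical free function that would live in its own file
// (e.g. a kv-cache quantization .cpp) rather than inside the
// attention layer, so new schemes are added in one place.
std::vector<float> quantize_kv(QuantScheme scheme, const std::vector<float> &kv) {
    switch (scheme) {
    case QuantScheme::NONE:
        return kv; // pass-through: no quantization
    case QuantScheme::INT8: {
        // Placeholder for the real int8 path; truncation toward zero
        // only illustrates that scheme-specific logic is isolated here.
        std::vector<float> out(kv.size());
        for (std::size_t i = 0; i < kv.size(); ++i) {
            out[i] = static_cast<float>(static_cast<int>(kv[i]));
        }
        return out;
    }
    }
    throw std::invalid_argument("unknown quant scheme");
}
```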
```python
def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0):
    _infinilm.StaticKVCacheConfig.__init__(self, max_batch_size, max_cache_len)
```
```python
def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0, kv_cache_dtype: str | None = None):
```
It would be best to additionally provide an interface that accepts the framework's own dtype, plus a parse mapping on the Python side, so Python users can open the Python file and see what each string means.
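The parse mapping could look like the following C++ sketch (a Python-side dict would be the analogous shape). The `DataType` enum and the accepted strings other than `"int8"` are assumptions for illustration, not the project's actual list:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Stand-in for the framework dtype; illustrative only.
enum class DataType { F16, BF16, INT8 };

// A single parse table makes the meaning of each accepted string
// visible in one place, which is the spirit of the suggestion.
std::optional<DataType> parse_kv_cache_dtype(const std::string &name) {
    static const std::unordered_map<std::string, DataType> table = {
        {"float16", DataType::F16},   // assumed spelling
        {"bfloat16", DataType::BF16}, // assumed spelling
        {"int8", DataType::INT8},
    };
    auto it = table.find(name);
    if (it == table.end()) {
        return std::nullopt; // unknown string: let the caller decide
    }
    return it->second;
}
```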
```cpp
py::arg("max_batch_size") = 1,
py::arg("max_cache_len") = std::numeric_limits<infinicore::Size>::max())
.def(
    py::init<infinicore::Size, infinicore::Size, std::string>(),
```
External interfaces should avoid string inputs where possible; use the framework's dtype and do the parsing outside.
The Python layer has no parse functionality and does not parse model_config. Since model_config is parsed in C++, the parsing is done in C++ here as well.
…quant.cpp; (2) update kv_cache_dtype handling; (3) update Python test scripts

Support offline int8 kv cache quantization for static kv cache