Would you consider supporting softplus as an alternative gate activation mode (e.g., via a flag like gate_mode: "lower_bound" | "softplus")?
I also compared fla's chunk_kda against FlashKDA end to end. With the lower_bound gate mode, the model's output becomes repetitive. This happens with both FlashKDA and fla's chunk_kda in lower_bound mode, whereas fla's chunk_kda with the softplus gate mode works fine.
The model used is Kimi Linear (https://modelscope.cn/models/moonshotai/Kimi-Linear-48B-A3B-Instruct).
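
To make the request concrete, here is a rough sketch of the kind of switch I have in mind. Everything in it is an assumption for illustration: the function name log_decay, the parameters a_log, dt_bias and lower_bound, the default of -5.0, and both formulas are my guess at what the two modes compute, not code taken from fla or FlashKDA.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def log_decay(raw: float, a_log: float, dt_bias: float = 0.0,
              gate_mode: str = "softplus", lower_bound: float = -5.0) -> float:
    """Hypothetical gate dispatcher returning a log-space decay (always <= 0)."""
    x = raw + dt_bias
    if gate_mode == "softplus":
        # Assumed form: unbounded below, so the decay can get arbitrarily strong.
        return -math.exp(a_log) * softplus(x)
    if gate_mode == "lower_bound":
        # Assumed form: squashed into the range [lower_bound, 0].
        return lower_bound * sigmoid(math.exp(a_log) * x)
    raise ValueError(f"unknown gate_mode: {gate_mode!r}")


if __name__ == "__main__":
    for mode in ("softplus", "lower_bound"):
        print(mode, [round(log_decay(r, 0.0, gate_mode=mode), 4) for r in (-4.0, 0.0, 4.0)])
```

Under these assumed formulas the two modes map the same raw gate value to different decay ranges, so a checkpoint trained with one mode may not behave the same under the other. A flag would let users pick the mode that matches their checkpoint.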