gate mode diff

Would you consider supporting softplus as an alternative gate activation mode (e.g., via a flag like `gate_mode: "lower_bound" | "softplus"`)?

I also compared fla's chunk_kda with FlashKDA end to end. The lower_bound gate mode causes repetition in the model's output. Both fla's chunk_kda with the lower_bound gate mode and FlashKDA show the same problem, but fla's chunk_kda with the softplus gate mode works fine.

The model used is Kimi Linear (https://modelscope.cn/models/moonshotai/Kimi-Linear-48B-A3B-Instruct).
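
To put a number on the repetition instead of judging it by eye, something like the following could be run on the generated text from each configuration. This is only a minimal sketch: `repeated_ngram_fraction` is a hypothetical helper (not part of fla or FlashKDA), and splitting on whitespace instead of using the model's tokenizer is an assumption made to keep it self-contained.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption, not what the model's tokenizer would produce.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens has no n-grams, so nothing can repeat.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

A value near 0 means little repetition and a value near 1 means the output is mostly looping. Comparing this score for the same prompts under the lower_bound and softplus gate modes would make the difference easier to report.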

gate mode diff #3
