@zhangyue207 zhangyue207 commented Dec 22, 2025

[image]

@zhangyue207 zhangyue207 linked an issue Dec 22, 2025 that may be closed by this pull request
@zhangyue207 zhangyue207 changed the title from "issue/826: kunlun layernorm" to "kunlun layernorm" Dec 22, 2025
__global__ void layerNormKernel(
int32_t loop_idx,
Tdata *output, // [b, seq, dim]
Tdata *output_standardization, // [b, seq, dim]
Collaborator

Would it be better to delete these `[b, seq, dim]` comments? They give the false impression that the kernel can only handle 3D tensors.

Collaborator Author

DONE

int32_t offset_output_standardization = 0;
int32_t offset_output_rstd_deviation = 0;
int32_t offset_input = 0;
for (int i = 0; i < ndim - 1; i++) {
Collaborator

This loop is similar to the loop above that computes `t_coords[]`. Can the two be merged?

Collaborator Author

DONE

int32_t offset_output_rstd_deviation = 0;
int32_t offset_input = 0;
for (int i = 0; i < ndim - 1; i++) {
int32_t dim_i = shape_local[i].value;
Collaborator

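Can this loop and the loop above that computes `t_coords[]` be merged into a single loop? For example, instead of allocating separate memory to store `t_coords`, use register variables directly and obtain the input and output indices in one pass. Would that make the code cleaner and faster, or is there a specific reason for the current approach?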

Collaborator Author

DONE
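
For illustration, the single-pass indexing discussed in this thread could look roughly like the sketch below. It is not the code from this PR: `RowOffsets`, `rowOffsets`, and the stride arrays are hypothetical names, and the sketch assumes the last dimension is the normalized one and that each tensor carries its own element strides. Each coordinate is kept in a local scalar and consumed immediately, so no `t_coords[]` buffer is needed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from the PR: element offsets of one
// "row" (all leading dimensions fixed) in the input and output tensors.
struct RowOffsets {
    int64_t input;
    int64_t output;
};

// Maps a flat row index over the leading ndim - 1 dimensions to the
// input and output offsets in a single loop.
inline RowOffsets rowOffsets(int64_t row,
                             const int32_t *shape,
                             const int64_t *input_strides,
                             const int64_t *output_strides,
                             int ndim) {
    assert(ndim >= 1);
    RowOffsets off{0, 0};
    // Walk the leading dimensions from innermost to outermost; the last
    // dimension (the normalized one) is skipped.
    for (int i = ndim - 2; i >= 0; --i) {
        int64_t coord = row % shape[i]; // local scalar, no coords buffer
        row /= shape[i];
        off.input += coord * input_strides[i];
        off.output += coord * output_strides[i];
    }
    return off;
}
```

Because the loop only depends on `ndim`, the same helper covers 2D, 3D, or higher-rank inputs without any shape-specific code.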

? 255
: static_cast<int32_t>(info.othersize);
int32_t num_loops = (static_cast<int32_t>(info.othersize) + num_blocks - 1) / num_blocks;
for (int32_t i = 0; i < num_loops; i++) {
Collaborator

Can this work distribution be moved into the kernel? Looping on the host like this will probably add a lot of kernel launch overhead.

Collaborator Author

DONE
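
As a rough sketch of the idea in this thread, the host-side `num_loops` loop can be replaced by a strided loop inside the kernel body, so that one launch covers all `othersize` rows. The code below is a host-runnable emulation, not the PR's kernel: `processRow`, `layerNormBlock`, and `launchOnce` are hypothetical names, and `block_id` is passed as a parameter here, whereas a device kernel would read it from the platform's built-in block or cluster index.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-row LayerNorm work; it only records
// which row was visited.
inline void processRow(int32_t row, std::vector<int32_t> &visited) {
    visited.push_back(row);
}

// Body that one block would run. Block b handles rows b, b + num_blocks,
// b + 2 * num_blocks, ... which are the same rows the host loop would
// have given it across successive launches.
inline void layerNormBlock(int32_t block_id, int32_t num_blocks,
                           int32_t othersize,
                           std::vector<int32_t> &visited) {
    assert(num_blocks > 0);
    for (int32_t row = block_id; row < othersize; row += num_blocks) {
        processRow(row, visited);
    }
}

// Host-side emulation of a single launch with num_blocks blocks.
inline std::vector<int32_t> launchOnce(int32_t num_blocks,
                                       int32_t othersize) {
    std::vector<int32_t> visited;
    for (int32_t b = 0; b < num_blocks; ++b) {
        layerNormBlock(b, num_blocks, othersize, visited);
    }
    return visited;
}
```

With this layout the bounds check `row < othersize` also handles the case where `othersize` is not a multiple of `num_blocks`, so the host no longer needs to compute `num_loops` or pass `loop_idx`.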

@zhangyue207 zhangyue207 changed the title from "kunlun layernorm" to "issue/826: kunlun layernorm" Dec 23, 2025
Development

Successfully merging this pull request may close these issues.

[DEV] Kunlunxin LayerNorm

3 participants