issue/826: kunlun layernorm #827
base: main
__global__ void layerNormKernel(
    int32_t loop_idx,
    Tdata *output, // [b, seq, dim]
    Tdata *output_standardization, // [b, seq, dim]
Would it be better to remove these `[b, seq, dim]` comments? They give the false impression that the kernel can only handle 3D tensors.
DONE
int32_t offset_output_standardization = 0;
int32_t offset_output_rstd_deviation = 0;
int32_t offset_input = 0;
for (int i = 0; i < ndim - 1; i++) {
This loop is similar to the loop above that computes `t_coords[]`. Can the two be merged?
DONE
int32_t offset_output_rstd_deviation = 0;
int32_t offset_input = 0;
for (int i = 0; i < ndim - 1; i++) {
    int32_t dim_i = shape_local[i].value;
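Can this loop and the loop above that computes `t_coords[]` be merged into a single loop? For example, instead of allocating separate memory to store `t_coords`, use register variables directly and obtain the input and output indices in one pass. Would that make the code cleaner and faster, or is there a particular reason for the current design?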
DONE
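For reference, a minimal host-side sketch of the single-loop idea raised in this thread. It is not the code from this PR: the `RowOffsets` struct, the `rowOffsets` helper, and the plain `int32_t` shape and stride arrays are hypothetical names chosen for illustration, and it assumes the last dimension is the normalized one. Each coordinate is derived and consumed in the same iteration, so no `t_coords[]` buffer is needed.

```cpp
#include <cstdint>

// Element offsets of one row in the input and output tensors.
// (Hypothetical helper type, not taken from the PR.)
struct RowOffsets {
    int32_t input;
    int32_t output;
};

// Decompose a flat row index over the leading ndim - 1 dimensions and
// accumulate both strided offsets in the same pass. The last dimension is
// assumed to be the normalized one, so it is skipped here.
inline RowOffsets rowOffsets(int32_t row_idx,
                             const int32_t *shape,
                             const int32_t *input_strides,
                             const int32_t *output_strides,
                             int32_t ndim) {
    RowOffsets off{0, 0};
    int32_t remaining = row_idx;
    for (int32_t i = ndim - 2; i >= 0; --i) {
        // The coordinate is a loop-local value; no t_coords[] array is stored.
        int32_t coord = remaining % shape[i];
        remaining /= shape[i];
        off.input += coord * input_strides[i];
        off.output += coord * output_strides[i];
    }
    return off;
}
```

With shape `{2, 3, 4}`, a contiguous input, and an output padded to strides `{24, 8, 1}`, row 4 maps to coordinates `(1, 1)` in the leading dimensions, so the two offsets differ even though they come out of the same loop.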
    ? 255
    : static_cast<int32_t>(info.othersize);
int32_t num_loops = (static_cast<int32_t>(info.othersize) + num_blocks - 1) / num_blocks;
for (int32_t i = 0; i < num_loops; i++) {
Can this work distribution be moved inside the kernel? Writing the loop this way will probably add a lot of kernel launch overhead.
DONE
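For reference, a host-side C++ simulation of what moving the work distribution into the kernel could look like. It is only a sketch under stated assumptions, not the change made in this PR: `processRowsForBlock` and `simulateSingleLaunch` are hypothetical names, the block index is passed as an ordinary argument instead of being read from the device, and the per-row work is replaced by a counter. The point is that a single launch suffices when each block strides over the rows by `num_blocks`.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the body of one block: block `block_id` handles rows
// block_id, block_id + num_blocks, block_id + 2 * num_blocks, ...
// In a real kernel the block index would come from the device runtime.
inline void processRowsForBlock(int32_t block_id, int32_t num_blocks,
                                int32_t othersize, std::vector<int32_t> &visits) {
    for (int32_t row = block_id; row < othersize; row += num_blocks) {
        visits[row] += 1; // placeholder for normalizing one row
    }
}

// Emulates a single launch with at most `max_blocks` blocks and returns how
// many times each row was processed.
inline std::vector<int32_t> simulateSingleLaunch(int32_t othersize, int32_t max_blocks) {
    int32_t num_blocks = othersize > max_blocks ? max_blocks : othersize;
    std::vector<int32_t> visits(othersize, 0);
    for (int32_t b = 0; b < num_blocks; ++b) {
        processRowsForBlock(b, num_blocks, othersize, visits);
    }
    return visits;
}
```

Calling `simulateSingleLaunch(1000, 255)` visits each of the 1000 rows exactly once, which is the same coverage the host-side `num_loops` loop provides, but without one launch per iteration.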