[Common/PyTorch/JAX] make offset of ClampedSwiGLU configurable #2938
base: main
Changes from all commits
```diff
@@ -336,10 +336,11 @@ void nvte_swiglu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
  * It computes Act(input[N, :H]) x input[N, H:]
  * \param[in] limit Clipping limits for gate and pre-activation.
  * \param[in] alpha Scaling factor for the sigmoid function used in the activation.
+ * \param[in] glu_linear_offset Offset added to the linear component after clamping (default 1.0).
  * \param[in] stream CUDA stream used for the operation.
  */
 void nvte_clamped_swiglu(const NVTETensor input, NVTETensor output, float limit, float alpha,
-                         cudaStream_t stream);
+                         float glu_linear_offset, cudaStream_t stream);

 /*! \brief Computes the gated ReLU activation of the input.
  * If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING,
```
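For orientation, the effect of the new parameter can be sketched as a plain reference implementation. This is an illustration only, not the library's CUDA kernel: it assumes the gpt-oss-style ClampedSwiGLU formulation (gate clamped from above by `limit`, linear half clamped to `[-limit, limit]`), and the helper name `clamped_swiglu_reference` is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical reference helper, not part of the Transformer Engine API.
// input is [N, 2H] with the activated (gate) half in columns [:H] and the
// linear half in [H:]; output is [N, H]. The formulation is an assumption
// based on the gpt-oss-style ClampedSwiGLU; the library kernel may differ.
void clamped_swiglu_reference(const float* input, float* output, std::size_t N, std::size_t H,
                              float limit, float alpha, float glu_linear_offset) {
  for (std::size_t n = 0; n < N; ++n) {
    const float* row = input + n * 2 * H;
    float* out = output + n * H;
    for (std::size_t h = 0; h < H; ++h) {
      // Clamp the gate from above and the linear term symmetrically.
      const float gate = std::min(row[h], limit);
      const float lin = std::clamp(row[H + h], -limit, limit);
      // gate * sigmoid(alpha * gate): a scaled SiLU on the clamped gate.
      const float act = gate / (1.0f + std::exp(-alpha * gate));
      // The +1.0 offset used to be hard-coded; this PR makes it configurable.
      out[h] = act * (lin + glu_linear_offset);
    }
  }
}
```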
```diff
@@ -413,10 +414,11 @@ void nvte_dswiglu(const NVTETensor grad, const NVTETensor input, NVTETensor outp
  * \param[in,out] output Outgoing gradient of shape [N, H * 2].
  * \param[in] limit Clipping limits for gate and pre-activation.
  * \param[in] alpha Scaling factor for the sigmoid function used in the activation.
+ * \param[in] glu_linear_offset Offset added to the linear component after clamping (default 1.0).
  * \param[in] stream CUDA stream used for the operation.
  */
 void nvte_clamped_dswiglu(const NVTETensor grad, const NVTETensor input, NVTETensor output,
-                          float limit, float alpha, cudaStream_t stream);
+                          float limit, float alpha, float glu_linear_offset, cudaStream_t stream);

 /*! \brief Computes the gated ReLU activation gradient.
  * If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING,
```
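With this change, existing callers need one extra argument at each call site. A sketch of updated calls, using placeholder limit/alpha values and tensor handles assumed to exist in the caller; passing `1.0f` reproduces the previous hard-coded offset:

```cpp
// Placeholder values and handles; 1.0f keeps the behaviour of the old hard-coded offset.
nvte_clamped_swiglu(input, output, /*limit=*/7.0f, /*alpha=*/1.702f,
                    /*glu_linear_offset=*/1.0f, stream);
nvte_clamped_dswiglu(grad, input, output, /*limit=*/7.0f, /*alpha=*/1.702f,
                     /*glu_linear_offset=*/1.0f, stream);
```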
Contributor comment on lines +339 to 341:

Can we define new APIs named nvte_clamped_swiglu_v2 and nvte_clamped_dswiglu_v2 and deprecate this API here, so as not to break backward compatibility?
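One way the suggestion could be shaped, sketched under the assumption that the old three-float signature stays as a thin forwarding wrapper; none of this is the merged API:

```cpp
// Hypothetical sketch of the reviewer's suggestion, not the merged API.
// New entry point exposing the configurable offset.
void nvte_clamped_swiglu_v2(const NVTETensor input, NVTETensor output, float limit, float alpha,
                            float glu_linear_offset, cudaStream_t stream);

// Old entry point kept (and marked deprecated) so existing callers keep compiling;
// it forwards the previously hard-coded offset of 1.0.
void nvte_clamped_swiglu(const NVTETensor input, NVTETensor output, float limit, float alpha,
                         cudaStream_t stream) {
  nvte_clamped_swiglu_v2(input, output, limit, alpha, /*glu_linear_offset=*/1.0f, stream);
}

// nvte_clamped_dswiglu / nvte_clamped_dswiglu_v2 would follow the same pattern.
```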