feat(compilation): Extend sequence parallelism to activations#1523
feat(compilation): Extend sequence parallelism to activations#1523AlpinDale wants to merge 1 commit into
Conversation
Adds a new pattern to the sequence parallelism pass to support activations like SiLU and GELU. This transforms "AllReduce -> Activation" into "ReduceScatter -> Activation -> AllGather", enabling further fusion opportunities.
There was a problem hiding this comment.
Summary of Changes
Hello @AlpinDale, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the compilation process by extending sequence parallelism to include activation functions. By introducing a specialized pattern for operations like SiLU and GELU, it reorders the communication and computation steps from a simple AllReduce followed by activation to a more optimized ReduceScatter, Activation, and then AllGather sequence. This strategic restructuring is designed to create new opportunities for operation fusion, ultimately leading to more efficient model execution.
Highlights
- New Optimization Pattern: Introduced a new
AllReduceActivationPatternclass to extend sequence parallelism optimizations to activation functions. - Transformation Logic: Implemented a transformation that converts the
AllReduce -> Activationsequence intoReduceScatter -> Activation -> AllGather. - Activation Support: Enabled this new optimization pattern for common activation functions, specifically SiLU and GELU, within the
SequenceParallelismPass. - Performance Enhancement: The change aims to unlock further fusion opportunities and improve compilation efficiency for models utilizing these activations.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request introduces a new pattern to the sequence parallelism pass to handle activations like SiLU and GELU, transforming AllReduce -> Activation into ReduceScatter -> Activation -> AllGather. The implementation is solid and follows the existing structure of the file.
I have two suggestions to improve maintainability:
- Refactor the hardcoded list of activation functions into a module-level constant for better readability.
- A suggestion for a future refactoring of the helper class hierarchy to better separate concerns, as the new pattern is forced to handle a parameter (
epsilon) that is irrelevant to it.
Overall, this is a great addition that enables more fusion opportunities.
| device: str, | ||
| **kwargs): | ||
| # activation_op is e.g. torch.ops.aten.silu.default | ||
| super().__init__(epsilon=0.0, dtype=dtype, device=device, **kwargs) |
There was a problem hiding this comment.
Passing epsilon=0.0 here is functionally correct, but it indicates a minor issue with the class hierarchy. AllReduceActivationPattern does not use epsilon, but it's forced to provide it because its parent class, _SequenceParallelPatternHelper, is tightly coupled with RMSNorm-specific parameters.
For better long-term maintainability, consider refactoring the helper classes. A new base class could be introduced for sequence parallelism communication helpers, which AllReduceActivationPattern could inherit from directly. The existing _SequenceParallelPatternHelper could then combine both communication and RMSNorm functionalities.
This is not a blocker for the current change but a suggestion for future code health.
Adds a new pattern to the sequence parallelism pass to support activations like SiLU and GELU.
This transforms "AllReduce -> Activation" into "ReduceScatter -> Activation -> AllGather", enabling further fusion opportunities.