[ET-VK][conv2d_dw] Extract depthwise dispatch into Conv2dDW.cpp with device-based tile selection #18293
Merged commit 79ceefb into gh/SS-JIA/493/base
Profiling showed that depthwise conv2d is 5-15x slower on Mali GPUs than on
Adreno because of register pressure from the 4x2 output tile (17 vec4 registers
per thread). Benchmarking confirmed that reducing the tile to 1x1 (7 vec4
registers per thread) gives a 4-15x speedup on Mali with no regression on Adreno.

This change extracts the depthwise conv2d dispatch logic from Convolution.cpp
into a new Conv2dDW.cpp (following the Conv2dPW.cpp pattern) and adds
device-based tile size selection: b1x1 on Mali, b4x2 (the current default) on
Adreno.
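
For reviewers who want the selection rule at a glance, here is a minimal, standalone sketch of the idea. It is not the code in Conv2dDW.cpp: `GpuVendor`, `TileSize`, `select_dw_tile`, `tile_suffix`, and `invocations_for_plane` are hypothetical names made up for illustration, and falling back to b4x2 on devices other than Mali and Adreno is an assumption based on b4x2 being the current default.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical vendor tag; a real backend would derive this from the
// Vulkan physical device rather than take it as a plain enum.
enum class GpuVendor { Mali, Adreno, Other };

// Output tile computed by a single shader invocation (width x height).
struct TileSize {
  uint32_t w;
  uint32_t h;
};

// Mali gets the 1x1 tile to reduce register pressure. Every other device
// keeps the 4x2 tile (the fallback for Other is an assumption).
inline TileSize select_dw_tile(GpuVendor vendor) {
  if (vendor == GpuVendor::Mali) {
    return TileSize{1, 1};
  }
  return TileSize{4, 2};
}

// Builds a variant suffix in the "b<W>x<H>" form used in the description.
inline std::string tile_suffix(TileSize t) {
  return "b" + std::to_string(t.w) + "x" + std::to_string(t.h);
}

// Integer division rounded up; d must be positive.
inline uint32_t div_up(uint32_t n, uint32_t d) {
  assert(d > 0);
  return (n + d - 1) / d;
}

// Invocations needed to cover one out_w x out_h output plane with tile t.
inline uint32_t invocations_for_plane(uint32_t out_w, uint32_t out_h,
                                      TileSize t) {
  return div_up(out_w, t.w) * div_up(out_h, t.h);
}
```

The trade-off is visible in the invocation count: for a 10x5 output plane, the 4x2 tile needs 3 * 3 = 9 invocations while the 1x1 tile needs 50, each of which holds far fewer live registers.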
Differential Revision: [D97058158](https://our.internmc.facebook.com/intern/diff/D97058158/)