[ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing #18292
f5ae537 into gh/SS-JIA/492/base
Profiling EdgeTAM on Adreno shows pointwise 1×1 convolutions are a dominant
bottleneck. This diff re-implements the stride=1, padding=0 pointwise path
using the same tiled matmul approach as the recently landed linear shader
rewrite.
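
For readers unfamiliar with why a linear (matmul) shader design carries over, here is a minimal CPU-side sketch, not the shader from this diff. With stride=1 and padding=0, a 1×1 convolution is a matrix multiply of an `[H*W, IC]` input against an `[OC, IC]` weight, so it can be computed one output tile at a time. The function names and the 4×4 tile sizes are assumptions chosen for illustration; the PR does not state the tile dimensions.

```python
def pointwise_conv_reference(x, w):
    """Direct 1x1 convolution. x is [IC][H][W], w is [OC][IC]; returns [OC][H][W]."""
    ic, h, wd = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(w[o][k] * x[k][y][xx] for k in range(ic)) for xx in range(wd)]
         for y in range(h)]
        for o in range(len(w))
    ]


def pointwise_conv_tiled(x, w, tile_m=4, tile_n=4):
    """Same result, computed as a tiled matmul over flattened spatial positions.

    tile_m (spatial positions per tile) and tile_n (output channels per tile)
    are hypothetical values, not taken from the PR.
    """
    ic, h, wd = len(x), len(x[0]), len(x[0][0])
    oc = len(w)
    m = h * wd  # number of flattened spatial positions (matmul rows)
    out = [[[0] * wd for _ in range(h)] for _ in range(oc)]
    for m0 in range(0, m, tile_m):
        rows = range(m0, min(m0 + tile_m, m))
        for n0 in range(0, oc, tile_n):
            cols = range(n0, min(n0 + tile_n, oc))
            # Accumulator for one output tile, as one shader invocation would hold.
            acc = [[0] * len(cols) for _ in rows]
            for k in range(ic):
                for r, s in enumerate(rows):
                    a = x[k][s // wd][s % wd]  # flat index s -> (y, x)
                    for c, o in enumerate(cols):
                        acc[r][c] += a * w[o][k]
            # Store the finished tile back to [OC][H][W].
            for r, s in enumerate(rows):
                for c, o in enumerate(cols):
                    out[o][s // wd][s % wd] = acc[r][c]
    return out
```

Each `(m0, n0)` pair corresponds to the work a single invocation would do in a tiled scheme; the ragged-edge handling via `min(...)` stands in for the bounds checks a shader would need.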

The new `conv2d_pw_tiled` shader reuses the shared linear tiled infrastructure
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight, packed
weight tile loading) with custom input/output tile load/store functions that
map flat spatial indices to channels-packed texture3d coordinates.
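
The index mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the shader's actual load/store code: it assumes batch size 1 and a channels-packed layout in which texture x is width, y is height, z is the channel block, and each texel holds 4 consecutive channels. The helper name `flat_to_texel` is hypothetical.

```python
def flat_to_texel(s, c, width):
    """Map a flat spatial index s and channel c to (x, y, z, component).

    Assumes batch size 1 and 4 consecutive channels per texel (assumption).
    """
    x = s % width    # column within the row
    y = s // width   # row
    z = c // 4       # which 4-channel block (texture depth slice)
    comp = c % 4     # which of the 4 texel components
    return x, y, z, comp


def texel_to_flat(x, y, z, comp, width):
    """Inverse mapping, useful for checking the round trip."""
    return y * width + x, z * 4 + comp
```

Under this layout, a tile of consecutive flat indices may wrap from the end of one row to the start of the next, which is why the load/store functions cannot simply reuse the 2D addressing of the linear shader.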

Weight packing uses the same 4OC×4IC blocked format as linear via the
`pack_fp_linear_weight` shader. Dispatch uses DynamicDispatchNode for correct
workgroup size updates during graph resizing.
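
As a rough picture of what a 4OC×4IC blocked format means, the sketch below packs an `[OC][IC]` weight into contiguous 4×4 blocks, zero-padding both dimensions to a multiple of 4. The ordering of blocks and the ordering of values inside a block are assumptions made for illustration; the actual layout produced by `pack_fp_linear_weight` may differ. Both function names are hypothetical.

```python
def pack_weight_4x4(w):
    """Pack w ([OC][IC]) into a flat list of 4x4 blocks (assumed ordering).

    Blocks are ordered by (oc_block, ic_block); within a block, values are
    row-major with the output channel as the row. Out-of-range entries are 0.
    """
    oc, ic = len(w), len(w[0])
    oc_blocks = (oc + 3) // 4
    ic_blocks = (ic + 3) // 4
    packed = []
    for ob in range(oc_blocks):
        for ib in range(ic_blocks):
            for r in range(4):
                for c in range(4):
                    o, i = ob * 4 + r, ib * 4 + c
                    packed.append(w[o][i] if o < oc and i < ic else 0)
    return packed


def packed_index(o, i, ic):
    """Position of weight (o, i) in the packed list under the same ordering."""
    ic_blocks = (ic + 3) // 4
    return ((o // 4) * ic_blocks + i // 4) * 16 + (o % 4) * 4 + (i % 4)
```

The point of such a layout is that the 16 weights needed to update a 4-output-channel accumulator from 4 input channels sit next to each other, so a weight tile can be fetched with a small number of contiguous loads.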
Only the stride=1, padding=0 pointwise path is changed; the general conv2d_pw
shader for arbitrary stride/padding is left unchanged.
EdgeTAM first frame on Samsung S25 (Adreno 830): 208 ms → 196 ms (~6%).
Authored with Claude.
Differential Revision: [D96756792](https://our.internmc.facebook.com/intern/diff/D96756792/)