[ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing #18292

Merged
meta-codesync[bot] merged 2 commits into gh/SS-JIA/492/base from gh/SS-JIA/492/head on Mar 18, 2026

@SS-JIA (Contributor) commented Mar 18, 2026

Stack from ghstack (oldest at bottom):

Profiling EdgeTAM on Adreno shows pointwise 1×1 convolutions are a dominant
bottleneck. This diff re-implements the stride=1, padding=0 pointwise path
using the same tiled matmul approach as the recently landed linear shader
rewrite.
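
To illustrate why a pointwise convolution can share a matmul kernel: with stride=1 and padding=0, each output pixel depends only on the input pixel at the same position, so the H×W positions can be flattened into the rows of a matrix product. The sketch below is a plain-Python reference, not the shader itself; the function names and the 4×4 tile sizes are assumptions chosen for illustration.

```python
def pointwise_conv_reference(x, w):
    """Direct 1x1 convolution: x[ic][h][col], w[oc][ic] -> out[oc][h][col]."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    c_out = len(w)
    return [[[sum(w[oc][ic] * x[ic][h][col] for ic in range(c_in))
              for col in range(width)]
             for h in range(height)]
            for oc in range(c_out)]


def pointwise_conv_tiled(x, w, tile_m=4, tile_n=4):
    """Same result, computed as a tiled matmul over flattened positions.

    tile_m and tile_n are hypothetical tile sizes (spatial positions and
    output channels per tile); the real shader's tiling may differ.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    c_out = len(w)
    m_total = height * width  # one matmul row per spatial position
    out = [[[0.0] * width for _ in range(height)] for _ in range(c_out)]
    for m0 in range(0, m_total, tile_m):
        for n0 in range(0, c_out, tile_n):
            # Each (m0, n0) pair is one output tile, i.e. the unit of work
            # that a single invocation would own in a tiled kernel.
            for m in range(m0, min(m0 + tile_m, m_total)):
                h, col = divmod(m, width)
                for n in range(n0, min(n0 + tile_n, c_out)):
                    acc = 0.0
                    for k in range(c_in):
                        acc += x[k][h][col] * w[n][k]
                    out[n][h][col] = acc
    return out
```

Both functions return the same tensor; the tiled version only changes the order in which outputs are produced.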

The new conv2d_pw_tiled shader reuses the shared linear tiled infrastructure
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight, packed
weight tile loading) with custom input/output tile load/store functions that
map flat spatial indices to channels-packed texture3d coordinates.
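
As a rough illustration of the index mapping described above, the following sketch converts a flat spatial index and a channel into a texel coordinate. It assumes batch size 1 and a channels-packed layout in which texel (x, y, z) holds channels 4z to 4z+3 at column x and row y; the function name `flat_to_texel` is hypothetical and the actual shader helpers may differ.

```python
def flat_to_texel(flat_idx, width, channel):
    """Map (flat spatial index, channel) to ((x, y, z), component).

    Assumes batch size 1 and four channels packed per texel along z.
    """
    y, x = divmod(flat_idx, width)   # row and column of the pixel
    z, comp = divmod(channel, 4)     # texel depth slice and vec4 component
    return (x, y, z), comp
```

For example, with a width of 5, flat index 7 is row 1, column 2, and channel 9 is component 1 of depth slice 2.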

Weight packing uses the same 4OC×4IC blocked format as linear via the
pack_fp_linear_weight shader. Dispatch uses DynamicDispatchNode for correct
workgroup size updates during graph resizing.
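
A minimal sketch of what a 4OC×4IC blocked weight layout can look like is given below. The block ordering (output-channel blocks outermost, row-major within each block) and zero padding of partial blocks are assumptions for illustration; the layout actually produced by `pack_fp_linear_weight` is not spelled out here and may differ. The helper names are hypothetical.

```python
def pack_weight_4x4(w):
    """Pack w[oc][ic] into contiguous 4x4 blocks, zero-padding the edges."""
    n_oc, n_ic = len(w), len(w[0])
    oc_blocks = (n_oc + 3) // 4
    ic_blocks = (n_ic + 3) // 4
    packed = []
    for ob in range(oc_blocks):
        for ib in range(ic_blocks):
            for i in range(4):       # output channel within the block
                for j in range(4):   # input channel within the block
                    oc, ic = 4 * ob + i, 4 * ib + j
                    in_bounds = oc < n_oc and ic < n_ic
                    packed.append(w[oc][ic] if in_bounds else 0.0)
    return packed, oc_blocks, ic_blocks


def packed_at(packed, ic_blocks, oc, ic):
    """Read back element (oc, ic) from the blocked buffer."""
    block = (oc // 4) * ic_blocks + (ic // 4)
    return packed[block * 16 + (oc % 4) * 4 + (ic % 4)]
```

Storing each 4×4 block contiguously means one weight tile can be fetched with a small number of adjacent loads, which is the usual motivation for blocked packing.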

Only the stride=1, padding=0 pointwise path is changed; the general conv2d_pw
shader for arbitrary stride/padding is left unchanged.
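
The path selection described above can be sketched as follows. Only the shader names `conv2d_pw_tiled` and `conv2d_pw` come from this PR; the function `select_pointwise_shader` and its argument shapes are hypothetical, and the real selection logic may check additional conditions.

```python
def select_pointwise_shader(stride, padding):
    """Pick a shader name for a 1x1 convolution (illustrative only)."""
    if tuple(stride) == (1, 1) and tuple(padding) == (0, 0):
        return "conv2d_pw_tiled"  # new tiled path from this PR
    return "conv2d_pw"            # general path, unchanged
```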

EdgeTAM first frame on Samsung S25 (Adreno 830): 208 ms → 196 ms (~6%).
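
For reference, the quoted percentage follows directly from the two latencies; the variable names below are only for illustration.

```python
before_ms, after_ms = 208.0, 196.0

saved_ms = before_ms - after_ms                 # absolute latency saved
reduction_pct = 100.0 * saved_ms / before_ms    # relative latency reduction
speedup = before_ms / after_ms                  # equivalent speedup factor

print(f"{saved_ms:.0f} ms saved, {reduction_pct:.1f}% lower latency, {speedup:.3f}x speedup")
```

This gives 12 ms saved, about a 5.8% latency reduction, which rounds to the ~6% stated above.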

Authored with Claude.

Differential Revision: D96756792

pytorch-bot bot commented Mar 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18292

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit db62c0f with merge base ed57040 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync bot merged commit f5ae537 into gh/SS-JIA/492/base Mar 18, 2026
132 of 139 checks passed
@meta-codesync meta-codesync bot deleted the gh/SS-JIA/492/head branch March 18, 2026 18:42
@meta-codesync meta-codesync bot temporarily deployed to cherry-pick-bot March 18, 2026 18:42 Inactive
SS-JIA pushed a commit that referenced this pull request Mar 18, 2026
…blocked weight packing

Pull Request resolved: #18292

ghstack-source-id: 353941147
@exported-using-ghexport

Differential Revision: [D96756792](https://our.internmc.facebook.com/intern/diff/D96756792/)
Labels: CLA Signed, fb-exported, meta-exported
