[ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing by SS-JIA · Pull Request #18292 · pytorch/executorch

SS-JIA · 2026-03-18T14:24:37Z

Stack from ghstack (oldest at bottom):

[ET-VK][conv2d_dw] Extract depthwise dispatch into Conv2dDW.cpp with device-based tile selection #18293
-> [ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing #18292
[ET-VK] Fix staging buffer allocation to check all memory types for HOST_CACHED #18291

Profiling EdgeTAM on Adreno shows pointwise 1×1 convolutions are a dominant
bottleneck. This diff re-implements the stride=1, padding=0 pointwise path
using the same tiled matmul approach as the recently landed linear shader
rewrite.
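
To illustrate why a linear-style tiled matmul applies here, the sketch below shows that a 1×1 convolution with stride 1 and no padding is a matrix multiply over flattened spatial positions. It is a plain-Python reference, not the shader code; the function name `pointwise_conv2d` and the nested-list layouts are assumptions made for illustration.

```python
def pointwise_conv2d(x, w):
    """Reference 1x1 conv (stride=1, padding=0) written as a matmul.

    x: input as nested lists [C_in][H][W]
    w: weight as nested lists [C_out][C_in]
    Returns output as nested lists [C_out][H][W].
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out = len(w)

    # Flatten the spatial dims: each pixel becomes one row of an
    # (H*W) x C_in matrix.
    rows = [[x[ic][s // wd][s % wd] for ic in range(c_in)]
            for s in range(h * wd)]

    # (H*W x C_in) @ (C_in x C_out): one dot product per pixel and
    # output channel.
    out_rows = [[sum(r[ic] * w[oc][ic] for ic in range(c_in))
                 for oc in range(c_out)]
                for r in rows]

    # Reshape the (H*W) x C_out result back to [C_out][H][W].
    return [[[out_rows[y * wd + col][oc] for col in range(wd)]
             for y in range(h)]
            for oc in range(c_out)]
```

Because no pixel's output depends on a neighbouring pixel, the spatial positions act like the batch rows of a linear layer.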

The new conv2d_pw_tiled shader reuses the shared linear tiled infrastructure
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight, packed
weight tile loading) with custom input/output tile load/store functions that
map flat spatial indices to channels-packed texture3d coordinates.
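
As a rough sketch of the index mapping described above, the helper below converts a flat spatial index and a channel into a texel coordinate. It assumes a batch size of 1 and a channels-packed layout in which each texel holds 4 consecutive channels, with x along width, y along height, and z along channel groups. The name `flat_to_texel` is hypothetical and is not taken from the shader source.

```python
def flat_to_texel(spatial_idx, channel, width):
    """Map (flat spatial index, channel) to a channels-packed texel.

    Assumes batch size 1 and 4 channels per texel. Returns the texel
    position (x, y, z) and the component (0..3) within that texel.
    """
    x = spatial_idx % width    # column within the row
    y = spatial_idx // width   # row
    z = channel // 4           # group of 4 channels
    comp = channel % 4         # component inside the texel
    return (x, y, z), comp
```

For example, with `width=4`, spatial index 5 and channel 6 map to texel `(1, 1, 1)`, component 2.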

Weight packing uses the same 4OC×4IC blocked format as linear via the
pack_fp_linear_weight shader. Dispatch uses DynamicDispatchNode for correct
workgroup size updates during graph resizing.
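
The sketch below shows one plausible way to arrange a weight matrix into 4OC×4IC blocks. The exact memory order produced by `pack_fp_linear_weight` is not stated in this PR, so the block ordering (output-channel blocks outermost, row-major within each block) and the zero padding are assumptions, and `pack_blocked_4x4` is a hypothetical name.

```python
def pack_blocked_4x4(w):
    """Pack w ([OC][IC] nested lists) into a list of 4x4 blocks.

    Assumed layout: blocks ordered by (oc_block, ic_block); each block
    stores 16 values row-major (4 output channels x 4 input channels).
    Out-of-range entries are zero-padded.
    """
    oc, ic = len(w), len(w[0])
    oc_blocks = (oc + 3) // 4  # ceil(OC / 4)
    ic_blocks = (ic + 3) // 4  # ceil(IC / 4)

    packed = []
    for ob in range(oc_blocks):
        for ib in range(ic_blocks):
            block = []
            for r in range(4):
                for c in range(4):
                    o, i = ob * 4 + r, ib * 4 + c
                    in_range = o < oc and i < ic
                    block.append(w[o][i] if in_range else 0.0)
            packed.append(block)
    return packed
```

Grouping weights this way lets a shader invocation fetch a full 4×4 tile with contiguous reads, which is the usual motivation for blocked packing.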

Only the stride=1, padding=0 pointwise path is changed; the general conv2d_pw
shader for arbitrary stride/padding is left unchanged.
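
The path selection described above can be sketched as a simple predicate. The shader names come from this description; the function `select_pointwise_shader` and its tuple-based signature are hypothetical and do not mirror the actual C++ dispatch code.

```python
def select_pointwise_shader(kernel_size, stride, padding):
    """Pick a shader name for a pointwise (1x1) conv2d.

    Hypothetical helper: only the stride=1, padding=0 case takes the
    new tiled path; every other pointwise case keeps the general shader.
    """
    if kernel_size != (1, 1):
        raise ValueError("not a pointwise convolution")
    if stride == (1, 1) and padding == (0, 0):
        return "conv2d_pw_tiled"
    return "conv2d_pw"
```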

EdgeTAM first frame on Samsung S25 (Adreno 830): 208 ms → 196 ms (~6%).

Authored with Claude.

Differential Revision: D96756792

pytorch-bot · 2026-03-18T14:24:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18292

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit db62c0f with merge base ed57040:

NEW FAILURES - The following jobs have failed:

Build Presets / linux (linux, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11-aarch64) / build (gh)
Error response from daemon: Get "https://308535385114.dkr.ecr.us-east-1.amazonaws.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t 33d9088e14a8bd420647189f45feb4c46ad7acf3b10071b97ba9cbb15ae89bf4 /exec failed with exit code 139

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-18T14:25:16Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track of your important work and include it in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

This was referenced Mar 18, 2026

[ET-VK] Fix staging buffer allocation to check all memory types for HOST_CACHED #18291

Merged

[ET-VK][conv2d_dw] Extract depthwise dispatch into Conv2dDW.cpp with device-based tile selection #18293

Merged

meta-cla bot added the CLA Signed label Mar 18, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 18, 2026

manuelcandales approved these changes Mar 18, 2026

meta-codesync bot merged commit f5ae537 into gh/SS-JIA/492/base Mar 18, 2026
132 of 139 checks passed

meta-codesync bot deleted the gh/SS-JIA/492/head branch March 18, 2026 18:42

meta-codesync bot temporarily deployed to cherry-pick-bot March 18, 2026 18:42 Inactive

pytorchbot mentioned this pull request Mar 18, 2026

[ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing #18300

Merged

meta-codesync[bot] merged 2 commits into gh/SS-JIA/492/base from gh/SS-JIA/492/head
