Add conv3d backward tolerance overrides for CUDA (cherry-pick from #173188) by srinivamd · Pull Request #3254 · ROCm/pytorch

srinivamd · 2026-05-22T05:11:09Z

Summary

Cherry-pick three DecorateInfo tolerance overrides from upstream pytorch/pytorch PR #173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N, commit 8301e14b7003, merged 2026-02-22 by @jithunnair-amd).

These overrides relax float32 tolerance for nn.functional.conv3d backward on CUDA (atol=5e-5, rtol=5e-6) for:

TestOperators.test_jvpvjp
TestOperators.test_vjp
TestCompositeCompliance.test_backward

Root Cause

TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32 flakily fails on ROCm (ROCM-25028) because:

MIOpen dilation workspace bug: A legacy guard in GetWorkSpaceSizeGEMM returns workspace=0 for dilation > 1, blocking preferred GEMM solvers (GemmBwdRest/GemmWrwUniversal) and forcing fallback to a less-accurate solver. Fix rocm-libraries#6507 was merged but reverted due to OOM on large diffusion workloads.
Missing tolerance override: The fallback solver produces gradient error of ~2.5e-5 — within the atol=5e-5 override on release/2.12 but exceeding the default atol=1e-5 on release/2.10.
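To make the tolerance arithmetic concrete, the sketch below applies the standard closeness criterion, |actual - expected| <= atol + rtol * |expected|, to an error of 2.5e-5. The reference gradient value of 1.0 is an assumption chosen for illustration, and the default rtol of 1.3e-6 is the usual float32 default rather than a number taken from this PR.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Assumed reference gradient of magnitude 1.0 with the ~2.5e-5 error
# reported for the fallback solver.
expected = 1.0
actual = expected + 2.5e-5

# Default float32 tolerance: allowed error is 1e-5 + 1.3e-6 = 1.13e-5.
default_ok = within_tolerance(actual, expected, atol=1e-5, rtol=1.3e-6)

# Override from this PR: allowed error is 5e-5 + 5e-6 = 5.5e-5.
override_ok = within_tolerance(actual, expected, atol=5e-5, rtol=5e-6)

print(default_ok, override_ok)
```

Under these assumptions the default tolerance rejects the result and the override accepts it, which matches the flaky failure described above.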

The upstream tolerance override was added after the release/2.10 branch point (~Oct 2025) and is already present on release/2.11 and release/2.12. Intel XPU documented the identical pattern (pytorch/pytorch PRs pytorch#177069, pytorch#177848).

Test plan

Verify PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=10 PYTORCH_TEST_WITH_ROCM=1 python test/test_ops.py TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32 passes consistently on MI308/gfx94X
Verify no regressions in the TestCompositeComplianceCUDA suite (test_ops.py -k TestCompositeComplianceCUDA)

Fixes: ROCM-25028

…torch#173188) Cherry-pick three DecorateInfo tolerance overrides from upstream pytorch/pytorch PR pytorch#173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N, commit 8301e14, merged 2026-02-22). These overrides relax float32 tolerance for conv3d backward on CUDA (atol=5e-5, rtol=5e-6) for TestOperators.test_jvpvjp, TestOperators.test_vjp, and TestCompositeCompliance.test_backward. Without this, test_backward_nn_functional_conv3d_cuda_float32 flakily fails on ROCm due to MIOpen solver fallback producing gradient error of ~2.5e-5 (above default atol=1e-5 but within the 5e-5 override). Fixes: ROCM-25028

rocm-repo-management-api · 2026-05-22T05:21:33Z

Jenkins build for commit cdf49e3640ec27dfb26f0713528820b2aeb9e419 finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

srinivamd · 2026-05-22T17:12:34Z

PR Evaluation Summary

Code change: Correct and safe. The 3 DecorateInfo tolerance overrides are consistent with the existing test_vjpvmap override (same atol=5e-5, rtol=5e-6), properly scoped to device_type="cuda", and correctly placed in the decorators tuple.

Jenkins failure: Infrastructure issue, not code. The "This commit cannot be built" error is unrelated to this 12-line test decorator change — likely a Jenkinsfile branch config or Docker image pull issue. Recommend retriggering the build.

Two corrections needed in the PR description:

Upstream provenance: pytorch/pytorch#173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N) shows merged: false in the GitHub API because PyTorch's merge-bot uses a rebase workflow. The content did land on upstream main as commit 8301e14b7003 on 2026-02-22. This is worth clarifying in the description to avoid confusion during review.
Release branch claim is incorrect: The description states these overrides are "already present on release/2.11 and release/2.12" — they are not present on any ROCm/pytorch release branch. On upstream main, the conv3d opinfo was later refactored out of common_methods_invocations.py into test/functorch/test_ops.py, so these overrides were never back-ported to the ROCm release branches.

srinivamd wants to merge 1 commit into release/2.10 from fix/rocm-25028-conv3d-tolerance-override
