Add conv3d backward tolerance overrides for CUDA (cherry-pick from #173188) #3254

Open
srinivamd wants to merge 1 commit into release/2.10 from fix/rocm-25028-conv3d-tolerance-override
Conversation

@srinivamd
@srinivamd srinivamd commented May 22, 2026

Summary

Cherry-pick three DecorateInfo tolerance overrides from upstream pytorch/pytorch PR #173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N, commit 8301e14b7003, merged 2026-02-22 by @jithunnair-amd).

These overrides relax float32 tolerance for nn.functional.conv3d backward on CUDA (atol=5e-5, rtol=5e-6) for:

  • TestOperators.test_jvpvjp
  • TestOperators.test_vjp
  • TestCompositeCompliance.test_backward

Root Cause

TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32 fails intermittently on ROCm (ROCM-25028) for two reasons:

  1. MIOpen dilation workspace bug: A legacy guard in GetWorkSpaceSizeGEMM returns workspace=0 for dilation > 1, which blocks the preferred GEMM solvers (GemmBwdRest/GemmWrwUniversal) and forces a fallback to a less accurate solver. The fix (rocm-libraries#6507) was merged but later reverted due to OOM on large diffusion workloads.

  2. Missing tolerance override: The fallback solver produces a gradient error of ~2.5e-5, which is within the atol=5e-5 override on release/2.12 but exceeds the default atol=1e-5 on release/2.10.
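
To make the arithmetic in item 2 concrete, here is a minimal sketch of the closeness rule documented for torch.testing.assert_close, |actual - expected| <= atol + rtol * |expected|. It uses plain Python rather than the test harness, and the reference value of 1.0 is a hypothetical stand-in for a gradient element; the default float32 rtol of 1.3e-6 is the assert_close default and is not stated in this PR. The real test compares whole tensors, so treat this as an illustration only.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical gradient element; only the ~2.5e-5 error comes from this PR.
expected = 1.0
actual = expected + 2.5e-5

# Default float32 tolerances (atol=1e-5 from this PR, rtol=1.3e-6 assumed
# from the torch.testing.assert_close defaults).
default_ok = within_tolerance(actual, expected, atol=1e-5, rtol=1.3e-6)

# Tolerances from the cherry-picked override.
override_ok = within_tolerance(actual, expected, atol=5e-5, rtol=5e-6)

print(f"default tolerances pass:  {default_ok}")
print(f"override tolerances pass: {override_ok}")
```

With an expected magnitude near 1, the default bound is about 1.13e-5 and the override bound is about 5.5e-5, so the same 2.5e-5 error fails the first check and passes the second.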

The upstream tolerance override was added after the release/2.10 branch point (~Oct 2025) and is already present on release/2.11 and release/2.12. Intel XPU documented the identical pattern (pytorch/pytorch PRs pytorch#177069, pytorch#177848).

Test plan

  • Verify PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=10 PYTORCH_TEST_WITH_ROCM=1 python test/test_ops.py TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32 passes consistently on MI308/gfx94X
  • Verify no regressions in the test_ops.py -k TestCompositeComplianceCUDA test suite
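
Since the first check asks for the test to pass consistently, one way to exercise it is to rerun the command several times and count non-zero exits. The helper below is a rough sketch and not part of this PR: the names count_failures, TEST_CMD, and TEST_ENV are hypothetical, and only the test name and the two environment variables are taken from the test plan.

```python
import os
import subprocess
import sys

# Test invocation and environment variables copied from the test plan.
TEST_CMD = [
    sys.executable,
    "test/test_ops.py",
    "TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32",
]
TEST_ENV = {
    "PYTORCH_OPINFO_SAMPLE_INPUT_INDEX": "10",
    "PYTORCH_TEST_WITH_ROCM": "1",
}


def count_failures(cmd, runs, extra_env=None):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    env = {**os.environ, **(extra_env or {})}
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

From the repository root, count_failures(TEST_CMD, runs=20, extra_env=TEST_ENV) should return 0 once the override is in place; the run count of 20 is an arbitrary choice.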

Fixes: ROCM-25028

…torch#173188)

Cherry-pick three DecorateInfo tolerance overrides from upstream
pytorch/pytorch PR pytorch#173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N,
commit 8301e14, merged 2026-02-22).

These overrides relax float32 tolerance for conv3d backward on CUDA
(atol=5e-5, rtol=5e-6) for TestOperators.test_jvpvjp,
TestOperators.test_vjp, and TestCompositeCompliance.test_backward.

Without this, test_backward_nn_functional_conv3d_cuda_float32 flakily
fails on ROCm due to MIOpen solver fallback producing gradient error
of ~2.5e-5 (above default atol=1e-5 but within the 5e-5 override).

Fixes: ROCM-25028
@rocm-repo-management-api
rocm-repo-management-api Bot commented May 22, 2026

Jenkins build for cdf49e3640ec27dfb26f0713528820b2aeb9e419 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@srinivamd
Author

PR Evaluation Summary

Code change: Correct and safe. The 3 DecorateInfo tolerance overrides are consistent with the existing test_vjpvmap override (same atol=5e-5, rtol=5e-6), properly scoped to device_type="cuda", and correctly placed in the decorators tuple.

Jenkins failure: Infrastructure issue, not code. The "This commit cannot be built" error is unrelated to this 12-line test decorator change; it is likely a Jenkinsfile branch config or Docker image pull issue. Recommend retriggering the build.

Two corrections needed in the PR description:

  1. Upstream provenance: pytorch/pytorch#173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N) shows merged: false in the GitHub API because PyTorch's merge bot uses a rebase workflow. The content did land on upstream main as commit 8301e14b7003 on 2026-02-22. Worth clarifying to avoid confusion during review.

  2. Release branch claim is incorrect: The description states these overrides are "already present on release/2.11 and release/2.12", but they are not present on any ROCm/pytorch release branch. On upstream main, the conv3d opinfo was later refactored out of common_methods_invocations.py into test/functorch/test_ops.py, so these overrides were never back-ported to the ROCm release branches.