Add conv3d backward tolerance overrides for CUDA (cherry-pick from #173188) #3254
srinivamd wants to merge 1 commit into
…torch#173188)

Cherry-pick three DecorateInfo tolerance overrides from upstream pytorch/pytorch PR pytorch#173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N, commit 8301e14, merged 2026-02-22). These overrides relax float32 tolerance for conv3d backward on CUDA (atol=5e-5, rtol=5e-6) for TestOperators.test_jvpvjp, TestOperators.test_vjp, and TestCompositeCompliance.test_backward.

Without this, test_backward_nn_functional_conv3d_cuda_float32 fails intermittently on ROCm because an MIOpen solver fallback produces a gradient error of ~2.5e-5 (above the default atol=1e-5 but within the 5e-5 override).

Fixes: ROCM-25028
Jenkins build for cdf49e3640ec27dfb26f0713528820b2aeb9e419 commit finished as FAILURE
PR Evaluation Summary
Code change: Correct and safe. The 3
Jenkins failure: Infrastructure issue, not code. The
Two corrections needed in the PR description:
Summary

Cherry-pick three DecorateInfo tolerance overrides from upstream pytorch/pytorch PR #173188 ([ROCm][CI] Upgrade ROCm CI to 7.2 - 4/N, commit 8301e14b7003, merged 2026-02-22 by @jithunnair-amd).

These overrides relax float32 tolerance for nn.functional.conv3d backward on CUDA (atol=5e-5, rtol=5e-6) for:

- TestOperators.test_jvpvjp
- TestOperators.test_vjp
- TestCompositeCompliance.test_backward

Root Cause

TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32 fails intermittently on ROCm (ROCM-25028) because:

- MIOpen dilation workspace bug: A legacy guard in GetWorkSpaceSizeGEMM returns workspace=0 for dilation > 1, blocking the preferred GEMM solvers (GemmBwdRest/GemmWrwUniversal) and forcing fallback to a less-accurate solver. Fix rocm-libraries#6507 was merged but reverted due to OOM on large diffusion workloads.
- Missing tolerance override: The fallback solver produces a gradient error of ~2.5e-5, which is within the atol=5e-5 override on release/2.12 but exceeds the default atol=1e-5 on release/2.10.

The upstream tolerance override was added after the release/2.10 branch point (~Oct 2025) and is already present on release/2.11 and release/2.12. Intel XPU documented the identical pattern (pytorch/pytorch PRs pytorch#177069, pytorch#177848).
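To make the tolerance arithmetic in the root cause concrete, here is a minimal sketch of an allclose-style check. The atol values and the ~2.5e-5 error come from the description above. The reference gradient of 1.0 is a made-up value, and the default rtol of 1.3e-6 is an assumption borrowed from the float32 default documented for torch.testing.assert_close; the test harness may apply a different rule.

```python
# Allclose-style criterion: |actual - expected| <= atol + rtol * |expected|.
# Assumptions for illustration only: expected = 1.0 and default rtol = 1.3e-6.

def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Return True when actual is close enough to expected."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

expected = 1.0                # hypothetical reference gradient value
actual = expected + 2.5e-5    # error size reported for the fallback solver

# Default float32 tolerance (atol from the description, rtol assumed).
default_ok = within_tolerance(actual, expected, atol=1e-5, rtol=1.3e-6)

# Override cherry-picked by this PR.
override_ok = within_tolerance(actual, expected, atol=5e-5, rtol=5e-6)

print(default_ok, override_ok)  # prints: False True
```

Under these assumptions the allowed deviation is about 1.13e-5 by default and 5.5e-5 with the override, so an error of 2.5e-5 fails the former and passes the latter.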
Test plan
- PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=10 PYTORCH_TEST_WITH_ROCM=1 python test/test_ops.py TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32 passes consistently on MI308/gfx94X
- test_ops.py -k TestCompositeComplianceCUDA test suite

Fixes: ROCM-25028
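Since the failure is intermittent, "passes consistently" is easier to check by looping the repro command from the test plan. The helper below is a hypothetical sketch, not part of this PR: count_failures, REPRO_ENV, and REPRO_ARGV are illustrative names, and it uses the current interpreter (sys.executable) in place of the bare python in the test plan.

```python
import os
import subprocess
import sys

def count_failures(argv, extra_env, runs):
    """Run argv `runs` times with extra_env merged into the current
    environment and return how many runs exited with a nonzero status."""
    env = {**os.environ, **extra_env}
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            argv,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures

# Environment variables and test id copied from the test plan.
REPRO_ENV = {
    "PYTORCH_OPINFO_SAMPLE_INPUT_INDEX": "10",
    "PYTORCH_TEST_WITH_ROCM": "1",
}
REPRO_ARGV = [
    sys.executable,
    "test/test_ops.py",
    "TestCompositeComplianceCUDA.test_backward_nn_functional_conv3d_cuda_float32",
]
```

Calling count_failures(REPRO_ARGV, REPRO_ENV, runs=20) from the root of a PyTorch checkout would report how many of 20 runs failed; the run count is arbitrary and should be raised if the failure rate is low.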