Enable Device-Side Assertions (DSA) for ROCm reductions by jerrymannil · Pull Request #3239 · ROCm/pytorch

jerrymannil · 2026-05-20T07:52:14Z

Summary

Wire CUDA Device-Side Assertions (DSA) through ReduceOp / reduce_kernel so reduction kernels participate in the DSA registry. There is no behavior change when TORCH_USE_CUDA_DSA is undefined, and the same holds on CUDA builds.

Changes

c10/cuda/CUDAException.h

Add TORCH_DSA_KERNEL_LAUNCH_T, a variant of TORCH_DSA_KERNEL_LAUNCH that accepts a parenthesized kernel expression as its first argument. This lets us launch templated kernels whose template-id contains top-level commas (e.g. reduce_kernel<max_threads / 4, 4, R>), which the existing macro cannot accept because the preprocessor would split the template argument list on those commas.
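As a rough illustration of the parenthesized-kernel idea, here is a host-only C++ sketch. It is not the macro added in this PR: SKETCH_LAUNCH_T, sketch_kernel and sketch_demo_launch are hypothetical names, and an ordinary function call stands in for the triple-chevron launch. It only shows that wrapping the template-id in parentheses lets it travel through the preprocessor as a single macro argument.

```cpp
#include <cassert>

// Hypothetical stand-in for a kernel whose template-id has top-level commas.
template <int Threads, int VecSize, typename R>
R sketch_kernel(R x) {
  return x * Threads + VecSize;
}

// Hypothetical launch macro: the first argument must already be wrapped in
// parentheses, so its commas are hidden from the preprocessor. The
// parenthesized expression is then called directly with the remaining args.
#define SKETCH_LAUNCH_T(kernel_paren, ...) kernel_paren(__VA_ARGS__)

// Expands to (sketch_kernel<128 / 4, 4, int>)(3).
int sketch_demo_launch() {
  return SKETCH_LAUNCH_T((sketch_kernel<128 / 4, 4, int>), 3);
}
```

Without the extra parentheses, the same call would hand the macro the fragments sketch_kernel<128 / 4, then 4, then int> as separate arguments.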

c10/cuda/CUDADeviceAssertion.h

Add CUDA_KERNEL_ASSERT2_RET(condition, ret_expr), a variant of CUDA_KERNEL_ASSERT2 for non-void device functions. On failure it records the failure in the DSA registry and returns ret_expr from the enclosing function.
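The assert-and-return pattern can be sketched on the host as follows. This is a simplified analogy, not the PR's macro: SketchRegistry, SketchFailure, SKETCH_ASSERT_RET and checked_divide are hypothetical, and the registry here is a plain std::vector rather than the DSA registry.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of one failed assertion.
struct SketchFailure {
  const char* expr;
  const char* func;
  int line;
};

// Hypothetical host-side stand-in for the assertion registry.
struct SketchRegistry {
  std::vector<SketchFailure> failures;
};

// On failure: record the condition text and location, then return ret_expr
// from the enclosing (non-void) function.
#define SKETCH_ASSERT_RET(registry, condition, ret_expr)                   \
  do {                                                                     \
    if (!(condition)) {                                                    \
      (registry).failures.push_back({#condition, __func__, __LINE__});     \
      return ret_expr;                                                     \
    }                                                                      \
  } while (0)

// A non-void function: a bare "return;" would not compile here.
int checked_divide(SketchRegistry& reg, int a, int b) {
  SKETCH_ASSERT_RET(reg, b != 0, -1);
  return a / b;
}
```

The caller still gets a well-defined value back, and the failure remains available for later inspection instead of aborting at the assertion site.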

aten/src/ATen/native/cuda/Reduce.cuh

Include <c10/cuda/CUDADeviceAssertion.h>.
Add TORCH_DSA_KERNEL_ARGS to reduce_kernel and copy them onto the reduction value so member functions can use the macros via implicit member access.
Add assertions_data / assertion_caller_id members on ReduceOp with the exact names the macros expect.
Switch the three reduce_kernel launches in launch_reduce_kernel from triple-chevron to TORCH_DSA_KERNEL_LAUNCH_T.
Convert the 6 CUDA_KERNEL_ASSERT sites to CUDA_KERNEL_ASSERT2 (void methods) or CUDA_KERNEL_ASSERT2_RET (non-void methods) so failures route through the DSA registry instead of trap-on-assert.
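The point about member names can be illustrated with a host-only sketch. Everything below is hypothetical (SketchAssertionsData, SketchReduceOp, SKETCH_ASSERT2, sketch_reduce_kernel); only the names assertions_data, assertion_caller_id and noutputs come from the description above. The macro refers to the two names unqualified, so it works inside a member function as long as the enclosing object has members with exactly those names.

```cpp
#include <cassert>

// Hypothetical stand-in for the assertion storage.
struct SketchAssertionsData {
  int failure_count = 0;
  unsigned last_caller = 0;
};

// The macro uses the unqualified names assertions_data and
// assertion_caller_id; whatever is in scope under those names is used.
#define SKETCH_ASSERT2(condition)                           \
  do {                                                      \
    if (!(condition)) {                                     \
      assertions_data->failure_count++;                     \
      assertions_data->last_caller = assertion_caller_id;   \
      return;                                               \
    }                                                       \
  } while (0)

struct SketchReduceOp {
  // Members named exactly as the macro expects (implicit member access).
  SketchAssertionsData* assertions_data = nullptr;
  unsigned assertion_caller_id = 0;
  int noutputs = 1;

  void run() { SKETCH_ASSERT2(noutputs == 1); }
};

// The "kernel" receives the two extra arguments and copies them onto the
// reduction value before calling into its member functions.
void sketch_reduce_kernel(SketchReduceOp op,
                          SketchAssertionsData* assertions_data,
                          unsigned assertion_caller_id) {
  op.assertions_data = assertions_data;
  op.assertion_caller_id = assertion_caller_id;
  op.run();
}
```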

Why the new macro variants

TORCH_DSA_KERNEL_LAUNCH and CUDA_KERNEL_ASSERT2 cover the common cases (non-templated or single-template-arg kernel; void-returning device function). The _T and _RET variants extend coverage to:

Kernels whose template-id contains top-level commas (preprocessor cannot pass them through a regular macro argument).
Assertions inside non-void device member functions (the bare return; built into CUDA_KERNEL_ASSERT2 is ill-formed there).

ReduceOp hits both limits, which is why the macro additions are necessary to enable DSA here.
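The comma limitation can be checked directly with a small argument-counting macro. This is a standalone sketch with hypothetical names (SKETCH_COUNT, kSplitCount, kParenCount); the template-id tokens are only counted by the preprocessor and never compiled, so reduce_kernel, max_threads and R do not need to be declared.

```cpp
#include <cassert>

// Hypothetical helper: evaluates to the number of macro arguments (1 to 5).
#define SKETCH_COUNT_IMPL(_1, _2, _3, _4, _5, N, ...) N
#define SKETCH_COUNT(...) SKETCH_COUNT_IMPL(__VA_ARGS__, 5, 4, 3, 2, 1, 0)

// Angle brackets do not group tokens for the preprocessor, so the
// template-id is split at each top-level comma into three arguments.
constexpr int kSplitCount = SKETCH_COUNT(reduce_kernel<max_threads / 4, 4, R>);

// Parentheses do group tokens, so the same text is one argument.
constexpr int kParenCount = SKETCH_COUNT((reduce_kernel<max_threads / 4, 4, R>));

static_assert(kSplitCount == 3, "unparenthesized template-id is split");
static_assert(kParenCount == 1, "parenthesized template-id stays whole");
```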

Test plan

CUDA build (TORCH_USE_CUDA_DSA undefined): unchanged. The CUDA_KERNEL_ASSERT2* macros fall back to assert(condition), and TORCH_DSA_KERNEL_LAUNCH_T still launches with the same parameters as before.
CUDA build with -DTORCH_USE_CUDA_DSA: reduction kernels compile.
ROCm build (hipified counterparts of these files): reduction kernels compile.
Runtime DSA test: set c10::cuda::CUDAKernelLaunchRegistry::get_singleton_ref().enabled_at_runtime = true, induce a reduction-path device assertion (e.g. by violating the noutputs == 1 precondition in a debug build), and verify that the host-side registry reports the kernel name, file, function, and line.

Made with Cursor

Wire CUDA Device-Side Assertions through ReduceOp / reduce_kernel so reduction kernels participate in the DSA registry (no behavior change when TORCH_USE_CUDA_DSA is undefined; same on CUDA builds).

c10/cuda/CUDAException.h
Add TORCH_DSA_KERNEL_LAUNCH_T, a variant of TORCH_DSA_KERNEL_LAUNCH that accepts a parenthesized kernel expression as its first argument. This lets us launch templated kernels whose template-id contains top-level commas (e.g. reduce_kernel<max_threads/4, 4, R>), which the existing macro cannot accept because the preprocessor would split the template argument list on those commas.

c10/cuda/CUDADeviceAssertion.h
Add CUDA_KERNEL_ASSERT2_RET, a variant of CUDA_KERNEL_ASSERT2 for use inside non-void-returning device functions. On failure it records into the DSA registry and returns the supplied default expression.

aten/src/ATen/native/cuda/Reduce.cuh
- Add c10/cuda/CUDADeviceAssertion.h include.
- Add TORCH_DSA_KERNEL_ARGS to reduce_kernel and copy them onto the reduction value so member functions can use the macros via implicit member access.
- Add assertions_data / assertion_caller_id members on ReduceOp with the exact names the macros expect.
- Switch the three reduce_kernel launches in launch_reduce_kernel from plain triple-chevron to TORCH_DSA_KERNEL_LAUNCH_T.
- Convert the 6 CUDA_KERNEL_ASSERT sites to CUDA_KERNEL_ASSERT2 (void methods) or CUDA_KERNEL_ASSERT2_RET (non-void methods) so failures route through the DSA registry instead of trap-on-assert.

Co-authored-by: Cursor <cursoragent@cursor.com>

rocm-repo-management-api · 2026-05-20T08:07:01Z

Jenkins build for ff4abfbe296830ca75879277257858be9aee79ba commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

jerrymannil marked this pull request as draft May 20, 2026 08:05

jerrymannil wants to merge 1 commit into release/2.11 from dsa_test