Enable Device-Side Assertions (DSA) for ROCm reductions #3239

Draft
jerrymannil wants to merge 1 commit into release/2.11 from dsa_test
@jerrymannil
Collaborator

Summary

Wire CUDA Device-Side Assertions (DSA) through ReduceOp / reduce_kernel so reduction kernels participate in the DSA registry. There is no behavior change when TORCH_USE_CUDA_DSA is undefined, and this holds for CUDA builds as well as ROCm builds.

Changes

c10/cuda/CUDAException.h

  • Add TORCH_DSA_KERNEL_LAUNCH_T, a variant of TORCH_DSA_KERNEL_LAUNCH that accepts a parenthesized kernel expression as its first argument. This lets us launch templated kernels whose template-id contains top-level commas (e.g. reduce_kernel<max_threads / 4, 4, R>), which the existing macro cannot accept because the preprocessor would split the template argument list on those commas.
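The comma problem and the parenthesized workaround can be illustrated with a host-only sketch. Everything here (fake_kernel, STRIP_PARENS, FAKE_LAUNCH_T) is a hypothetical stand-in and not the macro added by this PR; in particular, the single appended constant only mimics the fact that a DSA launch macro appends extra arguments to the call.

```cpp
#include <cassert>

// Hypothetical host-only stand-in for a templated kernel such as reduce_kernel.
template <int NT, int VT, typename R>
int fake_kernel(R value, int appended_arg) {
  return NT * VT + static_cast<int>(value) + appended_arg;
}

// Removes one layer of parentheses: STRIP_PARENS((a, b)) expands to a, b.
#define STRIP_PARENS_IMPL(...) __VA_ARGS__
#define STRIP_PARENS(x) STRIP_PARENS_IMPL x

// Hypothetical launch macro. The first argument must be parenthesized so the
// preprocessor treats the whole template-id as a single macro argument.
// The trailing 1 stands in for the arguments a DSA launch macro would append.
#define FAKE_LAUNCH_T(kernel, ...) STRIP_PARENS(kernel)(__VA_ARGS__, 1)

int launch_demo() {
  // Without the extra parentheses, the preprocessor would split
  // fake_kernel<8, 4, int> at its commas into three macro arguments.
  return FAKE_LAUNCH_T((fake_kernel<8, 4, int>), 5);
}
```

Here launch_demo() expands to fake_kernel<8, 4, int>(5, 1). The parentheses exist only to survive macro argument parsing and are stripped again before the call is formed.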

c10/cuda/CUDADeviceAssertion.h

  • Add CUDA_KERNEL_ASSERT2_RET(condition, ret_expr), a variant of CUDA_KERNEL_ASSERT2 for non-void device functions. On failure it records the failure in the DSA registry and then returns ret_expr from the enclosing function.
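A minimal host-only sketch of the difference between the two assertion shapes follows. FakeAssertionsData, FAKE_ASSERT2 and FAKE_ASSERT2_RET are hypothetical names used for illustration; the real macros record far more detail (condition text, file, line, block and thread indices) and run on the device.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-only stand-in for the DSA failure buffer.
struct FakeAssertionsData {
  int failure_count = 0;
  uint32_t last_caller_id = 0;
};

// Shape of a void-only assertion: record the failure, then a bare `return;`.
#define FAKE_ASSERT2(condition)                                \
  do {                                                         \
    if (!(condition)) {                                        \
      assertions_data->failure_count++;                        \
      assertions_data->last_caller_id = assertion_caller_id;   \
      return;                                                  \
    }                                                          \
  } while (false)

// Shape of the _RET variant: same recording, but returns a caller-chosen value.
#define FAKE_ASSERT2_RET(condition, ret_expr)                  \
  do {                                                         \
    if (!(condition)) {                                        \
      assertions_data->failure_count++;                        \
      assertions_data->last_caller_id = assertion_caller_id;   \
      return ret_expr;                                         \
    }                                                          \
  } while (false)

// Works with the void-only macro because the function returns void.
void check_positive(int n, FakeAssertionsData* assertions_data,
                    uint32_t assertion_caller_id) {
  FAKE_ASSERT2(n > 0);
}

// A bare `return;` would not compile here, so the _RET shape is needed.
int checked_half(int n, FakeAssertionsData* assertions_data,
                 uint32_t assertion_caller_id) {
  FAKE_ASSERT2_RET(n % 2 == 0, -1);
  return n / 2;
}
```

Both macros look up assertions_data and assertion_caller_id by name in the enclosing scope, which is why those exact identifiers must be visible wherever the macro is used.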

aten/src/ATen/native/cuda/Reduce.cuh

  • Include <c10/cuda/CUDADeviceAssertion.h>.
  • Add TORCH_DSA_KERNEL_ARGS to reduce_kernel and copy the two arguments onto the by-value reduction object, so its member functions can use the macros via implicit member access.
  • Add assertions_data / assertion_caller_id members on ReduceOp with the exact names the macros expect.
  • Switch the three reduce_kernel launches in launch_reduce_kernel from triple-chevron to TORCH_DSA_KERNEL_LAUNCH_T.
  • Convert the 6 CUDA_KERNEL_ASSERT sites to CUDA_KERNEL_ASSERT2 (void methods) or CUDA_KERNEL_ASSERT2_RET (non-void methods) so failures route through the DSA registry instead of trap-on-assert.
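The plumbing described in this list can be sketched on the host as follows. FakeReduceOp, fake_reduce_kernel and FAKE_ASSERT2_RET are hypothetical simplifications, assuming that the kernel receives the op by value together with the two DSA arguments; this is not the code from Reduce.cuh.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-only stand-in for the DSA failure buffer.
struct FakeAssertionsData {
  int failure_count = 0;
  uint32_t last_caller_id = 0;
};

// Hypothetical assertion macro that resolves both names in the enclosing scope.
#define FAKE_ASSERT2_RET(condition, ret_expr)                  \
  do {                                                         \
    if (!(condition)) {                                        \
      assertions_data->failure_count++;                        \
      assertions_data->last_caller_id = assertion_caller_id;   \
      return ret_expr;                                         \
    }                                                          \
  } while (false)

// Simplified op: the members carry exactly the names the macro expects.
struct FakeReduceOp {
  int noutputs = 1;
  FakeAssertionsData* assertions_data = nullptr;
  uint32_t assertion_caller_id = 0;

  int run(int value) const {
    // Unqualified names resolve to this->assertions_data and
    // this->assertion_caller_id (implicit member access).
    FAKE_ASSERT2_RET(noutputs == 1, 0);
    return value * 2;
  }
};

// Stand-in for the kernel: takes the op by value plus the two DSA arguments
// and copies them onto its local copy of the op before running it.
int fake_reduce_kernel(FakeReduceOp op, int value,
                       FakeAssertionsData* assertions_data,
                       uint32_t assertion_caller_id) {
  op.assertions_data = assertions_data;
  op.assertion_caller_id = assertion_caller_id;
  return op.run(value);
}
```

Storing the pointer and caller id as members avoids threading two extra parameters through every member function that contains an assertion.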

Why the new macro variants

TORCH_DSA_KERNEL_LAUNCH and CUDA_KERNEL_ASSERT2 cover the common cases (non-templated or single-template-arg kernel; void-returning device function). The _T and _RET variants extend coverage to:

  • Kernels whose template-id contains top-level commas (preprocessor cannot pass them through a regular macro argument).
  • Assertions inside non-void device member functions (the bare return; statement built into CUDA_KERNEL_ASSERT2 is ill-formed there).

Both ReduceOp cases hit these limits, which is why the macro additions are necessary to enable DSA here.

Test plan

  • CUDA build (TORCH_USE_CUDA_DSA undefined): unchanged — CUDA_KERNEL_ASSERT2* fall back to assert(condition), TORCH_DSA_KERNEL_LAUNCH_T still launches with the same parameters as before.
  • CUDA build with -DTORCH_USE_CUDA_DSA: reduction kernels compile.
  • ROCm build (hipified counterparts of these files): reduction kernels compile.
  • Runtime DSA test: set c10::cuda::CUDAKernelLaunchRegistry::get_singleton_ref().enabled_at_runtime = true; and induce a reduction-path device assertion (e.g. by violating the noutputs == 1 precondition in a debug build); verify that the host-side registry surfaces the kernel name, file, function, and line.

Made with Cursor

Wire CUDA Device-Side Assertions through ReduceOp / reduce_kernel so reduction
kernels participate in the DSA registry (no behavior change when
TORCH_USE_CUDA_DSA is undefined; same on CUDA builds).

c10/cuda/CUDAException.h
  Add TORCH_DSA_KERNEL_LAUNCH_T, a variant of TORCH_DSA_KERNEL_LAUNCH that
  accepts a parenthesized kernel expression as its first argument. This
  lets us launch templated kernels whose template-id contains top-level
  commas (e.g. reduce_kernel<max_threads/4, 4, R>), which the existing
  macro cannot accept because the preprocessor would split the template
  argument list on those commas.

c10/cuda/CUDADeviceAssertion.h
  Add CUDA_KERNEL_ASSERT2_RET, a variant of CUDA_KERNEL_ASSERT2 for use
  inside non-void-returning device functions. On failure it records into
  the DSA registry and returns the supplied default expression.

aten/src/ATen/native/cuda/Reduce.cuh
  - Add c10/cuda/CUDADeviceAssertion.h include.
  - Add TORCH_DSA_KERNEL_ARGS to reduce_kernel and copy them onto the
    reduction value so member functions can use the macros via implicit
    member access.
  - Add assertions_data / assertion_caller_id members on ReduceOp with
    the exact names the macros expect.
  - Switch the three reduce_kernel launches in launch_reduce_kernel from
    plain triple-chevron to TORCH_DSA_KERNEL_LAUNCH_T.
  - Convert the 6 CUDA_KERNEL_ASSERT sites to CUDA_KERNEL_ASSERT2
    (void methods) or CUDA_KERNEL_ASSERT2_RET (non-void methods) so
    failures route through the DSA registry instead of trap-on-assert.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jerrymannil jerrymannil marked this pull request as draft May 20, 2026 08:05
@rocm-repo-management-api

rocm-repo-management-api Bot commented May 20, 2026

Jenkins build for ff4abfbe296830ca75879277257858be9aee79ba commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
