Enable Device-Side Assertions (DSA) for ROCm reductions #3239
Draft
jerrymannil wants to merge 1 commit into
Wire CUDA Device-Side Assertions through ReduceOp / reduce_kernel so reduction
kernels participate in the DSA registry (no behavior change when
TORCH_USE_CUDA_DSA is undefined; same on CUDA builds).
c10/cuda/CUDAException.h
Add TORCH_DSA_KERNEL_LAUNCH_T, a variant of TORCH_DSA_KERNEL_LAUNCH that
accepts a parenthesized kernel expression as its first argument. This
lets us launch templated kernels whose template-id contains top-level
commas (e.g. reduce_kernel<max_threads/4, 4, R>), which the existing
macro cannot accept because the preprocessor would split the template
argument list on those commas.
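
The comma problem and the parenthesized-argument workaround can be sketched in plain host C++. This is only an illustration under stated assumptions: `LAUNCH`, `LAUNCH_T`, `STRIP_PARENS`, `twice`, and `scaled` are hypothetical names, ordinary function calls stand in for kernel launches, and the way the real TORCH_DSA_KERNEL_LAUNCH_T consumes its parenthesized argument may differ.

```cpp
#include <cassert>

// Hypothetical stand-ins for kernels (plain host functions, not CUDA).
template <typename T>
T twice(T x) { return x * 2; }

template <int NT, int VT, typename T>
T scaled(T x) { return x * NT * VT; }

// Naive launcher: the kernel is one macro parameter. A template-id such as
// scaled<2, 4, int> would be split at its top-level commas into several
// macro arguments, so it cannot be passed here directly.
#define LAUNCH(kernel, ...) kernel(__VA_ARGS__)

// Helper that drops one pair of enclosing parentheses:
// STRIP_PARENS (a, b, c)  expands to  a, b, c
#define STRIP_PARENS(...) __VA_ARGS__

// "_T"-style launcher (assumed design): the caller wraps the kernel
// expression in parentheses so its commas are hidden from the preprocessor,
// and the macro removes the parentheses again before making the call.
#define LAUNCH_T(kernel, ...) STRIP_PARENS kernel(__VA_ARGS__)

int naive_demo() {
  return LAUNCH(twice<int>, 5);             // fine: no comma in the template-id
}

int launch_demo() {
  // LAUNCH(scaled<2, 4, int>, 3) would not compile: the preprocessor would
  // see the arguments "scaled<2", "4", "int>", and "3".
  return LAUNCH_T((scaled<2, 4, int>), 3);  // expands to scaled<2, 4, int>(3)
}
```

Calling `naive_demo()` and `launch_demo()` exercises both forms; only the second can carry a template-id with more than one template argument.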
c10/cuda/CUDADeviceAssertion.h
Add CUDA_KERNEL_ASSERT2_RET, a variant of CUDA_KERNEL_ASSERT2 for use
inside non-void-returning device functions. On failure it records into
the DSA registry and returns the supplied default expression.
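
A host-only sketch of the "record, then return a default" pattern may make the difference between the two assertion variants concrete. Everything here is hypothetical (`FailureLog`, `CHECK_OR_RETURN`, `CHECK_OR_RETURN_VAL`, `safe_div`, `store_if_positive`); a simple counter stands in for the DSA registry, and the real macros take their bookkeeping from implicitly named variables rather than an explicit `log` argument.

```cpp
#include <cassert>

// Hypothetical stand-in for the DSA registry: it only counts failures.
struct FailureLog {
  int count = 0;
};

// Void flavour: on failure, record and leave the function with `return;`.
// Using this inside a non-void function is ill-formed, which is the gap
// the value-returning flavour below is meant to fill.
#define CHECK_OR_RETURN(log, cond) \
  do {                             \
    if (!(cond)) {                 \
      (log).count++;               \
      return;                      \
    }                              \
  } while (0)

// Non-void flavour: on failure, record and return the supplied default.
#define CHECK_OR_RETURN_VAL(log, cond, ret_expr) \
  do {                                           \
    if (!(cond)) {                               \
      (log).count++;                             \
      return (ret_expr);                         \
    }                                            \
  } while (0)

// A void function can use the plain variant.
void store_if_positive(FailureLog& log, int value, int& out) {
  CHECK_OR_RETURN(log, value > 0);
  out = value;
}

// A value-returning function needs the variant with a default result.
int safe_div(FailureLog& log, int a, int b) {
  CHECK_OR_RETURN_VAL(log, b != 0, -1);
  return a / b;
}
```

The default value (`-1` above) is chosen by the caller of the macro; the sketch does not imply any particular default used in Reduce.cuh.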
aten/src/ATen/native/cuda/Reduce.cuh
- Add c10/cuda/CUDADeviceAssertion.h include.
- Add TORCH_DSA_KERNEL_ARGS to reduce_kernel and copy them onto the
reduction value so member functions can use the macros via implicit
member access.
- Add assertions_data / assertion_caller_id members on ReduceOp with
the exact names the macros expect.
- Switch the three reduce_kernel launches in launch_reduce_kernel from
plain triple-chevron to TORCH_DSA_KERNEL_LAUNCH_T.
- Convert the 6 CUDA_KERNEL_ASSERT sites to CUDA_KERNEL_ASSERT2
(void methods) or CUDA_KERNEL_ASSERT2_RET (non-void methods) so
failures route through the DSA registry instead of trap-on-assert.
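
The "exact names the macros expect" point relies on ordinary name lookup: a macro that mentions `assertions_data` and `assertion_caller_id` works wherever those identifiers are visible, whether as function parameters or as class members. A minimal host-side sketch, assuming a hypothetical `CHECK` macro, `Assertions` record, `Op` functor, and `fake_kernel` function (none of which are the real PyTorch definitions):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the device-side assertion storage.
struct Assertions {
  int failures = 0;
  uint32_t last_caller = 0;
};

// The macro hard-codes two identifiers and leaves it to name lookup at the
// expansion site to find them (parameters in a kernel, members in a method).
#define CHECK(cond)                                        \
  do {                                                     \
    if (!(cond)) {                                         \
      assertions_data->failures++;                         \
      assertions_data->last_caller = assertion_caller_id;  \
    }                                                      \
  } while (0)

// Functor whose members carry the same names the macro uses.
struct Op {
  Assertions* assertions_data = nullptr;
  uint32_t assertion_caller_id = 0;

  void run(int noutputs) const {
    CHECK(noutputs == 1);  // resolves to this->assertions_data etc.
  }
};

// Stand-in for the kernel: receives the assertion arguments as parameters
// and copies them onto the functor before calling its member function.
void fake_kernel(Op op, int noutputs,
                 Assertions* assertions_data, uint32_t assertion_caller_id) {
  op.assertions_data = assertions_data;
  op.assertion_caller_id = assertion_caller_id;
  CHECK(noutputs >= 0);    // resolves to the function parameters
  op.run(noutputs);
}
```

Copying the values onto the functor keeps the macro call sites inside member functions free of extra arguments, at the cost of two additional members per functor instance.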
Co-authored-by: Cursor <cursoragent@cursor.com>
Jenkins build for commit ff4abfbe296830ca75879277257858be9aee79ba finished as FAILURE.
Summary
Wire CUDA Device-Side Assertions (DSA) through ReduceOp / reduce_kernel so reduction kernels participate in the DSA registry. No behavior change when TORCH_USE_CUDA_DSA is undefined; same on CUDA builds.

Changes
c10/cuda/CUDAException.h
- Add TORCH_DSA_KERNEL_LAUNCH_T, a variant of TORCH_DSA_KERNEL_LAUNCH that accepts a parenthesized kernel expression as its first argument. This lets us launch templated kernels whose template-id contains top-level commas (e.g. reduce_kernel<max_threads / 4, 4, R>), which the existing macro cannot accept because the preprocessor would split the template argument list on those commas.
c10/cuda/CUDADeviceAssertion.h
- Add CUDA_KERNEL_ASSERT2_RET(condition, ret_expr), a variant of CUDA_KERNEL_ASSERT2 for non-void device functions. On failure it records into the DSA registry and returns ret_expr from the enclosing function.
aten/src/ATen/native/cuda/Reduce.cuh
- Include <c10/cuda/CUDADeviceAssertion.h>.
- Add TORCH_DSA_KERNEL_ARGS to reduce_kernel and copy them onto the reduction value so member functions can use the macros via implicit member access.
- Add assertions_data / assertion_caller_id members on ReduceOp with the exact names the macros expect.
- Switch the three reduce_kernel launches in launch_reduce_kernel from triple-chevron to TORCH_DSA_KERNEL_LAUNCH_T.
- Convert the 6 CUDA_KERNEL_ASSERT sites to CUDA_KERNEL_ASSERT2 (void methods) or CUDA_KERNEL_ASSERT2_RET (non-void methods) so failures route through the DSA registry instead of trap-on-assert.

Why the new macro variants
TORCH_DSA_KERNEL_LAUNCH and CUDA_KERNEL_ASSERT2 cover the common cases (non-templated or single-template-arg kernel; void-returning device function). The _T and _RET variants extend coverage to:
- Templated kernels whose template-id contains top-level commas (the preprocessor would split the template argument list on them).
- Non-void device functions (CUDA_KERNEL_ASSERT2's built-in return; is ill-typed there).
Both ReduceOp cases hit these limits, which is why the macro additions are necessary to enable DSA here.

Test plan
- With TORCH_USE_CUDA_DSA undefined: unchanged. CUDA_KERNEL_ASSERT2* fall back to assert(condition), and TORCH_DSA_KERNEL_LAUNCH_T still launches with the same parameters as before.
- With -DTORCH_USE_CUDA_DSA: reduction kernels compile.
- Set c10::cuda::CUDAKernelLaunchRegistry::get_singleton_ref().enabled_at_runtime = true; and induce a reduction-path device assertion (e.g. by violating the noutputs == 1 precondition in a debug build); verify that the host-side registry surfaces the kernel name, file, function, and line.

Made with Cursor