[WIP][CUDA][UR] Cuda multi gpu debug #21362

Draft
kekaczma wants to merge 7 commits into sycl from cuda-multi-gpu-debug

@kekaczma
Contributor

No description provided.

Add detailed logging infrastructure for CUDA adapter to help diagnose
multi-GPU issues. Controlled by UR_CUDA_CALL_TRACE=1 environment variable.

Features:
- Automatic logging of all CUDA API calls through UR_CHECK_ERROR macro
- Detailed parameter logging for key operations (kernel launch, memcpy)
- Context switch tracking with addresses
- Success/error result logging
- Zero overhead when disabled (compile-time check)

Logged operations:
- cuCtxGetCurrent/cuCtxSetCurrent with context addresses
- cuLaunchKernel with grid, block, shared memory, stream
- cuMemcpyAsync/cuMemcpyDtoDAsync with src/dst/size
- All other cuXxx functions with full call signature

Usage:
  UR_CUDA_CALL_TRACE=1 ./test-binary

This enables deep debugging of multi-GPU synchronization, context
management, and memory operations without modifying test code.

- Fix ScopedContext to restore original context in destructor
- Detect cross-device events in enqueueEventsWait
- Use cuEventSynchronize for cross-device (host sync)
- Keep cuStreamWaitEvent for same-device (async stream sync)

This fixes urEnqueueKernelLaunchIncrementMultiDeviceTest failure
where cuStreamWaitEvent cannot synchronize streams from different devices.

- Fix ScopedContext to handle nullptr Device without throwing
- Only restore context in destructor if original was non-null
- Remove CUDA skip from urEnqueueKernelLaunchIncrementMultiDeviceTest
- Remove unnecessary P2P support check (P2P is optimization, not requirement)
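The ScopedContext behavior described in these commits (tolerate a null device, restore the original context only if it was non-null) can be sketched as a small RAII guard. This is an illustration under assumptions, not the adapter's class: `SketchContext`, `SketchDevice`, and `currentContext` are hypothetical stand-ins, with a thread-local slot playing the role of cuCtxGetCurrent/cuCtxSetCurrent.

```cpp
// Hypothetical stand-in for the object behind a CUcontext handle.
struct SketchContext {
  int DeviceId;
};
using ContextHandle = SketchContext *;

// Thread-local "current context" slot, standing in for the driver's
// cuCtxGetCurrent / cuCtxSetCurrent state.
inline ContextHandle &currentContext() {
  thread_local ContextHandle Current = nullptr;
  return Current;
}

// Hypothetical device type exposing its native context.
struct SketchDevice {
  ContextHandle Native;
  ContextHandle getNativeContext() const { return Native; }
};

class ScopedContext {
public:
  explicit ScopedContext(const SketchDevice *Device)
      : Original(currentContext()) {
    // A null device is tolerated: leave the current context untouched.
    if (Device == nullptr)
      return;
    ContextHandle Desired = Device->getNativeContext();
    if (Desired != Original) {
      currentContext() = Desired;
      Switched = true;
    }
  }

  ~ScopedContext() {
    // Restore only if a switch happened and there was a context to return to.
    if (Switched && Original != nullptr)
      currentContext() = Original;
  }

  ScopedContext(const ScopedContext &) = delete;
  ScopedContext &operator=(const ScopedContext &) = delete;

private:
  ContextHandle Original;
  bool Switched = false;
};
```

One consequence of the "restore only if non-null" rule in this sketch: when no context was current on entry, the device's context stays current after the guard goes out of scope.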

- Changed detection from Device to Context comparison
- After cuEventSynchronize, record barrier event in target stream
- This establishes ordering for future operations in the stream

@kekaczma force-pushed the cuda-multi-gpu-debug branch from ffb8612 to 0ab2a88 on February 25, 2026 12:01

Previous approach compared UR context handles, which always matched
since the test wraps all devices in a single ur_context_handle_t.

Now compare the underlying CUcontext via Device->getNativeContext().
Each physical device has a unique CUcontext, enabling proper detection
of cross-device event synchronization scenarios.
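To illustrate why the handle comparison never fired, here is a small sketch with stand-in types. `SketchNativeContext`, `SketchUrContext`, `SketchDev`, `SketchQueue`, `SketchEvt`, and both helper functions are hypothetical. Only the idea of comparing the result of `getNativeContext()` comes from the commit message.

```cpp
// Hypothetical stand-ins: one native context per physical device, and one
// UR-level context object that wraps several devices.
struct SketchNativeContext {};
struct SketchUrContext {};

struct SketchDev {
  SketchNativeContext *Native;
  SketchNativeContext *getNativeContext() const { return Native; }
};

struct SketchQueue {
  SketchUrContext *Context;
  SketchDev *Device;
};

struct SketchEvt {
  SketchUrContext *Context;
  SketchDev *Device;
};

// Earlier approach: compares UR context handles. When one UR context wraps
// every device, this reports "same context" for every event/queue pair.
inline bool differsByUrContext(const SketchEvt &Event,
                               const SketchQueue &Queue) {
  return Event.Context != Queue.Context;
}

// Revised approach: compares the per-device native contexts instead.
inline bool isCrossDevice(const SketchEvt &Event, const SketchQueue &Queue) {
  return Event.Device->getNativeContext() !=
         Queue.Device->getNativeContext();
}
```

In a setup where two devices share a single UR context, `differsByUrContext` returns false for an event from device 0 and a queue on device 1, while `isCrossDevice` returns true.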

Previous approach with barrier events was corrupting CUDA state.
cuEventRecord is asynchronous, so destroying the event immediately
after recording caused undefined behavior.

Now use simple host synchronization:
- cuEventSynchronize blocks CPU until event completes
- Subsequent enqueues to target stream happen after event completion
- No need for barrier events or additional synchronization

cuEventSynchronize alone is insufficient - it only blocks CPU.
Need to create ordering barrier in target stream so work enqueued
after this call executes after the cross-device event completes.

Use cuStreamSynchronize to establish this barrier.
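The resulting wait logic, as these commit messages describe it, can be sketched as a small dispatch that records which driver calls would be issued. Everything here is a stand-in: `WaitEvent`, `WaitStream`, `CallLog`, and `waitForEvent` are hypothetical, no real CUDA calls are made, and calling cuStreamSynchronize on the target stream is an assumption, since the commit message does not name the stream.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; ContextId plays the role of the native CUcontext.
struct WaitEvent {
  int ContextId;
};
struct WaitStream {
  int ContextId;
};

// Ordered record of the driver calls the real code would issue.
using CallLog = std::vector<std::string>;

inline void waitForEvent(const WaitEvent &Event, const WaitStream &Target,
                         CallLog &Log) {
  if (Event.ContextId == Target.ContextId) {
    // Same device: asynchronous stream-side wait, host is not blocked.
    Log.push_back("cuStreamWaitEvent(target, event)");
    return;
  }
  // Cross device: block the host until the event completes, then
  // synchronize the target stream (assumed to be the stream the commit
  // message refers to) before more work is enqueued on it.
  Log.push_back("cuEventSynchronize(event)");
  Log.push_back("cuStreamSynchronize(target)");
}
```

The trade-off in this shape is that the cross-device path blocks the calling thread twice, whereas the same-device path stays fully asynchronous.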