[WIP][CUDA][UR]Cuda multi gpu debug by kekaczma · Pull Request #21362 · intel/llvm

kekaczma · 2026-02-25T09:04:04Z

No description provided.

Add detailed logging infrastructure for CUDA adapter to help diagnose multi-GPU issues. Controlled by UR_CUDA_CALL_TRACE=1 environment variable.

Features:
- Automatic logging of all CUDA API calls through UR_CHECK_ERROR macro
- Detailed parameter logging for key operations (kernel launch, memcpy)
- Context switch tracking with addresses
- Success/error result logging
- Zero overhead when disabled (compile-time check)

Logged operations:
- cuCtxGetCurrent/cuCtxSetCurrent with context addresses
- cuLaunchKernel with grid, block, shared memory, stream
- cuMemcpyAsync/cuMemcpyDtoDAsync with src/dst/size
- All other cuXxx functions with full call signature

Usage: UR_CUDA_CALL_TRACE=1 ./test-binary

This enables deep debugging of multi-GPU synchronization, context management, and memory operations without modifying test code.
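As a rough illustration of the call-tracing idea described above, here is a minimal sketch that builds without the CUDA SDK. Everything in it is an assumption made for illustration: `FakeResult`, `fakeMemcpy`, `traceEnabled`, `formatTrace`, `traceCall`, the `TRACED_CALL` macro, and the `[trace]` line format are hypothetical and are not taken from the adapter's `UR_CHECK_ERROR` implementation. Only the `UR_CUDA_CALL_TRACE` variable name comes from the commit message. The sketch checks the variable at run time and caches the answer, which may differ from how the PR achieves its stated overhead.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for CUresult so the sketch builds without the CUDA SDK.
enum FakeResult { FAKE_SUCCESS = 0, FAKE_ERROR_INVALID_VALUE = 1 };

// Hypothetical stand-in for a driver call such as cuMemcpyAsync.
inline FakeResult fakeMemcpy(void *Dst, const void *Src, std::size_t Size) {
  if (Dst == nullptr || Src == nullptr)
    return FAKE_ERROR_INVALID_VALUE;
  std::memcpy(Dst, Src, Size);
  return FAKE_SUCCESS;
}

// Reads UR_CUDA_CALL_TRACE once; later calls only test a cached bool.
inline bool traceEnabled() {
  static const bool Enabled = [] {
    const char *Value = std::getenv("UR_CUDA_CALL_TRACE");
    return Value != nullptr && std::strcmp(Value, "1") == 0;
  }();
  return Enabled;
}

// Builds one trace line from the stringified call and its result.
// The "[trace]" prefix and the layout are invented for this sketch.
inline std::string formatTrace(const char *CallText, FakeResult Result) {
  std::ostringstream Line;
  Line << "[trace] " << CallText << " -> ";
  if (Result == FAKE_SUCCESS)
    Line << "SUCCESS";
  else
    Line << "ERROR " << static_cast<int>(Result);
  return Line.str();
}

// Logs the call to stderr when tracing is on, then passes the result through.
inline FakeResult traceCall(const char *CallText, FakeResult Result) {
  if (traceEnabled())
    std::cerr << formatTrace(CallText, Result) << '\n';
  return Result;
}

// Stringifies the call expression so the log shows the full call signature.
#define TRACED_CALL(Call) traceCall(#Call, (Call))
```

Wrapping a call as `TRACED_CALL(fakeMemcpy(Dst, Src, 4))` returns the call's result unchanged, and prints the stringified call with its outcome only when the process was started with `UR_CUDA_CALL_TRACE=1`.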

- Fix ScopedContext to restore original context in destructor
- Detect cross-device events in enqueueEventsWait
- Use cuEventSynchronize for cross-device (host sync)
- Keep cuStreamWaitEvent for same-device (async stream sync)

This fixes urEnqueueKernelLaunchIncrementMultiDeviceTest failure where cuStreamWaitEvent cannot synchronize streams from different devices.

- Fix ScopedContext to handle nullptr Device without throwing
- Only restore context in destructor if original was non-null
- Remove CUDA skip from urEnqueueKernelLaunchIncrementMultiDeviceTest
- Remove unnecessary P2P support check (P2P is optimization, not requirement)
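The ScopedContext behaviour described in these commits (switch on construction, tolerate a null device, restore in the destructor only when the original context was non-null) can be sketched as a small RAII guard. This is a mock under stated assumptions, not the adapter's class: `FakeCtx`, `FakeDevice`, `fakeCtxGetCurrent`, `fakeCtxSetCurrent`, and `ScopedContextSketch` are hypothetical names, and a `thread_local` pointer stands in for the per-thread current context that `cuCtxGetCurrent`/`cuCtxSetCurrent` operate on.

```cpp
// Hypothetical stand-in for a CUcontext.
struct FakeCtx {};

// Stand-in for the driver's per-thread current context.
inline FakeCtx *&currentCtxSlot() {
  thread_local FakeCtx *Current = nullptr;
  return Current;
}
inline FakeCtx *fakeCtxGetCurrent() { return currentCtxSlot(); }
inline void fakeCtxSetCurrent(FakeCtx *Ctx) { currentCtxSlot() = Ctx; }

// Hypothetical device that owns one native context.
struct FakeDevice {
  FakeCtx *NativeContext;
};

class ScopedContextSketch {
  FakeCtx *Original;     // context that was current before the guard
  bool Switched = false; // true only if the guard changed the context

public:
  explicit ScopedContextSketch(const FakeDevice *Device)
      : Original(fakeCtxGetCurrent()) {
    if (Device == nullptr)
      return; // a null device is tolerated: leave the context untouched
    if (Device->NativeContext != Original) {
      fakeCtxSetCurrent(Device->NativeContext);
      Switched = true;
    }
  }

  ~ScopedContextSketch() {
    // Restore only if we switched and there was a context to go back to.
    if (Switched && Original != nullptr)
      fakeCtxSetCurrent(Original);
  }

  ScopedContextSketch(const ScopedContextSketch &) = delete;
  ScopedContextSketch &operator=(const ScopedContextSketch &) = delete;
};
```

One consequence of the "only restore if non-null" rule in this sketch: a thread that had no current context before the guard keeps the device's context after the guard is destroyed.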

- Changed detection from Device to Context comparison
- After cuEventSynchronize, record barrier event in target stream
- This establishes ordering for future operations in the stream

Previous approach compared UR context handles, which always matched since test wraps all devices in single ur_context_handle_t. Now compare underlying CUcontext via Device->getNativeContext(). Each physical device has unique CUcontext, enabling proper detection of cross-device event synchronization scenarios.
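To make the detection change concrete, the following mock shows why comparing UR context handles cannot distinguish devices that share one `ur_context_handle_t`, while comparing per-device native contexts can. It follows the commit messages' description only; all types and functions here (`FakeNativeContext`, `FakeUrContext`, `FakeDevice`, `FakeQueue`, `FakeEvent`, `WaitKind`, `sameUrContext`, `chooseWait`) are hypothetical stand-ins, and the real adapter would issue `cuStreamWaitEvent` or `cuEventSynchronize` where this sketch merely returns an enum.

```cpp
// Hypothetical stand-ins; the real adapter works with CUcontext and UR handles.
struct FakeNativeContext {};
struct FakeUrContext {};

struct FakeDevice {
  FakeNativeContext *NativeContext;
  FakeNativeContext *getNativeContext() const { return NativeContext; }
};

struct FakeQueue {
  FakeUrContext *UrContext;
  FakeDevice *Device;
};

struct FakeEvent {
  FakeUrContext *UrContext;
  FakeDevice *Device;
};

// Which wait primitive the enqueue path would pick.
enum class WaitKind { StreamWait, HostSync };

// The earlier check: true whenever both objects live in the same UR context,
// which is always the case when one UR context wraps every device.
inline bool sameUrContext(const FakeQueue &Queue, const FakeEvent &Event) {
  return Queue.UrContext == Event.UrContext;
}

// The later check: compare the native context of each object's device.
inline WaitKind chooseWait(const FakeQueue &Queue, const FakeEvent &Event) {
  const bool CrossDevice =
      Queue.Device->getNativeContext() != Event.Device->getNativeContext();
  return CrossDevice ? WaitKind::HostSync : WaitKind::StreamWait;
}
```

With two devices wrapped in one UR context, `sameUrContext` reports a match for an event from the other device, whereas `chooseWait` selects the host-sync path for it and the stream-wait path for an event from the queue's own device.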

Previous approach with barrier events was corrupting CUDA state. cuEventRecord is asynchronous, so destroying the event immediately after recording caused undefined behavior.

Now use simple host synchronization:
- cuEventSynchronize blocks CPU until event completes
- Subsequent enqueues to target stream happen after event completion
- No need for barrier events or additional synchronization

cuEventSynchronize alone is insufficient - it only blocks CPU. Need to create ordering barrier in target stream so work enqueued after this call executes after the cross-device event completes. Use cuStreamSynchronize to establish this barrier.

kekaczma added 4 commits February 24, 2026 12:39

Fix cross-device event synchronization with barrier event

0ab2a88

- Changed detection from Device to Context comparison
- After cuEventSynchronize, record barrier event in target stream
- This establishes ordering for future operations in the stream

kekaczma force-pushed the cuda-multi-gpu-debug branch from ffb8612 to 0ab2a88 February 25, 2026 12:01

kekaczma added 3 commits February 25, 2026 15:07

kekaczma wants to merge 7 commits into sycl from cuda-multi-gpu-debug
