Add detailed logging infrastructure for the CUDA adapter to help diagnose multi-GPU issues. Controlled by the UR_CUDA_CALL_TRACE=1 environment variable.

Features:
- Automatic logging of all CUDA API calls through the UR_CHECK_ERROR macro
- Detailed parameter logging for key operations (kernel launch, memcpy)
- Context switch tracking with addresses
- Success/error result logging
- Zero overhead when disabled (compile-time check)

Logged operations:
- cuCtxGetCurrent/cuCtxSetCurrent with context addresses
- cuLaunchKernel with grid, block, shared memory, and stream
- cuMemcpyAsync/cuMemcpyDtoDAsync with src/dst/size
- All other cuXxx functions with the full call signature

Usage: UR_CUDA_CALL_TRACE=1 ./test-binary

This enables deep debugging of multi-GPU synchronization, context management, and memory operations without modifying test code.
- Fix ScopedContext to restore the original context in its destructor
- Detect cross-device events in enqueueEventsWait
- Use cuEventSynchronize for cross-device events (host sync)
- Keep cuStreamWaitEvent for same-device events (async stream sync)

This fixes the urEnqueueKernelLaunchIncrementMultiDeviceTest failure where cuStreamWaitEvent cannot synchronize streams from different devices.
- Fix ScopedContext to handle a nullptr Device without throwing
- Only restore the context in the destructor if the original was non-null
- Remove the CUDA skip from urEnqueueKernelLaunchIncrementMultiDeviceTest
- Remove the unnecessary P2P support check (P2P is an optimization, not a requirement)
- Changed detection from Device to Context comparison
- After cuEventSynchronize, record a barrier event in the target stream
- This establishes ordering for future operations in the stream
Force-pushed from ffb8612 to 0ab2a88.
The previous approach compared UR context handles, which always matched because the test wraps all devices in a single ur_context_handle_t. Now compare the underlying CUcontext via Device->getNativeContext(). Each physical device has a unique CUcontext, enabling proper detection of cross-device event synchronization scenarios.
The previous approach with barrier events was corrupting CUDA state: cuEventRecord is asynchronous, so destroying the event immediately after recording it caused undefined behavior. Now use simple host synchronization:
- cuEventSynchronize blocks the CPU until the event completes
- Subsequent enqueues to the target stream therefore happen after the event completes
- No need for barrier events or additional synchronization
cuEventSynchronize alone is insufficient: it only blocks the CPU. We also need an ordering barrier in the target stream so that work enqueued after this call executes after the cross-device event completes. Use cuStreamSynchronize to establish this barrier.