[WIP][CUDA][UR] Track USM allocation metadata for cross-device operations #21333
Draft
Problem: In multi-device contexts, each device has its own primary CUDA context. When USM memory allocated on device A is accessed from a queue on device B, using cuMemcpyAsync fails because the stream belongs to context B but operates on memory from context A.

Root cause:
- urUSMSharedAlloc/urUSMDeviceAlloc allocate memory in device-specific contexts
- urEnqueueUSMMemcpy receives pointers without knowing their origin device
- Cross-context operations require explicit cuMemcpyPeerAsync with both contexts

Solution: Track allocation metadata in ur_context to record which device allocated each USM pointer. In urEnqueueUSMMemcpy, query this metadata to detect cross-device copies and use cuMemcpyPeerAsync with explicit source and destination contexts.

Changes:
- Add AllocationMetadata map to ur_context_handle_t with thread-safe access
- Register allocations in urUSMDeviceAlloc and urUSMSharedAlloc
- Unregister in urUSMFree
- Query metadata in urEnqueueUSMMemcpy to detect cross-device copies
- Use cuMemcpyPeerAsync for cross-device, cuMemcpyAsync otherwise

This is a clean, O(1) solution that correctly handles cross-context operations without trial-and-error approaches.

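The tracking scheme described in this commit message can be sketched roughly as follows. This is a minimal, driver-free illustration and not the PR's actual code: `AllocationRegistry`, `AllocationInfo`, `DeviceIndex`, and `isCrossDeviceCopy` are hypothetical names, and a plain integer stands in for the device handle the adapter would store.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a device handle (assumption, not from the PR).
using DeviceIndex = int;

struct AllocationInfo {
  DeviceIndex Device; // device whose context performed the allocation
  std::size_t Size;   // allocation size in bytes
};

// Illustrative registry mapping USM base pointers to their owning device.
class AllocationRegistry {
  mutable std::mutex Mutex;
  std::unordered_map<const void *, AllocationInfo> Map;

public:
  // Would be called from the allocation entry points. Null pointers are
  // ignored so a failed allocation never pollutes the map.
  void registerAllocation(const void *Ptr, DeviceIndex Device,
                          std::size_t Size) {
    assert(Device >= 0 && "device index must be valid");
    if (!Ptr)
      return;
    std::lock_guard<std::mutex> Lock(Mutex);
    Map[Ptr] = AllocationInfo{Device, Size};
  }

  // Would be called from the free entry point.
  void unregisterAllocation(const void *Ptr) {
    std::lock_guard<std::mutex> Lock(Mutex);
    Map.erase(Ptr);
  }

  // Average O(1) lookup by base pointer. Interior pointers (base + offset)
  // are NOT resolved by this sketch.
  bool lookup(const void *Ptr, AllocationInfo &Out) const {
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Map.find(Ptr);
    if (It == Map.end())
      return false;
    Out = It->second;
    return true;
  }
};

// A copy is treated as cross-device only when both pointers are tracked and
// their owning devices differ. Untracked pointers (for example host memory)
// take the ordinary same-context path.
inline bool isCrossDeviceCopy(const AllocationRegistry &Reg, const void *Dst,
                              const void *Src) {
  AllocationInfo D{}, S{};
  return Reg.lookup(Dst, D) && Reg.lookup(Src, S) && D.Device != S.Device;
}
```

One design consequence worth noting: a hash map keyed on the base pointer gives the O(1) lookup the commit message mentions, but a copy that starts at an offset into an allocation would need an ordered map or a range search instead.
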
- Add metadata tracking to ur_context to map USM pointers to devices
- Use cuMemcpyPeerAsync for cross-device USM copies
- Enable urEnqueueKernelLaunchIncrementMultiDeviceTest for CUDA

This fixes cross-device USM operations where cuMemcpyAsync silently fails when source and destination pointers belong to different CUDA contexts. Each device has its own primary context, so we track which device allocated each pointer and use cuMemcpyPeerAsync when needed.

- Try cuMemcpyPeerAsync first for explicit cross-device intent
- Fall back to cuMemcpyAsync if peer copy fails (e.g., no P2P support)
- cuMemcpyAsync works for managed memory due to automatic migration
- Add null checks for safety in allocation registration

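The try-then-fall-back order from this commit can be illustrated without the CUDA driver. In the sketch below, `CopyFn`, `CopyStatus`, `CopyPath`, and `copyWithFallback` are hypothetical names, and the two callables merely stand in for the peer and plain copy calls; later commits in this PR move away from this approach.

```cpp
#include <cassert>
#include <functional>

enum class CopyStatus { Success, Failure };
enum class CopyPath { Peer, Plain, None };

// Hypothetical callables standing in for the two driver copy calls.
using CopyFn = std::function<CopyStatus()>;

// Attempt the peer copy first; only if it reports failure is the plain
// copy attempted. The return value records which path succeeded.
inline CopyPath copyWithFallback(const CopyFn &PeerCopy,
                                 const CopyFn &PlainCopy) {
  assert(PeerCopy && PlainCopy && "both callables must be set");
  if (PeerCopy() == CopyStatus::Success)
    return CopyPath::Peer;
  if (PlainCopy() == CopyStatus::Success)
    return CopyPath::Plain;
  return CopyPath::None;
}
```

Note that the plain copy is never issued when the peer copy succeeds, so the fallback costs nothing on systems with working P2P.
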
- For cross-device copies, synchronize stream then use cuMemcpy
- Managed memory (USM Shared) requires sync for proper migration
- Stream from queue's device context cannot do async peer operations
- cuMemcpy handles managed memory migration automatically

The previous synchronous cuMemcpy approach failed because it cannot properly handle cross-device copies even in synchronous mode. cuMemcpyPeer explicitly takes source and destination contexts as parameters and is designed for peer-to-peer copies between different device contexts. This works for both USM Device and USM Shared memory.

The stream is synchronized before calling cuMemcpyPeer because:
1. cuMemcpyPeer is synchronous (blocks until complete)
2. We need to ensure all pending operations in the stream finish first

For CUDA Managed Memory (CU_MEMORYTYPE_UNIFIED), use prefetch hints instead of relying solely on automatic migration:
1. Prefetch destination to queue's device before copy
2. Perform cuMemcpyAsync
3. Subsequent kernel access on other device will trigger migration

Also properly handle Device memory cross-device with cuMemcpyPeerAsync.

CUDA Managed Memory (USM Shared) does not support explicit cross-device copies between separate per-device allocations. NVIDIA documentation shows Managed Memory as a single shared buffer with automatic migration. For multi-GPU tests on CUDA, use USM Device memory which supports cudaMemcpyPeer for peer-to-peer transfers, as documented in CUDA Programming Guide section 3.4.2.1.
For NVIDIA A2 GPUs (GA107 chip) which lack P2P support:
- Prefetch both SRC and DST Managed Memory to CPU before copy
- CUDA driver automatically stages through host: GPU0→CPU→GPU1
- Detect cross-device copies using CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
- Fix unused variable warning

This enables multi-GPU tests to work on entry-level datacenter GPUs without NVLink/P2P, at reduced performance (host staging overhead).

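Read together, the last few commits suggest a dispatch on device ordinal and memory kind. The sketch below is one possible reading, not the PR's code: `PointerInfo`, `MemKind`, `CopyStrategy`, and `chooseStrategy` are hypothetical names, the ordinal is assumed to be what CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL would report, and routing a mixed Device/Managed pair through host staging is an assumption the commit messages do not spell out.

```cpp
#include <cassert>

enum class MemKind { Device, Managed };

enum class CopyStrategy {
  PlainAsync,        // same device: ordinary asynchronous copy
  PeerAsync,         // device memory on two devices: explicit peer copy
  HostStagedManaged  // managed memory across devices: prefetch to CPU first
};

// Hypothetical per-pointer summary of what a pointer-attribute query
// would return.
struct PointerInfo {
  int DeviceOrdinal;
  MemKind Kind;
};

inline CopyStrategy chooseStrategy(const PointerInfo &Src,
                                   const PointerInfo &Dst) {
  assert(Src.DeviceOrdinal >= 0 && Dst.DeviceOrdinal >= 0);
  // Same ordinal means no cross-device handling is needed.
  if (Src.DeviceOrdinal == Dst.DeviceOrdinal)
    return CopyStrategy::PlainAsync;
  // Assumption: any managed operand is staged through the host.
  if (Src.Kind == MemKind::Managed || Dst.Kind == MemKind::Managed)
    return CopyStrategy::HostStagedManaged;
  return CopyStrategy::PeerAsync;
}
```

Keeping the decision in a pure function like this makes each branch unit-testable without a multi-GPU machine.
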