[WIP][CUDA][UR] Track USM allocation metadata for cross-device operations #21333

Draft
kekaczma wants to merge 20 commits into sycl from usm-metadata-tracking

@kekaczma (Contributor)

Problem:
In multi-device contexts, each device has its own primary CUDA context. When USM memory allocated on device A is accessed from a queue on device B, using cuMemcpyAsync fails because the stream belongs to context B but operates on memory from context A.

Root cause:

  • urUSMSharedAlloc/urUSMDeviceAlloc allocate memory in device-specific contexts
  • urEnqueueUSMMemcpy receives pointers without knowing their origin device
  • Cross-context operations require explicit cuMemcpyPeerAsync with both contexts

Solution:
Track allocation metadata in ur_context to record which device allocated each USM pointer. In urEnqueueUSMMemcpy, query this metadata to detect cross-device copies and use cuMemcpyPeerAsync with explicit source and destination contexts.

Changes:

  • Add AllocationMetadata map to ur_context_handle_t with thread-safe access
  • Register allocations in urUSMDeviceAlloc and urUSMSharedAlloc
  • Unregister in urUSMFree
  • Query metadata in urEnqueueUSMMemcpy to detect cross-device copies
  • Use cuMemcpyPeerAsync for cross-device, cuMemcpyAsync otherwise

This is a clean, O(1) solution that correctly handles cross-context operations without trial-and-error approaches.
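
The registration and lookup steps listed under Changes could look roughly like the sketch below. It is not the code from this PR: `AllocationRegistry`, `AllocationInfo`, `DeviceId`, and `isCrossDeviceCopy` are hypothetical names, and a plain `int` stands in for the device handle. The sketch also assumes an ordered `std::map` so that interior pointers (base + offset) resolve to their allocation; that lookup is O(log n). A hash map keyed on base pointers would give the average O(1) lookup mentioned above, but only for exact base addresses. Requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>

// Hypothetical stand-in for a device handle; a real adapter would store
// ur_device_handle_t or a CUDA device ordinal here.
using DeviceId = int;

struct AllocationInfo {
  std::size_t Size; // allocation size in bytes
  DeviceId Device;  // device that performed the allocation
};

// Illustrative registry, assumed to live inside the context object.
class AllocationRegistry {
public:
  // Would be called from the USM allocation entry points.
  void registerAllocation(const void *Ptr, std::size_t Size, DeviceId Device) {
    if (Ptr == nullptr || Size == 0)
      return; // nothing to track
    std::lock_guard<std::mutex> Lock(Mutex);
    Allocations[reinterpret_cast<std::uintptr_t>(Ptr)] = {Size, Device};
  }

  // Would be called from the USM free entry point.
  void unregisterAllocation(const void *Ptr) {
    std::lock_guard<std::mutex> Lock(Mutex);
    Allocations.erase(reinterpret_cast<std::uintptr_t>(Ptr));
  }

  // Returns the owning device for any address inside a tracked allocation.
  std::optional<DeviceId> findDevice(const void *Ptr) const {
    const auto Addr = reinterpret_cast<std::uintptr_t>(Ptr);
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Allocations.upper_bound(Addr); // first base address > Addr
    if (It == Allocations.begin())
      return std::nullopt;
    --It; // greatest base address <= Addr
    if (Addr - It->first < It->second.Size)
      return It->second.Device;
    return std::nullopt;
  }

private:
  mutable std::mutex Mutex;
  std::map<std::uintptr_t, AllocationInfo> Allocations;
};

// A copy is cross-device only when both pointers are tracked and their
// owning devices differ; untracked pointers fall back to the normal path.
inline bool isCrossDeviceCopy(const AllocationRegistry &Registry,
                              const void *Dst, const void *Src) {
  const auto DstDev = Registry.findDevice(Dst);
  const auto SrcDev = Registry.findDevice(Src);
  return DstDev && SrcDev && *DstDev != *SrcDev;
}
```

In this sketch the memcpy entry point would call `isCrossDeviceCopy` first and pick cuMemcpyPeerAsync or cuMemcpyAsync based on the result.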

@kekaczma kekaczma changed the title cuda: Track USM allocation metadata for cross-device operations [WIP]{CUDA][UR] Track USM allocation metadata for cross-device operations Feb 20, 2026
@kekaczma kekaczma changed the title [WIP]{CUDA][UR] Track USM allocation metadata for cross-device operations [WIP][CUDA][UR] Track USM allocation metadata for cross-device operations Feb 20, 2026

- Add metadata tracking to ur_context to map USM pointers to devices
- Use cuMemcpyPeerAsync for cross-device USM copies
- Enable urEnqueueKernelLaunchIncrementMultiDeviceTest for CUDA

This fixes cross-device USM operations where cuMemcpyAsync silently
fails when source and destination pointers belong to different CUDA
contexts. Each device has its own primary context, so we track which
device allocated each pointer and use cuMemcpyPeerAsync when needed.

- Try cuMemcpyPeerAsync first for explicit cross-device intent
- Fall back to cuMemcpyAsync if peer copy fails (e.g., no P2P support)
- cuMemcpyAsync works for managed memory due to automatic migration
- Add null checks for safety in allocation registration

- For cross-device copies, synchronize stream then use cuMemcpy
- Managed memory (USM Shared) requires sync for proper migration
- Stream from queue's device context cannot do async peer operations
- cuMemcpy handles managed memory migration automatically

The previous synchronous cuMemcpy approach failed because it cannot
properly handle cross-device copies even in synchronous mode.

cuMemcpyPeer explicitly takes source and destination contexts as
parameters and is designed for peer-to-peer copies between different
device contexts. This works for both USM Device and USM Shared memory.

The stream is synchronized before calling cuMemcpyPeer because:
1. cuMemcpyPeer is synchronous (blocks until complete)
2. We need to ensure all pending operations in the stream finish first

For CUDA Managed Memory (CU_MEMORYTYPE_UNIFIED), use prefetch hints
instead of relying solely on automatic migration:

1. Prefetch destination to queue's device before copy
2. Perform cuMemcpyAsync
3. Subsequent kernel access on other device will trigger migration

Also properly handle Device memory cross-device with cuMemcpyPeerAsync.

CUDA Managed Memory (USM Shared) does not support explicit cross-device
copies between separate per-device allocations. NVIDIA documentation shows
Managed Memory as a single shared buffer with automatic migration.

For multi-GPU tests on CUDA, use USM Device memory which supports
cudaMemcpyPeer for peer-to-peer transfers, as documented in CUDA
Programming Guide section 3.4.2.1.

For NVIDIA A2 GPUs (GA107 chip) which lack P2P support:
- Prefetch both SRC and DST Managed Memory to CPU before copy
- CUDA driver automatically stages through host: GPU0→CPU→GPU1
- Detect cross-device copies using CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
- Fix unused variable warning

This enables multi-GPU tests to work on entry-level datacenter GPUs
without NVLink/P2P, at reduced performance (host staging overhead).
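
Read together, the later commit messages suggest a dispatch along the lines of the sketch below. This is one interpretation, not the adapter's code: `MemKind`, `CopyPath`, `PtrInfo`, and `selectCopyPath` are hypothetical names. The device ordinal is assumed to come from a CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL query as described above; how the memory kind is determined is left out. The commits do not say what happens when only one side is managed memory, so treating that case like the managed-to-managed case is an assumption.

```cpp
#include <cassert>

// USM Device vs. USM Shared (CUDA managed) memory.
enum class MemKind { Device, Managed };

// Copy strategies named in the commit messages.
enum class CopyPath {
  Async,          // cuMemcpyAsync on the queue's stream
  PeerAsync,      // cuMemcpyPeerAsync with explicit src/dst contexts
  HostStagedAsync // prefetch managed memory to the CPU, then cuMemcpyAsync
};

// Hypothetical per-pointer facts gathered before the copy is enqueued.
struct PtrInfo {
  MemKind Kind;
  int DeviceOrdinal; // e.g. from CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
};

inline CopyPath selectCopyPath(const PtrInfo &Src, const PtrInfo &Dst) {
  // Same device: the queue's stream can handle the copy directly.
  if (Src.DeviceOrdinal == Dst.DeviceOrdinal)
    return CopyPath::Async;

  // Cross-device, both sides USM Device: explicit peer copy.
  if (Src.Kind == MemKind::Device && Dst.Kind == MemKind::Device)
    return CopyPath::PeerAsync;

  // Cross-device with managed memory involved: stage through the host.
  // Assumption: mixed Device/Managed pairs take this path as well.
  return CopyPath::HostStagedAsync;
}
```

Keeping the decision in a pure function like this would let the strategy be unit-tested without a multi-GPU machine; only the code that issues the driver calls would need real hardware.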