[WIP][CUDA][UR] Track USM allocation metadata for cross-device operations by kekaczma · Pull Request #21333 · intel/llvm

kekaczma · 2026-02-20T10:47:44Z

Problem:
In multi-device contexts, each device has its own primary CUDA context. When USM memory allocated on device A is accessed from a queue on device B, using cuMemcpyAsync fails because the stream belongs to context B but operates on memory from context A.

Root cause:

- urUSMSharedAlloc/urUSMDeviceAlloc allocate memory in device-specific contexts
- urEnqueueUSMMemcpy receives pointers without knowing their origin device
- Cross-context operations require explicit cuMemcpyPeerAsync with both contexts

Solution:
Track allocation metadata in ur_context to record which device allocated each USM pointer. In urEnqueueUSMMemcpy, query this metadata to detect cross-device copies and use cuMemcpyPeerAsync with explicit source and destination contexts.

Changes:

- Add AllocationMetadata map to ur_context_handle_t with thread-safe access
- Register allocations in urUSMDeviceAlloc and urUSMSharedAlloc
- Unregister in urUSMFree
- Query metadata in urEnqueueUSMMemcpy to detect cross-device copies
- Use cuMemcpyPeerAsync for cross-device, cuMemcpyAsync otherwise

This is a clean, O(1) solution that correctly handles cross-context operations without trial-and-error approaches.
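To make the tracking idea concrete, here is a minimal sketch of what such a registry could look like. It is not the code from this PR: `AllocationRegistry`, `DeviceId`, `kUnknownDevice` and `isCrossDeviceCopy` are hypothetical names, and a plain `int` stands in for the device handle that the adapter would presumably store in `ur_context_handle_t`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical stand-in for a device handle; not a UR type.
using DeviceId = int;
constexpr DeviceId kUnknownDevice = -1;

// Hypothetical registry mapping USM address ranges to the allocating device.
class AllocationRegistry {
  struct Entry {
    std::size_t size;
    DeviceId device;
  };

  mutable std::mutex mutex_;
  // Keyed by base address so interior pointers can be resolved by range.
  std::map<std::uintptr_t, Entry> allocations_;

public:
  // Record that [ptr, ptr + size) was allocated on `device`.
  void registerAllocation(const void *ptr, std::size_t size, DeviceId device) {
    if (ptr == nullptr || size == 0)
      return; // nothing to track
    std::lock_guard<std::mutex> lock(mutex_);
    allocations_[reinterpret_cast<std::uintptr_t>(ptr)] = Entry{size, device};
  }

  // Forget an allocation; `ptr` must be the base pointer that was registered.
  void unregisterAllocation(const void *ptr) {
    std::lock_guard<std::mutex> lock(mutex_);
    allocations_.erase(reinterpret_cast<std::uintptr_t>(ptr));
  }

  // Return the device owning `ptr`, or kUnknownDevice if it is not tracked.
  DeviceId findDevice(const void *ptr) const {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = allocations_.upper_bound(addr); // first base address > addr
    if (it == allocations_.begin())
      return kUnknownDevice;
    --it; // last base address <= addr
    if (addr - it->first < it->second.size)
      return it->second.device;
    return kUnknownDevice;
  }
};

// A copy counts as cross-device only when both owners are known and differ.
inline bool isCrossDeviceCopy(const AllocationRegistry &registry,
                              const void *dst, const void *src) {
  const DeviceId dstDevice = registry.findDevice(dst);
  const DeviceId srcDevice = registry.findDevice(src);
  return dstDevice != kUnknownDevice && srcDevice != kUnknownDevice &&
         dstDevice != srcDevice;
}
```

This sketch uses an ordered map so that pointers into the middle of an allocation resolve to the right device, which makes each lookup O(log n). A hash map keyed by the base pointer would give the O(1) lookup described above, but would only recognise base pointers.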

- Add metadata tracking to ur_context to map USM pointers to devices
- Use cuMemcpyPeerAsync for cross-device USM copies
- Enable urEnqueueKernelLaunchIncrementMultiDeviceTest for CUDA

This fixes cross-device USM operations where cuMemcpyAsync silently fails when source and destination pointers belong to different CUDA contexts. Each device has its own primary context, so we track which device allocated each pointer and use cuMemcpyPeerAsync when needed.

- Try cuMemcpyPeerAsync first for explicit cross-device intent
- Fall back to cuMemcpyAsync if peer copy fails (e.g., no P2P support)
- cuMemcpyAsync works for managed memory due to automatic migration
- Add null checks for safety in allocation registration

- For cross-device copies, synchronize stream then use cuMemcpy
- Managed memory (USM Shared) requires sync for proper migration
- Stream from queue's device context cannot do async peer operations
- cuMemcpy handles managed memory migration automatically

The previous synchronous cuMemcpy approach failed because it cannot properly handle cross-device copies even in synchronous mode. cuMemcpyPeer explicitly takes source and destination contexts as parameters and is designed for peer-to-peer copies between different device contexts. This works for both USM Device and USM Shared memory.

The stream is synchronized before calling cuMemcpyPeer because:
1. cuMemcpyPeer is synchronous (blocks until complete)
2. We need to ensure all pending operations in the stream finish first

For CUDA Managed Memory (CU_MEMORYTYPE_UNIFIED), use prefetch hints instead of relying solely on automatic migration:
1. Prefetch destination to queue's device before copy
2. Perform cuMemcpyAsync
3. Subsequent kernel access on other device will trigger migration

Also properly handle Device memory cross-device with cuMemcpyPeerAsync.

CUDA Managed Memory (USM Shared) does not support explicit cross-device copies between separate per-device allocations. NVIDIA documentation shows Managed Memory as a single shared buffer with automatic migration. For multi-GPU tests on CUDA, use USM Device memory which supports cudaMemcpyPeer for peer-to-peer transfers, as documented in CUDA Programming Guide section 3.4.2.1.

For NVIDIA A2 GPUs (GA107 chip) which lack P2P support:
- Prefetch both SRC and DST Managed Memory to CPU before copy
- CUDA driver automatically stages through host: GPU0→CPU→GPU1
- Detect cross-device copies using CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
- Fix unused variable warning

This enables multi-GPU tests to work on entry-level datacenter GPUs without NVLink/P2P, at reduced performance (host staging overhead).
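The last few commit messages describe several copy paths that depend on memory type and P2P availability. The sketch below is one possible reading of them as a pure decision function; it is not the adapter's actual logic and makes no CUDA calls. `MemKind`, `CopyPath`, `PointerInfo` and `chooseCopyPath` are hypothetical names, and treating a mixed Device/Managed copy like a Managed one is an assumption.

```cpp
// Hypothetical classification of a USM pointer; not a CUDA or UR type.
enum class MemKind { Device, Managed };

// Hypothetical labels for the copy strategies named in the commit messages.
enum class CopyPath {
  SameDeviceAsync,           // plain cuMemcpyAsync on the queue's stream
  PeerAsync,                 // cuMemcpyPeerAsync between device allocations
  PrefetchToDeviceThenAsync, // prefetch managed memory, then cuMemcpyAsync
  PrefetchToHostThenAsync,   // no P2P: prefetch to CPU, driver stages via host
};

struct PointerInfo {
  MemKind kind;
  int deviceOrdinal; // e.g. from CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
};

// Pick a copy strategy from pointer metadata and P2P capability.
inline CopyPath chooseCopyPath(const PointerInfo &src, const PointerInfo &dst,
                               bool peerAccessSupported) {
  if (src.deviceOrdinal == dst.deviceOrdinal)
    return CopyPath::SameDeviceAsync;

  const bool anyManaged =
      src.kind == MemKind::Managed || dst.kind == MemKind::Managed;
  if (!anyManaged)
    return CopyPath::PeerAsync;

  // Assumption: a mixed Device/Managed copy follows the Managed path.
  return peerAccessSupported ? CopyPath::PrefetchToDeviceThenAsync
                             : CopyPath::PrefetchToHostThenAsync;
}
```

Keeping the decision separate from the driver calls would let the strategy be unit-tested on machines without multiple GPUs, which may be useful given that the behaviour differs between P2P-capable and non-P2P hardware.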

kekaczma force-pushed the usm-metadata-tracking branch from 5242f44 to 5ac82b3 on February 20, 2026 10:48

kekaczma force-pushed the usm-metadata-tracking branch from 5ac82b3 to 66c6a4c on February 20, 2026 10:50

kekaczma changed the title ~~cuda: Track USM allocation metadata for cross-device operations~~ [WIP]{CUDA][UR] Track USM allocation metadata for cross-device operations Feb 20, 2026

kekaczma changed the title ~~[WIP]{CUDA][UR] Track USM allocation metadata for cross-device operations~~ [WIP][CUDA][UR] Track USM allocation metadata for cross-device operations Feb 20, 2026

[CUDA] Fix unused parameter warning in urUSMSharedAlloc

3aab64b

kekaczma force-pushed the usm-metadata-tracking branch from 9162cc4 to 15ef1e0 on February 23, 2026 17:10

kekaczma force-pushed the usm-metadata-tracking branch from 6c55451 to 21ff7a5 on February 24, 2026 07:50

kekaczma force-pushed the usm-metadata-tracking branch from e017e21 to c5a1646 on February 24, 2026 09:12

kekaczma wants to merge 20 commits into sycl from usm-metadata-tracking
