Is this a duplicate?
Type of Bug
Runtime Error
Component
cuda.core
Describe the bug
When DeviceMemoryResource is created without options (e.g., DeviceMemoryResource(dev)), it wraps the default device memory pool. This is a non-owned pool that is shared across all such references.
Currently, cuda.core initializes the internal _peer_accessible_by tracking variable to () (empty tuple), assuming no peer access. However, the actual driver-side peer access state of the default pool may differ if:
- Other code has modified peer access on the shared default pool
- Previous operations in the same process modified peer access and didn't clean up
- The Python tests use the shared pool and leave it in a modified state
This causes a mismatch between cuda.core's tracked state and the actual driver state, leading to:
- Incorrect peer_accessible_by property values
- Unexpected behavior when setting peer access (no-op if we think we're already in the target state)
- Test failures that depend on a clean initial peer access state
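The mismatch above can be illustrated with a minimal, pure-Python sketch (no GPU required). The class and attribute names here are illustrative stand-ins, not the actual cuda.core implementation; the point is that two wrappers around one shared pool each assume a clean () initial state.

```python
class FakeDriverPool:
    """Stands in for the shared default CUDA memory pool (driver-side truth)."""
    def __init__(self):
        self.peer_access = set()

_default_pool = FakeDriverPool()

class MemResource:
    """Mimics the current behavior: tracked state always starts at ()."""
    def __init__(self):
        self.pool = _default_pool          # every instance shares the pool
        self._peer_accessible_by = ()      # assumed clean -- the bug

    @property
    def peer_accessible_by(self):
        return self._peer_accessible_by

    @peer_accessible_by.setter
    def peer_accessible_by(self, devices):
        self.pool.peer_access = set(devices)      # driver state changes
        self._peer_accessible_by = tuple(devices)

a = MemResource()
a.peer_accessible_by = (1,)        # driver-side state is now {1}
b = MemResource()                  # new wrapper, same shared pool
print(b.peer_accessible_by)        # () -- stale, disagrees with the driver
print(sorted(b.pool.peer_access))  # [1] -- actual driver-side state
```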
How to Reproduce
from cuda.core import Device, DeviceMemoryResource
dev = Device(0)
# Create a DMR with the default pool and modify peer access
dmr1 = DeviceMemoryResource(dev)
dmr1.peer_accessible_by = (1,) # Enable peer access for device 1
# Create another DMR with the same default pool
dmr2 = DeviceMemoryResource(dev)
# dmr2._peer_accessible_by is (), but actual driver state has peer access for device 1
print(dmr2.peer_accessible_by) # Returns () -- WRONG! Should reflect actual state
Expected behavior
When wrapping a non-owned pool (the default device memory pool), DeviceMemoryResource should lazily query the CUDA driver to determine the actual peer access state.
This could be done with cuMemPoolGetAccess, querying each peer device's access permissions either when the resource is initialized or on first access to the property.
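A sketch of the proposed query logic, with the driver call abstracted behind a callable so it can run without a GPU. In a real fix, query_access(dev) would call cuMemPoolGetAccess with a CUmemLocation for that device and check the returned access flags; all names below are illustrative assumptions, not the cuda.core API.

```python
def query_peer_accessible(query_access, device_count, self_device):
    """Build the tuple of peer device ids whose pool access is enabled.

    query_access(dev) -> bool stands in for a cuMemPoolGetAccess call.
    """
    return tuple(
        dev for dev in range(device_count)
        if dev != self_device and query_access(dev)
    )

# Simulated driver state: peer access enabled for device 1 only.
driver_state = {1}
peers = query_peer_accessible(lambda d: d in driver_state,
                              device_count=4, self_device=0)
print(peers)  # (1,)
```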
Workaround
Use owned pools by specifying options:
from cuda.core import Device, DeviceMemoryResource, DeviceMemoryResourceOptions
dmr = DeviceMemoryResource(dev, DeviceMemoryResourceOptions())
This creates an owned pool with a known clean initial state.
Operating System
N/A (affects all platforms)
nvidia-smi output
N/A