
Improve DLPack support for external tensor consumption#6261

Open
JanuszL wants to merge 1 commit into NVIDIA:main from JanuszL:dynamic_as_batch_fast_path

Conversation

@JanuszL
Contributor

@JanuszL JanuszL commented Mar 18, 2026

Category:

Description:

  • Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
    list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
    builds the TensorList in a single pass, recording a CUDA event on
    the provided stream. Passes stream as a keyword argument to
    __dlpack__() for compatibility with NumPy ≥ 1.22 and JAX which
    define def __dlpack__(self, *, stream=None).

  • Adds native DALI fast path in Batch.__init__: when given a list of
    already-evaluated ndd.Tensor objects, pass their _storage objects
    directly to TensorListGPU/CPU constructors, preserving all DALI
    metadata (layout, enum types) without going through DLPack.

  • Adds GPU DLPack fast path in Batch.__init__: when given a list of
    external GPU tensors (e.g. PyTorch) that support DLPack, use the
    new C++ bulk constructor to avoid per-tensor Python overhead.

  • Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
    catch BufferError from __dlpack__() and fall back to the __array__
    interface. This makes the following work:

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    ndd.as_tensor(arr) # previously raised BufferError
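A minimal sketch of that fallback order, with illustrative names rather than DALI's actual implementation: prefer the zero-copy DLPack path, and on BufferError (or TypeError) fall back to the __array__ interface, which is what makes read-only NumPy arrays work.

```python
import numpy as np

def as_cpu_tensor(obj):
    # Hypothetical helper mirroring the fallback described above.
    # Prefer DLPack; a read-only NumPy array may raise BufferError
    # from __dlpack__(), in which case we use the __array__ interface.
    try:
        return np.from_dlpack(obj)   # zero-copy DLPack path
    except (BufferError, TypeError):
        return np.asarray(obj)       # __array__ fallback

arr = np.array([1, 2, 3])
arr.flags.writeable = False          # read-only input used to raise
t = as_cpu_tensor(arr)               # now handled via the fallback
```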

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
      • extended test_batch.py
      • extended test_tensor.py
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4580

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from 895265b to e432f38 on March 18, 2026 11:01
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46421794]: BUILD STARTED

@greptile-apps
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR improves DLPack support in DALI's experimental dynamic API by adding three fast paths and one fallback: (1) a C++ bulk TensorListGPU constructor that accepts a Python list of DLPack-compatible GPU objects in a single pass, (2) a native DALI fast path in Batch.__init__ that short-circuits directly to backend storage for already-evaluated ndd.Tensor objects, (3) a GPU DLPack fast path in Batch.__init__ that routes PyTorch-style GPU tensors through the new C++ bulk constructor, and (4) a BufferError/TypeError fallback to __array__ in Tensor.__init__ so read-only NumPy arrays can be wrapped without error.

Key points:

  • The C++ bulk constructor correctly passes stream as a keyword argument ("stream"_a = stream_handle) to __dlpack__(), fixing the previously reported positional-argument incompatibility with NumPy ≥ 1.22 and JAX.
  • Device-ID consistency is now validated across all tensors in both TensorListFromListOfDLPackObjects and the GPU path of TensorListFromListOfTensors.
  • The native DALI fast path catches (TypeError, RuntimeError) — the RuntimeError arm is overly broad and will silently swallow genuine CUDA failures (e.g., OOM), causing a confusing slow-path error instead of the original exception.
  • The DLPack GPU fast path only inspects _tensors_list[0].__dlpack_device__() to decide the device type; a device-type mismatch on a later tensor is not caught on the Python side and will produce an opaque C++ error rather than a clear ValueError.
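The stream-keyword point above can be reproduced with a minimal stand-in exporter: NumPy ≥ 1.22 and JAX declare stream as keyword-only (def __dlpack__(self, *, stream=None)), so a positional stream argument raises TypeError, which is why the fix passes it as a keyword. The Producer class below is illustrative, not real NumPy or JAX code.

```python
class Producer:
    # Minimal stand-in for a NumPy/JAX-style DLPack exporter; both
    # declare stream as keyword-only, as in this signature.
    def __dlpack__(self, *, stream=None):
        return ("capsule", stream)

p = Producer()
try:
    p.__dlpack__(7)                      # positional stream: rejected
    positional_accepted = True
except TypeError:
    positional_accepted = False

capsule = p.__dlpack__(stream=7)         # keyword form used by the fix
```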

Confidence Score: 3/5

  • Mostly safe but the overly-broad RuntimeError catch in the native fast path can silently suppress real CUDA errors, and the per-batch device-type guard is incomplete — fix both before merging.
  • The C++ bulk constructor is well-structured with proper stream keyword argument passing and cross-tensor device-ID validation. The DLPack fallback in _tensor.py is clean. The main concerns are in _batch.py: (1) catching RuntimeError in the native DALI fast path will hide legitimate backend failures, degrading debuggability in production; (2) the DLPack fast path only validates the first tensor's device type, so a mixed-device-type batch silently reaches C++ and produces confusing low-level errors. Neither is a data-corruption risk, but both are observable regressions in error handling.
  • dali/python/nvidia/dali/experimental/dynamic/_batch.py requires the most attention — specifically the except (TypeError, RuntimeError) arm in the native fast path and the missing per-tensor device-type validation in the DLPack fast path.
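The narrowing suggested for the native fast path could look like this sketch (helper names hypothetical): only TypeError signals "input not eligible, use the slow path", while RuntimeError from the backend (e.g. CUDA OOM) propagates instead of being masked.

```python
def try_fast_path(storages, ctor):
    # Hypothetical helper: ctor stands in for the TensorList constructor.
    # Only TypeError triggers the slow-path fallback; RuntimeError is
    # allowed to propagate so genuine backend failures stay visible.
    try:
        return ctor(storages)
    except TypeError:
        return None  # caller falls back to the per-tensor slow path

def wrong_input(storages):
    raise TypeError("not a list of storages")

def cuda_oom(storages):
    raise RuntimeError("CUDA out of memory")

assert try_fast_path([], wrong_input) is None  # quiet fallback
try:
    try_fast_path([], cuda_oom)
    oom_masked = True
except RuntimeError:
    oom_masked = False                          # real failure surfaces
```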

Important Files Changed

Filename Overview
dali/python/backend_impl.cc: Adds the new TensorListFromListOfDLPackObjects C++ bulk constructor and a device-ID consistency check in TensorListFromListOfTensors. The DLPack stream keyword argument is correctly fixed ("stream"_a = stream_handle). Device-ID cross-tensor validation is added in both the new function and the existing GPU path. Logic looks correct; synchronization in the non-contiguous path relies on the DLPack protocol rather than an explicit copy_order.wait(), which is intentional but worth reviewing.
dali/python/nvidia/dali/experimental/dynamic/_batch.py: Adds native DALI and GPU DLPack fast paths in Batch.__init__. The native fast path catches (TypeError, RuntimeError), which is overly broad — RuntimeError from CUDA failures (e.g., OOM) is swallowed and causes a fall-through to the slow path, degrading debuggability. The DLPack fast path only validates the device type of the first tensor; subsequent tensors with a different device type reach C++ without a clear Python-level error.
dali/python/nvidia/dali/experimental/dynamic/_tensor.py: Adds a BufferError/TypeError fallback to the __array__ interface for read-only CPU tensors, and separates DLPack capsule extraction into its own variable for clarity. Logic is clean and the fallback chain is well-guarded.
dali/test/python/experimental_mode/test_batch.py: Good test coverage for all new fast paths: native DALI tensor fast path (CPU + GPU), GPU DLPack bulk constructor (with/without layout), dtype-override fallback, and CPU slow path. Tests are well-structured and cover the intended behaviors.
dali/test/python/experimental_mode/test_tensor.py: Adds tests for read-only numpy arrays and scalars as tensors. Coverage is adequate for the specific BufferError fallback added in _tensor.py.
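The missing per-tensor device-type guard flagged for _batch.py could be a small Python-side check before entering the C++ fast path. This sketch uses hypothetical names; the constant kDLCUDA = 2 is the CUDA device-type code from the DLPack specification, and Fake is a toy stand-in for a DLPack-capable tensor.

```python
kDLCUDA = 2  # DLPack device-type code for CUDA, per the DLPack spec

def check_all_cuda(tensors):
    # Hypothetical guard: validate every tensor's DLPack device type,
    # not just tensors[0], so mixed-device batches fail with a clear
    # ValueError instead of an opaque C++ error.
    for i, t in enumerate(tensors):
        dev_type, dev_id = t.__dlpack_device__()
        if dev_type != kDLCUDA:
            raise ValueError(
                f"tensor {i} has DLPack device type {dev_type}, expected CUDA"
            )

class Fake:
    # Toy tensor exposing only __dlpack_device__.
    def __init__(self, dev_type):
        self._d = dev_type
    def __dlpack_device__(self):
        return (self._d, 0)

check_all_cuda([Fake(2), Fake(2)])       # all CUDA: passes
try:
    check_all_cuda([Fake(2), Fake(1)])   # second tensor is CPU (kDLCPU = 1)
    mixed_ok = True
except ValueError:
    mixed_ok = False
```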

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Batch.__init__ with list of tensors] --> B{Materialise to list}
    B --> C{dtype is None AND first element\nis ndd.Tensor with _storage?}
    C -- Yes --> D[Native DALI fast path\nBuild TensorList from _storage objects]
    D --> E{Construction\nsucceeded?}
    E -- Yes --> Z[fast_path_used = True]
    E -- No: TypeError or RuntimeError --> F[Fall through]
    C -- No --> F
    F --> G{dtype is None AND first element\nhas __dlpack_device__ AND is GPU type 2?}
    G -- Yes --> H[DLPack GPU fast path\nTensorListGPU C++ bulk constructor]
    H --> I{TypeError?}
    I -- No --> Z
    I -- Yes --> J[Fall through]
    G -- No --> J
    J --> K[Slow path: wrap each element\nin Tensor one-by-one]
    K --> Z
    Z --> L[Batch object ready]

    subgraph CPP [C++ TensorListFromListOfDLPackObjects]
        M[For each object] --> N{Has __dlpack__?}
        N -- No --> O[Raise TypeError]
        N -- Yes --> P[Call __dlpack__ with stream keyword arg]
        P --> Q[FillTensorFromDlPack]
        Q --> R{i==0?}
        R -- Yes --> S[SetupLike, record copy_order, set expected_device_id]
        R -- No --> T{device_id matches?}
        T -- No --> U[Raise ValueError]
        T -- Yes --> V{type matches?}
        V -- No --> W[Raise TypeError]
        V -- Yes --> X[SetSample i]
        S --> X
    end

    H -.->|invokes| CPP

Last reviewed commit: "Improve DLPack suppo..."

@JanuszL JanuszL marked this pull request as draft March 18, 2026 11:52
@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from e432f38 to 4b868b5 on March 18, 2026 11:56
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch 2 times, most recently from d1c2614 to c9793e7 on March 18, 2026 12:09
@JanuszL JanuszL marked this pull request as ready for review March 18, 2026 12:11
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46425910]: BUILD STARTED

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from c9793e7 to 50843aa on March 18, 2026 12:34
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@dali-automaton
Collaborator

CI MESSAGE: [46425910]: BUILD PASSED

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch 2 times, most recently from fbb02c6 to 61611fe on March 18, 2026 15:12
- Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
  list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
  builds the TensorList in a single pass, recording a CUDA event on
  the provided stream. Passes `stream` as a keyword argument to
  `__dlpack__()` for compatibility with NumPy ≥ 1.22 and JAX which
  define `def __dlpack__(self, *, stream=None)`.
- Adds native DALI fast path in Batch.__init__: when given a list of
  already-evaluated ndd.Tensor objects, pass their _storage objects
  directly to TensorListGPU/CPU constructors, preserving all DALI
  metadata (layout, enum types) without going through DLPack.
- Adds GPU DLPack fast path in Batch.__init__: when given a list of
  external GPU tensors (e.g. PyTorch) that support DLPack, use the
  new C++ bulk constructor to avoid per-tensor Python overhead.
- Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
  catch BufferError from __dlpack__() and fall back to __array__
  interface. This makes the following work:

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    ndd.as_tensor(arr)  # previously raised BufferError

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from 61611fe to b7a90c8 on March 18, 2026 15:13
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46436927]: BUILD STARTED

@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@dali-automaton
Collaborator

CI MESSAGE: [46436927]: BUILD PASSED
