
Improve DLPack support for external tensor consumption#6261

Open
JanuszL wants to merge 1 commit into NVIDIA:main from JanuszL:dynamic_as_batch_fast_path

Conversation

@JanuszL
Contributor

@JanuszL JanuszL commented Mar 18, 2026

Category:

Description:

  • Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
    list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
    builds the TensorList in a single pass, recording a CUDA event on
    the provided stream. Passes stream as a keyword argument to
    __dlpack__() for compatibility with NumPy ≥ 1.22 and JAX which
    define def __dlpack__(self, *, stream=None).

  • Adds native DALI fast path in Batch.__init__: when given a list of
    already-evaluated ndd.Tensor objects, pass their _storage objects
    directly to TensorListGPU/CPU constructors, preserving all DALI
    metadata (layout, enum types) without going through DLPack.

  • Adds GPU DLPack fast path in Batch.__init__: when given a list of
    external GPU tensors (e.g. PyTorch) that support DLPack, use the
    new C++ bulk constructor to avoid per-tensor Python overhead.

  • Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
    catch BufferError from __dlpack__() and fall back to the __array__
    interface. This makes the following work:

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    ndd.as_tensor(arr) # previously raised BufferError
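A minimal sketch of that fallback order, with illustrative names rather than DALI's actual implementation: prefer the zero-copy DLPack path, and on BufferError (or TypeError) fall back to the __array__ interface, which is what makes read-only NumPy arrays work.

```python
import numpy as np

def as_cpu_tensor(obj):
    # Hypothetical helper mirroring the fallback described above.
    # Prefer DLPack; a read-only NumPy array may raise BufferError
    # from __dlpack__(), in which case we use the __array__ interface.
    try:
        return np.from_dlpack(obj)   # zero-copy DLPack path
    except (BufferError, TypeError):
        return np.asarray(obj)       # __array__ fallback

arr = np.array([1, 2, 3])
arr.flags.writeable = False          # read-only input used to raise
t = as_cpu_tensor(arr)               # now handled via the fallback
```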

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
      • extended test_batch.py
      • extended test_tensor.py
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4580

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from 895265b to e432f38 on March 18, 2026 11:01
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46421794]: BUILD STARTED

@greptile-apps
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR improves DLPack support in DALI's experimental dynamic API by adding three fast paths and one fallback: (1) a C++ bulk TensorListGPU constructor that accepts a Python list of DLPack-compatible GPU objects in a single pass, (2) a native DALI fast path in Batch.__init__ that short-circuits directly to backend storage for already-evaluated ndd.Tensor objects, (3) a GPU DLPack fast path in Batch.__init__ that routes PyTorch-style GPU tensors through the new C++ bulk constructor, and (4) a BufferError/TypeError fallback to __array__ in Tensor.__init__ so read-only NumPy arrays can be wrapped without error.

Key points:

  • The C++ bulk constructor correctly passes stream as a keyword argument ("stream"_a = stream_handle) to __dlpack__(), fixing the previously reported positional-argument incompatibility with NumPy ≥ 1.22 and JAX.
  • Device-ID consistency is now validated across all tensors in both TensorListFromListOfDLPackObjects and the GPU path of TensorListFromListOfTensors.
  • The native DALI fast path catches (TypeError, RuntimeError) — the RuntimeError arm is overly broad and will silently swallow genuine CUDA failures (e.g., OOM), causing a confusing slow-path error instead of the original exception.
  • The DLPack GPU fast path only inspects _tensors_list[0].__dlpack_device__() to decide the device type; a device-type mismatch on a later tensor is not caught on the Python side and will produce an opaque C++ error rather than a clear ValueError.
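The stream-keyword point above can be reproduced with a minimal stand-in exporter: NumPy ≥ 1.22 and JAX declare stream as keyword-only (def __dlpack__(self, *, stream=None)), so a positional stream argument raises TypeError, which is why the fix passes it as a keyword. The Producer class below is illustrative, not real NumPy or JAX code.

```python
class Producer:
    # Minimal stand-in for a NumPy/JAX-style DLPack exporter; both
    # declare stream as keyword-only, as in this signature.
    def __dlpack__(self, *, stream=None):
        return ("capsule", stream)

p = Producer()
try:
    p.__dlpack__(7)                      # positional stream: rejected
    positional_accepted = True
except TypeError:
    positional_accepted = False

capsule = p.__dlpack__(stream=7)         # keyword form used by the fix
```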

Confidence Score: 3/5

  • Mostly safe but the overly-broad RuntimeError catch in the native fast path can silently suppress real CUDA errors, and the per-batch device-type guard is incomplete — fix both before merging.
  • The C++ bulk constructor is well-structured with proper stream keyword argument passing and cross-tensor device-ID validation. The DLPack fallback in _tensor.py is clean. The main concerns are in _batch.py: (1) catching RuntimeError in the native DALI fast path will hide legitimate backend failures, degrading debuggability in production; (2) the DLPack fast path only validates the first tensor's device type, so a mixed-device-type batch silently reaches C++ and produces confusing low-level errors. Neither is a data-corruption risk, but both are observable regressions in error handling.
  • dali/python/nvidia/dali/experimental/dynamic/_batch.py requires the most attention — specifically the except (TypeError, RuntimeError) arm in the native fast path and the missing per-tensor device-type validation in the DLPack fast path.
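The narrowing suggested for the native fast path could look like this sketch (helper names hypothetical): only TypeError signals "input not eligible, use the slow path", while RuntimeError from the backend (e.g. CUDA OOM) propagates instead of being masked.

```python
def try_fast_path(storages, ctor):
    # Hypothetical helper: ctor stands in for the TensorList constructor.
    # Only TypeError triggers the slow-path fallback; RuntimeError is
    # allowed to propagate so genuine backend failures stay visible.
    try:
        return ctor(storages)
    except TypeError:
        return None  # caller falls back to the per-tensor slow path

def wrong_input(storages):
    raise TypeError("not a list of storages")

def cuda_oom(storages):
    raise RuntimeError("CUDA out of memory")

assert try_fast_path([], wrong_input) is None  # quiet fallback
try:
    try_fast_path([], cuda_oom)
    oom_masked = True
except RuntimeError:
    oom_masked = False                          # real failure surfaces
```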

Important Files Changed

Filename Overview
dali/python/backend_impl.cc: Adds the new TensorListFromListOfDLPackObjects C++ bulk constructor and a device-ID consistency check in TensorListFromListOfTensors. The DLPack stream keyword argument is correctly fixed ("stream"_a = stream_handle). Device-ID cross-tensor validation is added in both the new function and the existing GPU path. Logic looks correct; synchronization in the non-contiguous path relies on the DLPack protocol rather than an explicit copy_order.wait(), which is intentional but worth reviewing.
dali/python/nvidia/dali/experimental/dynamic/_batch.py: Adds native DALI and GPU DLPack fast paths in Batch.__init__. The native fast path catches (TypeError, RuntimeError), which is overly broad — RuntimeError from CUDA failures (e.g., OOM) is swallowed and causes a fall-through to the slow path, degrading debuggability. The DLPack fast path only validates the device type of the first tensor; subsequent tensors with a different device type reach C++ without a clear Python-level error.
dali/python/nvidia/dali/experimental/dynamic/_tensor.py: Adds a BufferError/TypeError fallback to the __array__ interface for read-only CPU tensors, and separates DLPack capsule extraction into its own variable for clarity. Logic is clean and the fallback chain is well-guarded.
dali/test/python/experimental_mode/test_batch.py: Good test coverage for all new fast paths: native DALI tensor fast path (CPU + GPU), GPU DLPack bulk constructor (with/without layout), dtype-override fallback, and CPU slow path. Tests are well-structured and cover the intended behaviors.
dali/test/python/experimental_mode/test_tensor.py: Adds tests for read-only numpy arrays and scalars as tensors. Coverage is adequate for the specific BufferError fallback added in _tensor.py.
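The missing per-tensor device-type guard flagged for _batch.py could be a small Python-side check before entering the C++ fast path. This sketch uses hypothetical names; the constant kDLCUDA = 2 is the CUDA device-type code from the DLPack specification, and Fake is a toy stand-in for a DLPack-capable tensor.

```python
kDLCUDA = 2  # DLPack device-type code for CUDA, per the DLPack spec

def check_all_cuda(tensors):
    # Hypothetical guard: validate every tensor's DLPack device type,
    # not just tensors[0], so mixed-device batches fail with a clear
    # ValueError instead of an opaque C++ error.
    for i, t in enumerate(tensors):
        dev_type, dev_id = t.__dlpack_device__()
        if dev_type != kDLCUDA:
            raise ValueError(
                f"tensor {i} has DLPack device type {dev_type}, expected CUDA"
            )

class Fake:
    # Toy tensor exposing only __dlpack_device__.
    def __init__(self, dev_type):
        self._d = dev_type
    def __dlpack_device__(self):
        return (self._d, 0)

check_all_cuda([Fake(2), Fake(2)])       # all CUDA: passes
try:
    check_all_cuda([Fake(2), Fake(1)])   # second tensor is CPU (kDLCPU = 1)
    mixed_ok = True
except ValueError:
    mixed_ok = False
```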

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Batch.__init__ with list of tensors] --> B{Materialise to list}
    B --> C{dtype is None AND first element\nis ndd.Tensor with _storage?}
    C -- Yes --> D[Native DALI fast path\nBuild TensorList from _storage objects]
    D --> E{Construction\nsucceeded?}
    E -- Yes --> Z[fast_path_used = True]
    E -- No: TypeError or RuntimeError --> F[Fall through]
    C -- No --> F
    F --> G{dtype is None AND first element\nhas __dlpack_device__ AND is GPU type 2?}
    G -- Yes --> H[DLPack GPU fast path\nTensorListGPU C++ bulk constructor]
    H --> I{TypeError?}
    I -- No --> Z
    I -- Yes --> J[Fall through]
    G -- No --> J
    J --> K[Slow path: wrap each element\nin Tensor one-by-one]
    K --> Z
    Z --> L[Batch object ready]

    subgraph CPP [C++ TensorListFromListOfDLPackObjects]
        M[For each object] --> N{Has __dlpack__?}
        N -- No --> O[Raise TypeError]
        N -- Yes --> P[Call __dlpack__ with stream keyword arg]
        P --> Q[FillTensorFromDlPack]
        Q --> R{i==0?}
        R -- Yes --> S[SetupLike, record copy_order, set expected_device_id]
        R -- No --> T{device_id matches?}
        T -- No --> U[Raise ValueError]
        T -- Yes --> V{type matches?}
        V -- No --> W[Raise TypeError]
        V -- Yes --> X[SetSample i]
        S --> X
    end

    H -.->|invokes| CPP

Last reviewed commit: "Improve DLPack suppo..."

@JanuszL JanuszL marked this pull request as draft March 18, 2026 11:52
@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from e432f38 to 4b868b5 on March 18, 2026 11:56
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch 2 times, most recently from d1c2614 to c9793e7 on March 18, 2026 12:09
@JanuszL JanuszL marked this pull request as ready for review March 18, 2026 12:11
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46425910]: BUILD STARTED

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from c9793e7 to 50843aa on March 18, 2026 12:34
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@dali-automaton
Collaborator

CI MESSAGE: [46425910]: BUILD PASSED

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch 2 times, most recently from fbb02c6 to 61611fe on March 18, 2026 15:12
- Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
  list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
  builds the TensorList in a single pass, recording a CUDA event on
  the provided stream. Passes `stream` as a keyword argument to
  `__dlpack__()` for compatibility with NumPy ≥ 1.22 and JAX which
  define `def __dlpack__(self, *, stream=None)`.
- Adds native DALI fast path in Batch.__init__: when given a list of
  already-evaluated ndd.Tensor objects, pass their _storage objects
  directly to TensorListGPU/CPU constructors, preserving all DALI
  metadata (layout, enum types) without going through DLPack.
- Adds GPU DLPack fast path in Batch.__init__: when given a list of
  external GPU tensors (e.g. PyTorch) that support DLPack, use the
  new C++ bulk constructor to avoid per-tensor Python overhead.
- Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
  catch BufferError from __dlpack__() and fall back to __array__
  interface. This makes the following work:

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    ndd.as_tensor(arr)  # previously raised BufferError

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from 61611fe to b7a90c8 on March 18, 2026 15:13
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46436927]: BUILD STARTED

@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@dali-automaton
Collaborator

CI MESSAGE: [46436927]: BUILD PASSED
