
Defer DLTensor deletion when CUDA graph capture is active.#6259

Open
JanuszL wants to merge 2 commits into NVIDIA:main from JanuszL:dlpack_recycle

Conversation

@JanuszL (Contributor) commented Mar 17, 2026

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

  • cudaDeviceSynchronize returns cudaErrorStreamCaptureUnsupported when any
    stream on the device is being captured. DLTensorGraveyard now puts pending
    deletions back in the queue and retries after a short wait instead of
    propagating the error.
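As a stand-alone illustration of the approach described above, the sketch below models the splice-back-and-retry step without any CUDA dependency. `SyncResult`, `process_pending`, and the `sync` function-pointer parameter are hypothetical stand-ins for `cudaError_t`, the graveyard worker body, and `cudaDeviceSynchronize`; this is not the DALI code itself.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

// Hypothetical stand-in for the relevant cudaError_t values.
enum SyncResult { kSuccess, kStreamCaptureUnsupported };

// Simplified model of one worker iteration: take the pending batch,
// synchronize, and on a capture-related refusal splice the batch back
// untouched so nothing is freed prematurely. Returns true on success,
// false when the caller should retry later.
bool process_pending(std::list<int>& pending, std::mutex& m,
                     std::condition_variable& cv, SyncResult (*sync)()) {
  std::unique_lock<std::mutex> lock(m);
  std::list<int> tmp = std::move(pending);
  lock.unlock();                       // don't hold the lock across the sync
  SyncResult err = sync();
  if (err == kStreamCaptureUnsupported) {
    lock.lock();
    pending.splice(pending.begin(), tmp);  // put every item back; tmp is now empty
    cv.wait_for(lock, std::chrono::milliseconds(1),
                [] { return false; });      // short back-off before retrying
    return false;
  }
  tmp.clear();  // success: the device is idle, deleters may run now
  return true;
}
```

The key property, matching the fix: on the retry path the batch is returned to the queue before any element is destroyed, so a retry loses no deleters.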

Additional information:

Affected modules and functionalities:

  • dltensor.cc

Key points relevant for the review:

  • N/A

Tests:

  • Existing tests apply
    • TL0_jupyter
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

- cudaDeviceSynchronize returns cudaErrorStreamCaptureUnsupported when any
  stream on the device is being captured. DLTensorGraveyard now puts pending
  deletions back in the queue and retries after a short wait instead of
  propagating the error.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor, Author) commented Mar 17, 2026

!build

@greptile-apps bot (Contributor) commented Mar 17, 2026

Greptile Summary

This PR fixes an error-propagation issue that occurred when DLTensorGraveyard's worker thread called cudaDeviceSynchronize() while a CUDA graph capture was active on the device. cudaDeviceSynchronize is disallowed during any stream capture and returns cudaErrorStreamCaptureUnsupported; the fix catches this error, splices the pending deletion items back into the queue, and retries after a short wait.

  • The new cudaErrorStreamCaptureUnsupported branch correctly re-locks the mutex before splicing tmp back into pending_, ensuring no shared pointers are dropped prematurely.
  • Lock ownership is properly maintained throughout the new code path: unlock before cudaDeviceSynchronize, re-lock before splice, cv_.wait_for atomically manages the lock, and continue re-enters the loop with the lock held — which is the correct state for cv_.wait at the top of the loop.
  • Shutdown correctness is preserved: if exit_requested_ becomes true during CUDA capture, cv_.wait_for will wake up (either via notify_one or the predicate becoming true), continue will hit the top-of-loop cv_.wait which returns immediately, and the if (exit_requested_) break will terminate the worker cleanly.
  • The tmp list is always empty after the splice, so its end-of-iteration destruction is a no-op with no risk of premature memory release.
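The loop and shutdown behavior described in these bullets can be modeled in a self-contained sketch, again with no CUDA dependency. All names (`Graveyard`, `capture_active`, `processed`) are illustrative; the real worker is `DLTensorGraveyard::run` and the capture check is the `cudaErrorStreamCaptureUnsupported` result of `cudaDeviceSynchronize`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>
#include <thread>

// Models the worker loop: while capture is "active", batches are spliced
// back and the thread backs off; setting exit_requested wakes the back-off
// wait, the top-of-loop wait returns immediately, and the loop breaks.
struct Graveyard {
  std::mutex m;
  std::condition_variable cv;
  std::list<int> pending;
  bool exit_requested = false;
  int processed = 0;

  void run(bool capture_active) {
    std::unique_lock<std::mutex> lock(m);
    for (;;) {
      cv.wait(lock, [&] { return exit_requested || !pending.empty(); });
      if (exit_requested) break;  // clean termination
      std::list<int> tmp = std::move(pending);
      lock.unlock();
      if (capture_active) {  // sync would fail: put the batch back and retry
        lock.lock();
        pending.splice(pending.begin(), tmp);
        cv.wait_for(lock, std::chrono::milliseconds(1),
                    [&] { return exit_requested; });
        continue;  // lock is held; top-of-loop wait decides immediately
      }
      processed += static_cast<int>(tmp.size());
      tmp.clear();
      lock.lock();
    }
  }

  void stop() {
    { std::lock_guard<std::mutex> g(m); exit_requested = true; }
    cv.notify_one();
  }
};
```

Whatever the interleaving, a worker stuck in the capture-retry back-off exits promptly after `stop()`, and the queued items survive untouched.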

Confidence Score: 4/5

  • This PR is safe to merge; the fix is logically correct, lock management is sound, and no data is lost across retries.
  • The change is minimal (12 lines in one file), targets a well-understood error code, and the surrounding synchronization logic is correctly preserved. No pre-existing invariants are broken. The only notable trade-off — that the 1 ms retry window collapses into an immediate re-entry because !pending_.empty() is true right after the splice — was already identified and accepted in a prior review thread as a minor quality-of-life concern rather than a correctness bug.
  • No files require special attention.

Important Files Changed

Filename: dali/pipeline/data/dltensor.cc
Overview: Adds handling for cudaErrorStreamCaptureUnsupported in DLTensorGraveyard::run(): splices items back into pending_, waits up to 1 ms, then retries. Lock management is correct and memory is preserved across retries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Worker thread wakes\nlock held] --> B["cv_.wait: pending_ non-empty?"]
    B -->|yes| C["list_t tmp = move(pending_)\nlock.unlock()"]
    C --> D[cudaDeviceSynchronize]
    D -->|cudaErrorCudartUnloading| E[break — process shutting down]
    D -->|cudaErrorStreamCaptureUnsupported| F["lock.lock()\nsplice tmp → pending_.begin()"]
    F --> G["cv_.wait_for 1ms\nexit_requested_ OR !pending_.empty()"]
    G --> H["continue → top of loop\n(pending_ is non-empty, wait returns immediately)"]
    H --> B
    D -->|other error| I["CUDA_CALL — throws"]
    D -->|success| J["tmp.clear()\nlock.lock()\nnext iteration"]
    J --> B

Last reviewed commit: ca3397e

Comment on lines +113 to +115
cv_.wait_for(lock, std::chrono::milliseconds(1), [&]() {
return exit_requested_;
});
Contributor

P2 Retry predicate misses new-enqueue wake-ups

The cv_.wait_for predicate only checks exit_requested_, so a cv_.notify_one() fired from enqueue while the worker is sleeping here will wake it up, check the predicate (false), and then go back to sleep for the remainder of the 1 ms window. The practical impact is negligible (1 ms is very short), but if new tensors arrive just after the capture starts and the predicate included !pending_.empty(), the retry could also serve as an ordinary processing wake-up point. Consider:

Suggested change:

- cv_.wait_for(lock, std::chrono::milliseconds(1), [&]() {
-   return exit_requested_;
- });
+ cv_.wait_for(lock, std::chrono::milliseconds(1), [&]() {
+   return exit_requested_ || !pending_.empty();
+ });

This is a minor quality-of-life improvement, not a bug.
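The difference between the two predicates can be demonstrated deterministically. The helper below is a hypothetical stand-alone sketch, not DALI code: with items already queued, the extended predicate lets `wait_for` return immediately, while the exit-only predicate sleeps out the full timeout and reports a timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

// Returns what cv.wait_for returns: true if the predicate held before the
// timeout, false on timeout. check_pending toggles between the suggested
// predicate (exit || !pending.empty()) and the exit-only one.
bool wait_retry(std::mutex& m, std::condition_variable& cv,
                bool& exit_requested, std::list<int>& pending,
                bool check_pending) {
  std::unique_lock<std::mutex> lock(m);
  return cv.wait_for(lock, std::chrono::milliseconds(1), [&] {
    return exit_requested || (check_pending && !pending.empty());
  });
}
```

With the extended predicate, a wake-up fired by `enqueue` while items are present doubles as an ordinary processing wake-up, exactly as the comment suggests.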

@dali-automaton (Collaborator)

CI MESSAGE: [46341167]: BUILD STARTED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor, Author) commented Mar 17, 2026

!build

@dali-automaton (Collaborator)

CI MESSAGE: [46341668]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [46341668]: BUILD PASSED

@dali-automaton (Collaborator)

CI MESSAGE: [46351401]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [46351401]: BUILD PASSED
