
Defer DLTensor deletion when CUDA graph capture is active.#6259

Open
JanuszL wants to merge 2 commits into NVIDIA:main from JanuszL:dlpack_recycle

Conversation

@JanuszL (Contributor) commented Mar 17, 2026

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

  • cudaDeviceSynchronize returns cudaErrorStreamCaptureUnsupported when any
    stream on the device is being captured. DLTensorGraveyard now puts pending
    deletions back in the queue and retries after a short wait instead of
    propagating the error.
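As a stand-alone illustration of the approach described above, the sketch below models the splice-back-and-retry step without any CUDA dependency. `SyncResult`, `process_pending`, and the `sync` function-pointer parameter are hypothetical stand-ins for `cudaError_t`, the graveyard worker body, and `cudaDeviceSynchronize`; this is not the DALI code itself.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

// Hypothetical stand-in for the relevant cudaError_t values.
enum SyncResult { kSuccess, kStreamCaptureUnsupported };

// Simplified model of one worker iteration: take the pending batch,
// synchronize, and on a capture-related refusal splice the batch back
// untouched so nothing is freed prematurely. Returns true on success,
// false when the caller should retry later.
bool process_pending(std::list<int>& pending, std::mutex& m,
                     std::condition_variable& cv, SyncResult (*sync)()) {
  std::unique_lock<std::mutex> lock(m);
  std::list<int> tmp = std::move(pending);
  lock.unlock();                       // don't hold the lock across the sync
  SyncResult err = sync();
  if (err == kStreamCaptureUnsupported) {
    lock.lock();
    pending.splice(pending.begin(), tmp);  // put every item back; tmp is now empty
    cv.wait_for(lock, std::chrono::milliseconds(1),
                [] { return false; });      // short back-off before retrying
    return false;
  }
  tmp.clear();  // success: the device is idle, deleters may run now
  return true;
}
```

The key property, matching the fix: on the retry path the batch is returned to the queue before any element is destroyed, so a retry loses no deleters.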

Additional information:

Affected modules and functionalities:

  • dltensor.cc

Key points relevant for the review:

  • N/A

Tests:

  • Existing tests apply
    • TL0_jupyter
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

- cudaDeviceSynchronize returns cudaErrorStreamCaptureUnsupported when any
  stream on the device is being captured. DLTensorGraveyard now puts pending
  deletions back in the queue and retries after a short wait instead of
  propagating the error.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor, Author) commented Mar 17, 2026

!build

@greptile-apps bot (Contributor) commented Mar 17, 2026

Greptile Summary

This PR fixes an error-propagation issue that occurred when DLTensorGraveyard's worker thread called cudaDeviceSynchronize() while a CUDA graph capture was active on the device. cudaDeviceSynchronize is disallowed during any stream capture and returns cudaErrorStreamCaptureUnsupported; the fix catches this error, splices the pending deletion items back into the queue, and retries after a short wait.

  • The new cudaErrorStreamCaptureUnsupported branch correctly re-locks the mutex before splicing tmp back into pending_, ensuring no shared pointers are dropped prematurely.
  • Lock ownership is properly maintained throughout the new code path: unlock before cudaDeviceSynchronize, re-lock before splice, cv_.wait_for atomically manages the lock, and continue re-enters the loop with the lock held — which is the correct state for cv_.wait at the top of the loop.
  • Shutdown correctness is preserved: if exit_requested_ becomes true during CUDA capture, cv_.wait_for will wake up (either via notify_one or the predicate becoming true), continue will hit the top-of-loop cv_.wait which returns immediately, and the if (exit_requested_) break will terminate the worker cleanly.
  • The tmp list is always empty after the splice, so its end-of-iteration destruction is a no-op with no risk of premature memory release.
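The loop and shutdown behavior described in these bullets can be modeled in a self-contained sketch, again with no CUDA dependency. All names (`Graveyard`, `capture_active`, `processed`) are illustrative; the real worker is `DLTensorGraveyard::run` and the capture check is the `cudaErrorStreamCaptureUnsupported` result of `cudaDeviceSynchronize`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>
#include <thread>

// Models the worker loop: while capture is "active", batches are spliced
// back and the thread backs off; setting exit_requested wakes the back-off
// wait, the top-of-loop wait returns immediately, and the loop breaks.
struct Graveyard {
  std::mutex m;
  std::condition_variable cv;
  std::list<int> pending;
  bool exit_requested = false;
  int processed = 0;

  void run(bool capture_active) {
    std::unique_lock<std::mutex> lock(m);
    for (;;) {
      cv.wait(lock, [&] { return exit_requested || !pending.empty(); });
      if (exit_requested) break;  // clean termination
      std::list<int> tmp = std::move(pending);
      lock.unlock();
      if (capture_active) {  // sync would fail: put the batch back and retry
        lock.lock();
        pending.splice(pending.begin(), tmp);
        cv.wait_for(lock, std::chrono::milliseconds(1),
                    [&] { return exit_requested; });
        continue;  // lock is held; top-of-loop wait decides immediately
      }
      processed += static_cast<int>(tmp.size());
      tmp.clear();
      lock.lock();
    }
  }

  void stop() {
    { std::lock_guard<std::mutex> g(m); exit_requested = true; }
    cv.notify_one();
  }
};
```

Whatever the interleaving, a worker stuck in the capture-retry back-off exits promptly after `stop()`, and the queued items survive untouched.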

Confidence Score: 4/5

  • This PR is safe to merge; the fix is logically correct, lock management is sound, and no data is lost across retries.
  • The change is minimal (12 lines in one file), targets a well-understood error code, and the surrounding synchronization logic is correctly preserved. No pre-existing invariants are broken. The only notable trade-off — that the 1 ms retry window collapses into an immediate re-entry because !pending_.empty() is true right after the splice — was already identified and accepted in a prior review thread as a minor quality-of-life concern rather than a correctness bug.
  • No files require special attention.

Important Files Changed

Filename: dali/pipeline/data/dltensor.cc
Overview: Adds handling for cudaErrorStreamCaptureUnsupported in DLTensorGraveyard::run(): splices items back into pending_, waits up to 1 ms, then retries. Lock management is correct and memory is preserved across retries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Worker thread wakes\nlock held] --> B["cv_.wait: pending_ non-empty?"]
    B -->|yes| C["list_t tmp = move(pending_)\nlock.unlock()"]
    C --> D[cudaDeviceSynchronize]
    D -->|cudaErrorCudartUnloading| E[break — process shutting down]
    D -->|cudaErrorStreamCaptureUnsupported| F["lock.lock()\nsplice tmp → pending_.begin()"]
    F --> G["cv_.wait_for 1ms\nexit_requested_ OR !pending_.empty()"]
    G --> H["continue → top of loop\n(pending_ is non-empty, wait returns immediately)"]
    H --> B
    D -->|other error| I["CUDA_CALL — throws"]
    D -->|success| J["tmp.clear()\nlock.lock()\nnext iteration"]
    J --> B

Last reviewed commit: ca3397e

Comment on lines +113 to +115
cv_.wait_for(lock, std::chrono::milliseconds(1), [&]() {
return exit_requested_;
});
Contributor

P2 Retry predicate misses new-enqueue wake-ups

The cv_.wait_for predicate only checks exit_requested_, so a cv_.notify_one() fired from enqueue while the worker is sleeping here will wake it up, check the predicate (false), and then go back to sleep for the remainder of the 1 ms window. The practical impact is negligible (1 ms is very short), but if new tensors arrive just after the capture starts and the predicate included !pending_.empty(), the retry could also serve as an ordinary processing wake-up point. Consider:

Suggested change:

- cv_.wait_for(lock, std::chrono::milliseconds(1), [&]() {
-   return exit_requested_;
- });
+ cv_.wait_for(lock, std::chrono::milliseconds(1), [&]() {
+   return exit_requested_ || !pending_.empty();
+ });

This is a minor quality-of-life improvement, not a bug.
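The difference between the two predicates can be demonstrated deterministically. The helper below is a hypothetical stand-alone sketch, not DALI code: with items already queued, the extended predicate lets `wait_for` return immediately, while the exit-only predicate sleeps out the full timeout and reports a timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

// Returns what cv.wait_for returns: true if the predicate held before the
// timeout, false on timeout. check_pending toggles between the suggested
// predicate (exit || !pending.empty()) and the exit-only one.
bool wait_retry(std::mutex& m, std::condition_variable& cv,
                bool& exit_requested, std::list<int>& pending,
                bool check_pending) {
  std::unique_lock<std::mutex> lock(m);
  return cv.wait_for(lock, std::chrono::milliseconds(1), [&] {
    return exit_requested || (check_pending && !pending.empty());
  });
}
```

With the extended predicate, a wake-up fired by `enqueue` while items are present doubles as an ordinary processing wake-up, exactly as the comment suggests.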

@dali-automaton (Collaborator)

CI MESSAGE: [46341167]: BUILD STARTED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor, Author) commented Mar 17, 2026

!build

@dali-automaton (Collaborator)

CI MESSAGE: [46341668]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [46341668]: BUILD PASSED

@dali-automaton (Collaborator)

CI MESSAGE: [46351401]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [46351401]: BUILD PASSED
