Fix ThreadPool shutdown deadlock on Windows with CUDA #2027
Open
Deverydoo wants to merge 1 commit into OpenNMT:master
On Windows, the ThreadPool destructor can deadlock during CUDA model cleanup. The root cause is a race condition between queue.close() and the worker's idle() callback:

1. Worker enters idle() -> synchronize_stream() (blocking CUDA call)
2. Main thread calls queue.close() -> sets _request_end, notifies
3. Worker is stuck in synchronize_stream(), misses the notification
4. Worker::join() blocks indefinitely

This manifests as application hangs when unloading Whisper/NLLB models after transcription or translation completes. Confirmed on RTX 4090 and RTX 5070 (Blackwell) with CUDA 12.x and 13.x.

Changes:

- Add Worker::prepare_shutdown() virtual method, called by ThreadPool::~ThreadPool() BEFORE queue.close(). This allows workers to stop blocking idle operations before the queue signals shutdown.
- ReplicaWorker overrides prepare_shutdown() to set _shutting_down atomic flag with release semantics. The idle() method checks this flag with acquire semantics before calling synchronize_stream().
- Worker::join() now accepts a timeout_ms parameter (default 5000ms). If the worker thread doesn't finish within the timeout, both the worker thread and the join helper are detached to prevent blocking the process. This handles the case where finalize() -> _replica.reset() hangs on CUDA resource deallocation.
- JobQueue::get() before_wait loop now checks _request_end before calling before_wait(), and releases the lock during the callback to prevent holding the mutex during blocking CUDA operations.

Tested with CTranslate2 Whisper and NLLB models on Windows 10/11 with CUDA 12.8 and 13.2 (sm_75 through sm_120).
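To make the timed join in the changes above concrete, here is a minimal sketch of a join with a timeout and a detach fallback. It is not the code from this PR: TimedThread is a hypothetical wrapper, and it assumes the thread body fulfils a std::promise itself rather than using the separate join helper thread the commit message describes.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical wrapper for illustration; not the CTranslate2 Worker class.
class TimedThread {
public:
  template <typename Fn>
  explicit TimedThread(Fn fn) {
    std::promise<void> done;
    _done = done.get_future();
    // The thread body fulfils the promise as its last action, so the
    // owner can wait on the future with a deadline.
    _thread = std::thread(
        [fn = std::move(fn), done = std::move(done)]() mutable {
          fn();
          done.set_value();
        });
  }

  TimedThread(const TimedThread&) = delete;
  TimedThread& operator=(const TimedThread&) = delete;

  // If join() was never called, fall back to an unbounded join.
  ~TimedThread() {
    if (_thread.joinable())
      _thread.join();
  }

  // Returns true if the thread finished and was joined within the timeout,
  // false if it was detached because the timeout expired.
  bool join(std::chrono::milliseconds timeout) {
    if (!_thread.joinable())
      return true;
    if (_done.wait_for(timeout) == std::future_status::ready) {
      _thread.join();  // the body has finished; this returns promptly
      return true;
    }
    _thread.detach();  // give up: the thread keeps running on its own
    return false;
  }

private:
  std::future<void> _done;
  std::thread _thread;
};
```

A detached thread must not touch state that its owner destroys afterwards, so a fallback like this only bounds how long shutdown can block; it does not release whatever the stuck thread still holds.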
Contributor
There is another PR for the same issue: #1912
Contributor
I think it's time to replace
Summary
Fixes application hang when destroying CTranslate2 models on Windows with CUDA GPU acceleration. The ThreadPool destructor deadlocks because worker threads get stuck in blocking CUDA synchronization calls during shutdown.

Problem

When a CTranslate2 model is destroyed (e.g., after Whisper transcription or NLLB translation completes), the ThreadPool destructor calls queue.close() then worker->join(). A race condition causes deadlock:

1. Worker thread enters idle() → synchronize_stream() (blocking CUDA call)
2. Main thread calls queue.close() → sets _request_end, notifies condition variable
3. Worker is stuck in synchronize_stream(), not waiting on the condition variable, so the notification is lost
4. Worker::join() blocks indefinitely waiting for the worker to exit

This affects any application using CTranslate2 with CUDA on Windows. Confirmed on RTX 4090 and RTX 5070 (Blackwell) with CUDA 12.x and 13.x.
Related issues: #1782, SYSTRAN/faster-whisper#71
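The race comes down to the worker entering a blocking call after shutdown has begun. The sketch below shows one way to guard that entry: a flag set before the queue is closed, and a get() that re-checks the end flag and releases the mutex around the callback. SketchQueue and SketchWorker are hypothetical stand-ins, not the CTranslate2 classes. Jobs are plain ints, and a counter replaces the blocking synchronize_stream() call.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical stand-in for a job queue; jobs are plain ints here.
class SketchQueue {
public:
  void put(int job) {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _jobs.push(job);
    }
    _cv.notify_one();
  }

  void close() {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _request_end = true;
    }
    _cv.notify_all();
  }

  // Returns false once the queue is closed and drained (an assumption of
  // this sketch, not necessarily the library's behaviour).
  bool get(int& job, const std::function<void()>& before_wait) {
    std::unique_lock<std::mutex> lock(_mutex);
    // Only run the callback when we would otherwise wait, and never after
    // shutdown has been requested.
    if (before_wait && _jobs.empty() && !_request_end) {
      lock.unlock();  // do not hold the mutex across a blocking callback
      before_wait();
      lock.lock();
    }
    _cv.wait(lock, [this] { return !_jobs.empty() || _request_end; });
    if (_jobs.empty())
      return false;
    job = _jobs.front();
    _jobs.pop();
    return true;
  }

private:
  std::mutex _mutex;
  std::condition_variable _cv;
  std::queue<int> _jobs;
  bool _request_end = false;
};

// Hypothetical stand-in for a worker with a blocking idle step.
class SketchWorker {
public:
  // Called by the owner before SketchQueue::close().
  void prepare_shutdown() {
    _shutting_down.store(true, std::memory_order_release);
  }

  void idle() {
    if (_shutting_down.load(std::memory_order_acquire))
      return;        // skip the blocking call once shutdown has started
    ++_sync_calls;   // placeholder for a blocking synchronize_stream()
  }

  int sync_calls() const { return _sync_calls; }

private:
  std::atomic<bool> _shutting_down{false};
  int _sync_calls = 0;
};
```

With this ordering, an owner would call prepare_shutdown() on every worker, then close() on the queue, then join. The flag only stops new blocking calls from starting; a call already in progress still has to return on its own.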
Changes
include/ctranslate2/thread_pool.h:

- Worker::join() now accepts a timeout_ms parameter (default 5000ms). If the worker doesn't finish in time, both threads are detached to prevent indefinite blocking.
- Adds the Worker::prepare_shutdown() virtual method, called before queue.close().

include/ctranslate2/replica_pool.h:

- ReplicaWorker overrides prepare_shutdown() to set the _shutting_down atomic flag with release semantics.
- idle() checks the flag with acquire semantics before calling synchronize_stream(), preventing the race.
- finalize() also sets the flag before resetting the replica.
- Adds a std::atomic<bool> _shutting_down member.

src/thread_pool.cc:

- ThreadPool::~ThreadPool() calls prepare_shutdown() on all workers BEFORE queue.close().
- JobQueue::get() releases the mutex before calling the before_wait() callback and checks _request_end to avoid calling synchronize_stream() during shutdown.
- Worker::join() implements a timed join using std::promise/std::future with a detach fallback.

Test plan