
Fix ThreadPool shutdown deadlock on Windows with CUDA#2027

Open
Deverydoo wants to merge 1 commit into OpenNMT:master from Deverydoo:fix/windows-cuda-shutdown-hang

Conversation

@Deverydoo

Summary

Fixes application hang when destroying CTranslate2 models on Windows with CUDA GPU acceleration. The ThreadPool destructor deadlocks because worker threads get stuck in blocking CUDA synchronization calls during shutdown.

Problem

When a CTranslate2 model is destroyed (e.g., after Whisper transcription or NLLB translation completes), the ThreadPool destructor calls queue.close() then worker->join(). A race condition causes deadlock:

  1. Worker thread enters idle() → synchronize_stream() (blocking CUDA call)
  2. Main thread calls queue.close() → sets _request_end, notifies condition variable
  3. Worker is blocked inside synchronize_stream(), not waiting on the condition variable — notification is lost
  4. Worker::join() blocks indefinitely waiting for the worker to exit

This affects any application using CTranslate2 with CUDA on Windows. Confirmed on:

  • NVIDIA RTX 4090 (Ada Lovelace, sm_89) — Windows 10, CUDA 12.8 and 13.2
  • NVIDIA RTX 5070 Laptop (Blackwell, sm_120) — Windows 11, CUDA 13.2

Related issues: #1782, SYSTRAN/faster-whisper#71

Changes

include/ctranslate2/thread_pool.h:

  • Worker::join() now accepts timeout_ms parameter (default 5000ms). If the worker doesn't finish in time, both the worker thread and the timed-join helper thread are detached to prevent indefinite blocking.
  • Added Worker::prepare_shutdown() virtual method, called before queue.close().

include/ctranslate2/replica_pool.h:

  • ReplicaWorker overrides prepare_shutdown() to set _shutting_down atomic flag with release semantics.
  • idle() checks the flag with acquire semantics before calling synchronize_stream(), preventing the race.
  • finalize() also sets the flag before resetting the replica.
  • Added std::atomic<bool> _shutting_down member.

src/thread_pool.cc:

  • ThreadPool::~ThreadPool() calls prepare_shutdown() on all workers BEFORE queue.close().
  • JobQueue::get() releases the mutex before calling before_wait() callback and checks _request_end to avoid calling synchronize_stream() during shutdown.
  • Worker::join() implements timed join using std::promise/std::future with detach fallback.

Test plan

  • Whisper large-v3-turbo transcription on RTX 4090 (CUDA 13.2, Windows 10)
  • NLLB-200 translation after transcription on RTX 4090
  • Whisper transcription on RTX 5070 Blackwell (CUDA 13.2, Windows 11)
  • Application exits cleanly after transcription + translation
  • Multiple sequential transcription runs without hang
  • No regression in inference accuracy or performance

On Windows, the ThreadPool destructor can deadlock during CUDA model
cleanup. The root cause is a race condition between queue.close() and
the worker's idle() callback:

1. Worker enters idle() -> synchronize_stream() (blocking CUDA call)
2. Main thread calls queue.close() -> sets _request_end, notifies
3. Worker is stuck in synchronize_stream(), misses the notification
4. Worker::join() blocks indefinitely

This manifests as application hangs when unloading Whisper/NLLB models
after transcription or translation completes. Confirmed on RTX 4090
and RTX 5070 (Blackwell) with CUDA 12.x and 13.x.

Changes:

- Add Worker::prepare_shutdown() virtual method, called by
  ThreadPool::~ThreadPool() BEFORE queue.close(). This allows
  workers to stop blocking idle operations before the queue signals
  shutdown.

- ReplicaWorker overrides prepare_shutdown() to set _shutting_down
  atomic flag with release semantics. The idle() method checks this
  flag with acquire semantics before calling synchronize_stream().

- Worker::join() now accepts a timeout_ms parameter (default 5000ms).
  If the worker thread doesn't finish within the timeout, both the
  worker thread and the join helper are detached to prevent blocking
  the process. This handles the case where finalize() -> _replica.reset()
  hangs on CUDA resource deallocation.

- JobQueue::get() before_wait loop now checks _request_end before
  calling before_wait(), and releases the lock during the callback
  to prevent holding the mutex during blocking CUDA operations.

Tested with CTranslate2 Whisper and NLLB models on Windows 10/11
with CUDA 12.8 and 13.2 (sm_75 through sm_120).
@Purfview
Contributor

There is another PR for the same issue: #1912

@3manifold
Contributor

I think it's time to replace third_party/BS_thread_pool_light.hpp, include/ctranslate2/thread_pool.h etc. with the latest version of https://github.com/bshoshany/thread-pool (where the (deprecated) BS_thread_pool_light.hpp idea started in the first place).

