
Fix ThreadPool shutdown deadlock on Windows with CUDA#2027

Open
Deverydoo wants to merge 1 commit into OpenNMT:master from Deverydoo:fix/windows-cuda-shutdown-hang

Conversation

@Deverydoo

Summary

Fixes application hang when destroying CTranslate2 models on Windows with CUDA GPU acceleration. The ThreadPool destructor deadlocks because worker threads get stuck in blocking CUDA synchronization calls during shutdown.

Problem

When a CTranslate2 model is destroyed (e.g., after Whisper transcription or NLLB translation completes), the ThreadPool destructor calls queue.close() then worker->join(). A race condition causes deadlock:

  1. Worker thread enters idle() → synchronize_stream() (blocking CUDA call)
  2. Main thread calls queue.close() → sets _request_end, notifies condition variable
  3. Worker is blocked inside synchronize_stream(), not waiting on the condition variable — notification is lost
  4. Worker::join() blocks indefinitely waiting for the worker to exit

This affects any application using CTranslate2 with CUDA on Windows. Confirmed on:

  • NVIDIA RTX 4090 (Ada Lovelace, sm_89) — Windows 10, CUDA 12.8 and 13.2
  • NVIDIA RTX 5070 Laptop (Blackwell, sm_120) — Windows 11, CUDA 13.2

Related issues: #1782, SYSTRAN/faster-whisper#71

Changes

include/ctranslate2/thread_pool.h:

  • Worker::join() now accepts timeout_ms parameter (default 5000ms). If the worker doesn't finish in time, both the worker thread and the timed-join helper thread are detached to prevent indefinite blocking.
  • Added Worker::prepare_shutdown() virtual method, called before queue.close().

include/ctranslate2/replica_pool.h:

  • ReplicaWorker overrides prepare_shutdown() to set _shutting_down atomic flag with release semantics.
  • idle() checks the flag with acquire semantics before calling synchronize_stream(), preventing the race.
  • finalize() also sets the flag before resetting the replica.
  • Added std::atomic<bool> _shutting_down member.

src/thread_pool.cc:

  • ThreadPool::~ThreadPool() calls prepare_shutdown() on all workers BEFORE queue.close().
  • JobQueue::get() releases the mutex before calling before_wait() callback and checks _request_end to avoid calling synchronize_stream() during shutdown.
  • Worker::join() implements timed join using std::promise/std::future with detach fallback.

Test plan

  • Whisper large-v3-turbo transcription on RTX 4090 (CUDA 13.2, Windows 10)
  • NLLB-200 translation after transcription on RTX 4090
  • Whisper transcription on RTX 5070 Blackwell (CUDA 13.2, Windows 11)
  • Application exits cleanly after transcription + translation
  • Multiple sequential transcription runs without hang
  • No regression in inference accuracy or performance

On Windows, the ThreadPool destructor can deadlock during CUDA model
cleanup. The root cause is a race condition between queue.close() and
the worker's idle() callback:

1. Worker enters idle() -> synchronize_stream() (blocking CUDA call)
2. Main thread calls queue.close() -> sets _request_end, notifies
3. Worker is stuck in synchronize_stream(), misses the notification
4. Worker::join() blocks indefinitely

This manifests as application hangs when unloading Whisper/NLLB models
after transcription or translation completes. Confirmed on RTX 4090
and RTX 5070 (Blackwell) with CUDA 12.x and 13.x.

Changes:

- Add Worker::prepare_shutdown() virtual method, called by
  ThreadPool::~ThreadPool() BEFORE queue.close(). This allows
  workers to stop blocking idle operations before the queue signals
  shutdown.

- ReplicaWorker overrides prepare_shutdown() to set _shutting_down
  atomic flag with release semantics. The idle() method checks this
  flag with acquire semantics before calling synchronize_stream().

- Worker::join() now accepts a timeout_ms parameter (default 5000ms).
  If the worker thread doesn't finish within the timeout, both the
  worker thread and the join helper are detached to prevent blocking
  the process. This handles the case where finalize() -> _replica.reset()
  hangs on CUDA resource deallocation.

- JobQueue::get() before_wait loop now checks _request_end before
  calling before_wait(), and releases the lock during the callback
  to prevent holding the mutex during blocking CUDA operations.

Tested with CTranslate2 Whisper and NLLB models on Windows 10/11
with CUDA 12.8 and 13.2 (sm_75 through sm_120).
@Purfview
Contributor

There is another PR for the same issue: #1912

@3manifold
Contributor

I think it's time to replace third_party/BS_thread_pool_light.hpp, include/ctranslate2/thread_pool.h etc. with the latest version of https://github.com/bshoshany/thread-pool (where the (deprecated) BS_thread_pool_light.hpp idea started in the first place).

