[BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak #30284
Conversation
Code Review
This pull request addresses a server hang issue with GGUF models by implementing lazy initialization for the tokenizer in StructuredOutputManager. This is a good approach to prevent semaphore leaks in subprocesses. The implementation correctly moves the initialization logic into a new _init_tokenizer method, triggered by a tokenizer property.
However, I've identified a critical race condition in the lazy initialization logic. Without proper locking, concurrent access from multiple threads could lead to duplicate tokenizer initializations, re-introducing the very problem this PR aims to solve. I've provided a suggestion to implement thread-safe initialization using a double-checked locking pattern.
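For illustration, a double-checked locking version of the lazy property might look like the sketch below. This is a simplified, hypothetical rendering of the suggestion, not the code in the PR; the `model_name` constructor argument and the use of `AutoTokenizer` are assumptions made to keep the example self-contained.

```python
import threading

from transformers import AutoTokenizer  # assumption: a transformers-backed tokenizer


class StructuredOutputManager:
    def __init__(self, model_name: str):
        self.model_name = model_name
        # Nothing heavy happens at construction time; the tokenizer is deferred.
        self._tokenizer = None
        self._tokenizer_lock = threading.Lock()

    def _init_tokenizer(self):
        # Placeholder for the real loading logic (for GGUF models this is where
        # merges would be built on the fly).
        return AutoTokenizer.from_pretrained(self.model_name)

    @property
    def tokenizer(self):
        # Double-checked locking: the first check skips the lock on the hot path
        # once the tokenizer exists; the second check inside the lock ensures
        # only one thread ever runs _init_tokenizer().
        if self._tokenizer is None:
            with self._tokenizer_lock:
                if self._tokenizer is None:
                    self._tokenizer = self._init_tokenizer()
        return self._tokenizer
```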
Force-pushed from c163473 to 4a66ba3
Good catch on the race condition! I've added thread-safe initialization using the double-checked locking pattern.
Hi @kitaekatt, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then commit the changes and push to your branch.
Force-pushed from 4a66ba3 to 08bf49e
Thank you for catching the race condition. I've implemented thread-safe lazy initialization using a double-checked locking pattern. Pre-commit checks now pass.
[BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly` in the transformers library, which uses multiprocessing primitives. When this happens in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager), the subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
- For requests that do use structured output, the tokenizer is loaded on demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
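To show the intended effect of the lazy path on the consumer side, here is a rough usage sketch built on the hypothetical class shown earlier in this thread; `handle_request` and its flag are made up for illustration and are not part of vLLM's API.

```python
# Assumes the StructuredOutputManager sketch from the review discussion above.
manager = StructuredOutputManager("bartowski/Phi-3.5-mini-instruct-GGUF")


def handle_request(use_structured_output: bool) -> None:
    if use_structured_output:
        # First access to the lazy property loads the tokenizer exactly once,
        # so only structured-output requests pay the GGUF merge-building cost.
        vocab_size = len(manager.tokenizer)
        print(f"grammar compiled against a vocab of {vocab_size} tokens")
    else:
        # Plain generation request: the tokenizer is never built in this
        # process, which is what avoids the duplicate init in EngineCore.
        print("no tokenizer needed for this request")


handle_request(False)  # no tokenizer load
handle_request(True)   # lazy, lock-protected load happens here
```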
Force-pushed from 08bf49e to a72d1f9
Summary
Fixes server hang when loading GGUF models without precomputed tokenizer merges.
The Problem:
- GGUF models without precomputed merges (e.g. bartowski/Phi-3.5-mini-instruct-GGUF) trigger `build_merges_on_the_fly` in transformers
- When this happens in both the APIServer process and the EngineCore subprocess, the subprocess leaks a semaphore and the server hangs

The Fix:
Make tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when `grammar_init()` is first called (i.e., when structured output is actually needed)
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded

Test plan
- Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M)
- Confirmed the /v1/models endpoint responds after startup

Reproduction
Related Issues
- leaked shared_memory warnings reported by multiprocessing when using vLLM #8803 (shared_memory leak, different symptom)

🤖 Generated with Claude Code