
Conversation

@kitaekatt
Contributor

Summary

Fixes a server hang when loading GGUF models that ship without precomputed tokenizer merges.

The Problem:

  • GGUF models without precomputed merges (e.g., bartowski/Phi-3.5-mini-instruct-GGUF) trigger build_merges_on_the_fly in transformers
  • This function runs in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager)
  • The subprocess invocation leaks a semaphore, causing the server to hang after logging:
    resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
    

The Fix:
Make tokenizer initialization lazy in StructuredOutputManager:

  • The tokenizer is loaded only when grammar_init() is first called, i.e., when structured output is actually needed
  • Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
  • For requests that do use structured output, the tokenizer is loaded on demand (see the sketch below)
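A minimal sketch of the lazy-initialization idea follows. The class layout and attribute names are illustrative assumptions based on this description, not the exact vLLM internals:

# Sketch of the lazy-initialization approach (assumed names, not vLLM code).
from transformers import AutoTokenizer, PreTrainedTokenizerBase

class StructuredOutputManager:
    def __init__(self, tokenizer_name: str):
        # Only record the name here; loading is deferred so the EngineCore
        # subprocess never triggers build_merges_on_the_fly unless
        # structured output is actually requested.
        self._tokenizer_name = tokenizer_name
        self._tokenizer: PreTrainedTokenizerBase | None = None

    def _init_tokenizer(self) -> PreTrainedTokenizerBase:
        return AutoTokenizer.from_pretrained(self._tokenizer_name)

    @property
    def tokenizer(self) -> PreTrainedTokenizerBase:
        # Load on first access instead of in __init__.
        if self._tokenizer is None:
            self._tokenizer = self._init_tokenizer()
        return self._tokenizer

    def grammar_init(self) -> None:
        # The first structured-output request pays the one-time load cost.
        _ = self.tokenizer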

Test plan

  • Tested locally with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M)
  • Server starts successfully without hang
  • Server responds to /v1/models endpoint
  • Server handles completion requests

Reproduction

# Without this fix, server hangs indefinitely after tokenizer merge build
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/Phi-3.5-mini-instruct-Q5_K_M.gguf \
  --tokenizer microsoft/Phi-3.5-mini-instruct \
  --max-model-len 2048

# With this fix, server starts successfully

Related Issues

🤖 Generated with Claude Code


@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a server hang issue with GGUF models by implementing lazy initialization for the tokenizer in StructuredOutputManager. This is a good approach to prevent semaphore leaks in subprocesses. The implementation correctly moves the initialization logic into a new _init_tokenizer method, triggered by a tokenizer property.

However, I've identified a critical race condition in the lazy initialization logic. Without proper locking, concurrent access from multiple threads could lead to duplicate tokenizer initializations, re-introducing the very problem this PR aims to solve. I've provided a suggestion to implement thread-safe initialization using a double-checked locking pattern.
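The suggested pattern looks roughly like this; a hedged sketch, where the lock attribute name follows the discussion below and the rest is assumed for illustration:

# Sketch of thread-safe lazy initialization via double-checked locking.
import threading

from transformers import AutoTokenizer, PreTrainedTokenizerBase

class StructuredOutputManager:
    def __init__(self, tokenizer_name: str):
        self._tokenizer_name = tokenizer_name
        self._tokenizer: PreTrainedTokenizerBase | None = None
        self._tokenizer_init_lock = threading.Lock()

    @property
    def tokenizer(self) -> PreTrainedTokenizerBase:
        # First check without the lock keeps the common, already-initialized
        # path cheap.
        if self._tokenizer is None:
            with self._tokenizer_init_lock:
                # Second check under the lock: another thread may have
                # finished initialization while this one was waiting.
                if self._tokenizer is None:
                    self._tokenizer = AutoTokenizer.from_pretrained(
                        self._tokenizer_name
                    )
        return self._tokenizer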

@kitaekatt force-pushed the fix/gguf-tokenizer-semaphore-leak branch from c163473 to 4a66ba3 on December 9, 2025 at 00:03
@kitaekatt
Contributor Author

Good catch on the race condition! I've added thread-safe initialization using the double-checked locking pattern with a threading.Lock(). This ensures that even if multiple threads access the tokenizer property concurrently, _init_tokenizer() will only be called once. Thanks for the review!


mergify bot commented Dec 9, 2025

Hi @kitaekatt, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@kitaekatt force-pushed the fix/gguf-tokenizer-semaphore-leak branch from 4a66ba3 to 08bf49e on December 9, 2025 at 00:22
@kitaekatt
Contributor Author

Thank you for catching the race condition. I've implemented thread-safe lazy initialization using a double-checked locking pattern with threading.Lock(). The fix ensures:

  1. Thread-safe access with _tokenizer_init_lock
  2. Double-checked locking to avoid redundant lock acquisition
  3. Proper type annotation (ThreadPoolExecutor | None) to satisfy mypy

Pre-commit checks now pass.
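The typing fix in point 3 is a one-line change of roughly this shape; the field name executor is assumed here for illustration:

from concurrent.futures import ThreadPoolExecutor

# An explicit "| None" annotation tells mypy the attribute may be unset
# until lazy initialization runs.
executor: ThreadPoolExecutor | None = None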

…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
  semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt force-pushed the fix/gguf-tokenizer-semaphore-leak branch from 08bf49e to a72d1f9 on December 9, 2025 at 00:45
