
Conversation

@kitaekatt
Contributor

Summary

Fixes a server hang when loading GGUF models that ship without precomputed tokenizer merges.

The Problem:

  • GGUF models without precomputed merges (e.g., bartowski/Phi-3.5-mini-instruct-GGUF) trigger build_merges_on_the_fly in transformers
  • This function runs in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager)
  • The subprocess invocation leaks a semaphore, causing the server to hang after logging:
    resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
    

The Fix:
Make tokenizer initialization lazy in StructuredOutputManager:

  • The tokenizer is loaded only when grammar_init() is first called, i.e., when structured output is actually needed
  • Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
  • For requests that do use structured output, the tokenizer is loaded on demand (see the sketch below)
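A minimal sketch of the lazy-initialization idea follows. The class layout and attribute names are illustrative assumptions based on this description, not the exact vLLM internals:

# Sketch of the lazy-initialization approach (assumed names, not vLLM code).
from transformers import AutoTokenizer, PreTrainedTokenizerBase

class StructuredOutputManager:
    def __init__(self, tokenizer_name: str):
        # Only record the name here; loading is deferred so the EngineCore
        # subprocess never triggers build_merges_on_the_fly unless
        # structured output is actually requested.
        self._tokenizer_name = tokenizer_name
        self._tokenizer: PreTrainedTokenizerBase | None = None

    def _init_tokenizer(self) -> PreTrainedTokenizerBase:
        return AutoTokenizer.from_pretrained(self._tokenizer_name)

    @property
    def tokenizer(self) -> PreTrainedTokenizerBase:
        # Load on first access instead of in __init__.
        if self._tokenizer is None:
            self._tokenizer = self._init_tokenizer()
        return self._tokenizer

    def grammar_init(self) -> None:
        # The first structured-output request pays the one-time load cost.
        _ = self.tokenizer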

Test plan

  • Tested locally with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M)
  • Server starts successfully without hang
  • Server responds to /v1/models endpoint
  • Server handles completion requests

Reproduction

# Without this fix, server hangs indefinitely after tokenizer merge build
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/Phi-3.5-mini-instruct-Q5_K_M.gguf \
  --tokenizer microsoft/Phi-3.5-mini-instruct \
  --max-model-len 2048

# With this fix, server starts successfully

Related Issues

🤖 Generated with Claude Code


@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a server hang issue with GGUF models by implementing lazy initialization for the tokenizer in StructuredOutputManager. This is a good approach to prevent semaphore leaks in subprocesses. The implementation correctly moves the initialization logic into a new _init_tokenizer method, triggered by a tokenizer property.

However, I've identified a critical race condition in the lazy initialization logic. Without proper locking, concurrent access from multiple threads could lead to duplicate tokenizer initializations, re-introducing the very problem this PR aims to solve. I've provided a suggestion to implement thread-safe initialization using a double-checked locking pattern.
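The suggested pattern looks roughly like this; a hedged sketch, where the lock attribute name follows the discussion below and the rest is assumed for illustration:

# Sketch of thread-safe lazy initialization via double-checked locking.
import threading

from transformers import AutoTokenizer, PreTrainedTokenizerBase

class StructuredOutputManager:
    def __init__(self, tokenizer_name: str):
        self._tokenizer_name = tokenizer_name
        self._tokenizer: PreTrainedTokenizerBase | None = None
        self._tokenizer_init_lock = threading.Lock()

    @property
    def tokenizer(self) -> PreTrainedTokenizerBase:
        # First check without the lock keeps the common, already-initialized
        # path cheap.
        if self._tokenizer is None:
            with self._tokenizer_init_lock:
                # Second check under the lock: another thread may have
                # finished initialization while this one was waiting.
                if self._tokenizer is None:
                    self._tokenizer = AutoTokenizer.from_pretrained(
                        self._tokenizer_name
                    )
        return self._tokenizer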

@kitaekatt force-pushed the fix/gguf-tokenizer-semaphore-leak branch from c163473 to 4a66ba3 on December 9, 2025 at 00:03
@kitaekatt
Contributor Author

Good catch on the race condition! I've added thread-safe initialization using the double-checked locking pattern with a threading.Lock(). This ensures that even if multiple threads access the tokenizer property concurrently, _init_tokenizer() will only be called once. Thanks for the review!


mergify bot commented Dec 9, 2025

Hi @kitaekatt, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@kitaekatt force-pushed the fix/gguf-tokenizer-semaphore-leak branch from 4a66ba3 to 08bf49e on December 9, 2025 at 00:22
@kitaekatt
Contributor Author

Thank you for catching the race condition. I've implemented thread-safe lazy initialization using a double-checked locking pattern with threading.Lock(). The fix ensures:

  1. Thread-safe access with _tokenizer_init_lock
  2. Double-checked locking to avoid redundant lock acquisition
  3. Proper type annotation (ThreadPoolExecutor | None) to satisfy mypy

Pre-commit checks now pass.
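The typing fix in point 3 is a one-line change of roughly this shape; the field name executor is assumed here for illustration:

from concurrent.futures import ThreadPoolExecutor

# An explicit "| None" annotation tells mypy the attribute may be unset
# until lazy initialization runs.
executor: ThreadPoolExecutor | None = None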

…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
  semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt force-pushed the fix/gguf-tokenizer-semaphore-leak branch from 08bf49e to a72d1f9 on December 9, 2025 at 00:45
