Fix health latency under concurrent VL request preparation #4570
CUHKSZzxy wants to merge 4 commits into InternLM:main
Conversation
Pull request overview
This PR targets slow /health responses observed under bursts of concurrent vision-language (VL) requests by moving CPU-heavy request preparation off the FastAPI event loop and introducing async locking to prevent executor backlogs.
Changes:
- Add an async lock around `ImageEncoder` executor usage to throttle VL preprocess/forward/wrap work submissions.
- Offload PyTorch/TurboMind wrapping (`to_pytorch*`, `to_turbomind`) into the `ImageEncoder` thread executor.
- Offload chat-template rendering/tokenization and new-style VL `get_input_prompt()` into an executor and serialize these operations with an async lock.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lmdeploy/vl/engine.py | Serializes and offloads VL preprocess/forward/wrapping work to a single-thread executor to reduce event-loop starvation. |
| lmdeploy/serve/processors/multimodal.py | Moves prompt rendering/tokenization and get_input_prompt() into an executor and gates concurrent submissions to improve /health responsiveness. |
What about using asyncio.Semaphore instead of asyncio.Lock()?
Got some response from GPT5.5, which looks reasonable to me. I think we can use a semaphore, but the value filled in
What would you prefer? @lvhan028
I just think the logic of
How much impact does it have on performance?
I did some quick benchmarks on text, base64 image, and image HTTP URL inputs, with qwen3.5-35b, TP=2. In short, text performance is not affected, long base64 messages improve, and image HTTP URL messages regress.

Benchmark Results

These benchmarks compare

Summary

Text-Only Benchmark
Dataset: ShareGPT

Base64 Image Benchmark
Requests: 20

HTTP Image URL Benchmark
Requests: 1000

Notes

The current branch keeps text-only throughput roughly unchanged and improves the heavy base64-image case. The HTTP image URL case is slower in this run, which is expected to be the main tradeoff to watch because its request payload is much smaller and benefits less from serializing request preparation.
Agreed that it looks complicated; I simplified
The helper is not mainly about choosing lock vs. semaphore. It handles cancellation semantics around
So even with a semaphore, we still need to shield the executor future and keep the async gate held until the executor job finishes.
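As a sketch of those semantics (helper name and shape hypothetical, not the PR's actual code): the gate stays held across the whole executor job, and a cancelled caller waits for the job to finish before the gate is released.

```python
import asyncio

async def run_gated(gate: asyncio.Lock, fn, *args):
    """Run fn in the default thread executor while holding `gate`.

    The executor future is shielded so that cancelling the caller
    cannot release the gate while the thread is still running fn.
    """
    loop = asyncio.get_running_loop()
    async with gate:
        inner = loop.run_in_executor(None, fn, *args)
        try:
            return await asyncio.shield(inner)
        except asyncio.CancelledError:
            # Caller was cancelled: keep the gate held until the
            # executor job actually finishes, then propagate.
            await inner
            raise

async def demo() -> int:
    gate = asyncio.Lock()
    return await run_gated(gate, lambda: 42)

print(asyncio.run(demo()))  # prints 42
```

The same structure works with `asyncio.Semaphore` in place of the lock; the shield-and-wait part is what a plain `async with` cannot provide on its own.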
Summary
Fix slow server responsiveness under concurrent VL requests by moving CPU-heavy request preparation off the FastAPI event loop and gating executor submissions.
Root Cause
VL request preparation can spend noticeable time in synchronous Python work: chat-template rendering, tokenizer encoding, `get_input_prompt()`, VL preprocess, wrapping, and vision helper calls. When many requests arrive together, this work can either block the uvicorn/FastAPI event loop directly or build a large single-worker executor backlog. A lightweight endpoint probe then waits behind request preparation and appears stuck. Once the burst drains, later probes return immediately.
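The executor-backlog half of this can be sketched with a plain single-worker `ThreadPoolExecutor` (numbers illustrative, not from the PR):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A single-worker executor, like the one used for VL request preparation.
executor = ThreadPoolExecutor(max_workers=1)

def prep(i: int) -> int:
    time.sleep(0.05)  # stand-in for CPU-heavy preparation of one request
    return i

start = time.monotonic()
# A burst of requests fills the worker's queue...
futures = [executor.submit(prep, i) for i in range(10)]
# ...so a later, trivial job waits behind the whole backlog.
probe = executor.submit(lambda: time.monotonic() - start)
wait_time = probe.result()
print(f"trivial job waited {wait_time:.2f}s behind the backlog")
executor.shutdown()
```

This is why the PR gates submissions with an async primitive: waiters then suspend on the event loop instead of piling work into the executor queue.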
Fix
- Offload chat-template rendering/tokenization and new-style VL `get_input_prompt()` into an executor.
- Gate `ImageEncoder` executor submissions so waiters yield to the event loop instead of building a large backlog.
- Move `to_pytorch()` and `to_turbomind()` wrapping to the existing image encoder executor.
- Keep the async gate held on `CancelledError` until the protected executor work finishes.

This does not add a separate health port or change endpoint routing.
Reproduction
A minimal FastAPI reproduction, without any real model or private request data:
Run it, then start the blocking request and probe health while it is active:
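The same mechanism can also be sketched without FastAPI or a running server: synchronous work executed directly in a coroutine stalls a trivial probe, while the same work offloaded to an executor does not. This is an illustrative analogue of the reproduction, not the actual script; timings are arbitrary.

```python
import asyncio
import time

def heavy_request_prep(duration: float) -> None:
    # Stand-in for synchronous request preparation (template rendering,
    # tokenization, VL preprocessing); blocks the calling thread.
    time.sleep(duration)

async def probe_latency(scheduled_at: float) -> float:
    # Like a trivial /health handler: reports how long the event loop
    # took to get around to running it.
    return time.monotonic() - scheduled_at

async def measure(offload: bool) -> float:
    loop = asyncio.get_running_loop()
    if offload:
        # Fixed behavior: heavy prep runs in a thread executor.
        prep = loop.run_in_executor(None, heavy_request_prep, 0.5)
    else:
        # Buggy behavior: heavy prep runs on the event loop itself.
        async def handler():
            heavy_request_prep(0.5)
        prep = asyncio.ensure_future(handler())
    probe = asyncio.ensure_future(probe_latency(time.monotonic()))
    latency = await probe
    await prep
    return latency

blocked = asyncio.run(measure(offload=False))
offloaded = asyncio.run(measure(offload=True))
print(f"probe latency with blocking prep:  {blocked:.3f}s")
print(f"probe latency with offloaded prep: {offloaded:.3f}s")
```

In the blocking case the probe waits roughly the full duration of the synchronous prep; with the executor it is scheduled almost immediately.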
Expected behavior before this fix:
/health waits for the blocking request-prep work to finish. This mirrors the observed LMDeploy symptom: the health route itself is trivial, but it cannot run while the server event loop is occupied by synchronous request preparation.
Tests
This PR also adds a deterministic cancellation regression test for the executor-gating path:
Before the cancellation fix, the test fails because a cancelled caller can release the async lock while protected executor work is still pending.
Validation