
Fix health latency under concurrent VL request preparation #4570

Open

CUHKSZzxy wants to merge 4 commits into InternLM:main from CUHKSZzxy:fix/health-event-loop-starvation

Conversation

@CUHKSZzxy (Collaborator) commented May 7, 2026

Summary

Fix slow server responsiveness under concurrent VL requests by moving CPU-heavy request preparation off the FastAPI event loop and gating executor submissions.

Root Cause

VL request preparation can spend noticeable time in synchronous Python work: chat-template rendering, tokenizer encoding, get_input_prompt(), VL preprocess, wrapping, and vision helper calls.

When many requests arrive together, this work can either block the uvicorn/FastAPI event loop directly or build a large single-worker executor backlog. A lightweight endpoint probe then waits behind request preparation and appears stuck. Once the burst drains, later probes return immediately.

Fix

  • Move text prompt rendering and tokenization into an executor.
  • Move new-style VL get_input_prompt() into an executor.
  • Add async locks around prompt and ImageEncoder executor submissions so waiters yield to the event loop instead of building a large backlog.
  • Offload synchronous to_pytorch() and to_turbomind() wrapping to the existing image encoder executor.
  • Preserve executor gating after client cancellation by waiting for protected executor futures before releasing the lock, then propagating CancelledError.

This does not add a separate health port or change endpoint routing.
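
A minimal sketch of the gating pattern described above, assuming a single-worker executor (`_executor`, `_prep_lock`, and `_render_and_tokenize` are illustrative names, not the actual LMDeploy identifiers):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)
_prep_lock = asyncio.Lock()

def _render_and_tokenize(messages):
    # stand-in for chat-template rendering + tokenizer encoding
    return messages

async def prepare_request(messages):
    loop = asyncio.get_running_loop()
    # Waiters suspend on the async lock and yield to the event loop,
    # instead of piling submissions onto the single-worker executor.
    async with _prep_lock:
        return await loop.run_in_executor(_executor, _render_and_tokenize, messages)
```

The key property is that waiters suspend at the async lock rather than queueing jobs in the executor, so a /health probe can be scheduled between submissions.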

Reproduction

A minimal FastAPI reproduction, without any real model or private request data:

```python
# repro_event_loop_block.py
import time

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get('/health')
async def health():
    return {}

@app.post('/v1/chat/completions')
async def chat():
    time.sleep(10)  # simulates sync prompt/tokenizer/VL prep on the event loop
    return {'ok': True}

if __name__ == '__main__':
    uvicorn.run(app, host='127.0.0.1', port=5678)
```

Run it, then start the blocking request and probe health while it is active:

```bash
python repro_event_loop_block.py
curl -X POST http://127.0.0.1:5678/v1/chat/completions &
time curl -i http://127.0.0.1:5678/health
```

Expected behavior before this fix: /health waits for the blocking request-prep work to finish. This mirrors the observed LMDeploy symptom: the health route itself is trivial, but it cannot run while the server event loop is occupied by synchronous request preparation.

Tests

This PR also adds a deterministic cancellation regression test for the executor-gating path:

```bash
pytest tests/test_lmdeploy/test_executor_cancellation.py
```

Before the cancellation fix, the test fails because a cancelled caller can release the async lock while protected executor work is still pending.
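
For illustration, a self-contained sketch of the invariant this test guards (assuming pytest-asyncio; the test name and structure here are hypothetical, not the contents of the actual test file):

```python
import asyncio
import threading

import pytest

@pytest.mark.asyncio
async def test_gate_held_until_executor_job_finishes():
    lock = asyncio.Lock()
    started, release = threading.Event(), threading.Event()

    def slow_job():
        started.set()
        release.wait()  # stands in for long sync prep in the executor thread

    async def gated():
        loop = asyncio.get_running_loop()
        async with lock:
            fut = loop.run_in_executor(None, slow_job)
            try:
                await asyncio.shield(fut)
            except asyncio.CancelledError:
                await fut  # keep the gate held until the thread work ends
                raise

    task = asyncio.create_task(gated())
    await asyncio.to_thread(started.wait)  # executor job is now running
    task.cancel()
    await asyncio.sleep(0.05)  # let the cancellation be delivered
    assert lock.locked()  # gate still held despite the cancellation
    release.set()
    with pytest.raises(asyncio.CancelledError):
        await task
```

Without the fix, gated() would release the lock as soon as the cancellation arrived, so the lock.locked() assertion fails while slow_job is still running in the executor thread.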

Validation

```bash
conda activate vl
pytest tests/test_lmdeploy/test_executor_cancellation.py
python -m py_compile lmdeploy/utils.py lmdeploy/serve/processors/multimodal.py lmdeploy/vl/engine.py tests/test_lmdeploy/test_executor_cancellation.py
pre-commit run --files lmdeploy/utils.py lmdeploy/serve/processors/multimodal.py lmdeploy/vl/engine.py tests/test_lmdeploy/test_executor_cancellation.py
```

Copilot AI review requested due to automatic review settings May 7, 2026 09:44

Copilot AI (Contributor) left a comment

Pull request overview

This PR targets slow /health responses observed under bursts of concurrent vision-language (VL) requests by moving CPU-heavy request preparation off the FastAPI event loop and introducing async locking to prevent executor backlogs.

Changes:

  • Add an async lock around ImageEncoder executor usage to throttle VL preprocess/forward/wrap work submissions.
  • Offload PyTorch/TurboMind wrapping (to_pytorch*, to_turbomind) into the ImageEncoder thread executor.
  • Offload chat template rendering/tokenization and new-style VL get_input_prompt() into an executor and serialize these operations with an async lock.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| lmdeploy/vl/engine.py | Serializes and offloads VL preprocess/forward/wrapping work to a single-thread executor to reduce event-loop starvation. |
| lmdeploy/serve/processors/multimodal.py | Moves prompt rendering/tokenization and get_input_prompt() into an executor and gates concurrent submissions to improve /health responsiveness. |


@lvhan028 lvhan028 added the Bug:P0 label May 8, 2026
@lvhan028 lvhan028 requested review from irexyc May 8, 2026 09:41
@irexyc irexyc force-pushed the fix/health-event-loop-starvation branch 2 times, most recently from 47238ba to 70301bd on May 8, 2026 13:40
@irexyc (Collaborator) commented May 8, 2026

What about using asyncio.Semaphore instead of asyncio.Lock()?

@CUHKSZzxy (Collaborator, Author) commented

> What about using asyncio.Semaphore instead of asyncio.Lock()?

I got a response from GPT5.5 on this, which looks reasonable to me:

I'd use an asyncio.Lock, not a wider semaphore, for this specific health-lag fix.

A lock is basically Semaphore(1), and here 1 is the value I'd recommend.

Why:

  • The problematic section is synchronous, GIL-heavy prompt rendering or preprocessing that runs before the request fully enters normal async scheduling.
  • run_in_executor moves that work off the event-loop thread, but if many requests submit CPU-heavy jobs at once, executor threads can still compete with the event loop through the GIL.
  • The async lock lets only one request submit/run that sensitive prep section at a time.
  • Other requests wait at await lock.acquire(), which yields control back to the event loop, so /health can still be scheduled quickly.
  • A semaphore like asyncio.Semaphore(2) or asyncio.Semaphore(4) might improve request-prep parallelism, but it also reintroduces the risk that multiple GIL-heavy executor jobs compete with the event loop, so I would not start there.
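
For concreteness, the call-site shape is the same either way; a rough sketch with a configurable gate width (prep_gate and gated_prep are illustrative names, not code from this PR):

```python
import asyncio

# Semaphore(1) behaves like asyncio.Lock; a wider value trades
# event-loop responsiveness for request-prep parallelism.
prep_gate = asyncio.Semaphore(1)

async def gated_prep(job, *args):
    loop = asyncio.get_running_loop()
    async with prep_gate:  # waiters suspend here and yield to the event loop
        return await loop.run_in_executor(None, job, *args)
```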

I think we can use a semaphore, but then choosing the right value for the Semaphore becomes the new issue.

What would you prefer? @lvhan028

@irexyc (Collaborator) commented May 9, 2026

I just think the logic of await_executor_future is a bit complicated. If we use asyncio.Semaphore, could we eliminate this function?

@lvhan028 (Collaborator) commented May 9, 2026

How much impact does it have on performance?
If an image URL rather than base64 is passed, will performance drop dramatically?

@CUHKSZzxy (Collaborator, Author) commented May 9, 2026

> How much impact does it have on performance? If an image URL rather than base64 is passed, will performance drop dramatically?

I ran some quick benchmarks on text-only, base64 image, and HTTP image URL workloads, with qwen3.5-35b, TP=2.

In short: text performance is unaffected, long base64 image messages improve, and HTTP image URL messages regress.

Benchmark Results

These benchmarks compare main and the current branch under text-only, base64 image, and HTTP image URL workloads.

Summary

| Workload | Branch | Requests | Output Length | Wall Time / Duration | Throughput | Mean Latency | Median Latency |
|---|---|---|---|---|---|---|---|
| Text | main | 5000 | 2048 | 668.23 s | 7.48 req/s | 402589.61 ms | 410340.81 ms |
| Text | current | 5000 | 2048 | 667.15 s | 7.49 req/s | 400743.65 ms | 399351.97 ms |
| Base64 image | main | 20 | 4096 | 277.36 s | 0.07 req/s | 273.00 s | 277.29 s |
| Base64 image | current | 20 | 4096 | 245.87 s | 0.08 req/s | 241.99 s | 245.72 s |
| HTTP image URL | main | 1000 | 256 | 29.00 s | 34.48 req/s | 27.35 s | 26.96 s |
| HTTP image URL | current | 1000 | 256 | 34.88 s | 28.67 req/s | 33.44 s | 33.32 s |

Text-Only Benchmark

Dataset: ShareGPT
Prompts: 5000
Output length: 2048
Input tokens: 1,229,468
Output tokens: 10,240,000

| Metric | main | current |
|---|---|---|
| Benchmark duration | 668.23 s | 667.15 s |
| Request throughput | 7.48 req/s | 7.49 req/s |
| Input token throughput | 1839.88 tok/s | 1842.85 tok/s |
| Output token throughput | 15324.00 tok/s | 15348.77 tok/s |
| Mean E2E latency | 402589.61 ms | 400743.65 ms |
| Median E2E latency | 410340.81 ms | 399351.97 ms |
| Mean TTFT | 274000.04 ms | 275300.25 ms |
| Median TTFT | 279548.01 ms | 277838.53 ms |
| P99 TTFT | 560014.47 ms | 559741.67 ms |
| Mean TPOT | 62.82 ms | 61.28 ms |
| Median TPOT | 63.82 ms | 60.57 ms |
| P99 TPOT | 75.43 ms | 74.52 ms |
| Mean ITL | 891.75 ms | 845.23 ms |
| Median ITL | 792.10 ms | 773.61 ms |
| P99 ITL | 2696.69 ms | 1918.19 ms |

Base64 Image Benchmark

Requests: 20
Output length: 4096

| Metric | main | current |
|---|---|---|
| Total wall time | 277.36 s | 245.87 s |
| Successful requests | 20 | 20 |
| Failed requests | 0 | 0 |
| Min latency | 247.95 s | 208.91 s |
| Median latency | 277.29 s | 245.72 s |
| Mean latency | 273.00 s | 241.99 s |
| Max latency | 277.35 s | 245.87 s |
| Latency stdev | 9.21 s | 9.33 s |
| Throughput | 0.07 req/s | 0.08 req/s |

HTTP Image URL Benchmark

Requests: 1000
Output length: 256

| Metric | main | current |
|---|---|---|
| Total wall time | 29.00 s | 34.88 s |
| Successful requests | 1000 | 1000 |
| Failed requests | 0 | 0 |
| Min latency | 25.88 s | 32.33 s |
| Median latency | 26.96 s | 33.32 s |
| Mean latency | 27.35 s | 33.44 s |
| Max latency | 28.85 s | 34.73 s |
| Latency stdev | 1.03 s | 0.68 s |
| Throughput | 34.48 req/s | 28.67 req/s |

Notes

The current branch keeps text-only throughput roughly unchanged and improves the heavy base64-image case. The HTTP image URL case is slower in this run, and it is the main tradeoff to watch: its request payload is much smaller, so it benefits less from serializing request preparation.

@CUHKSZzxy (Collaborator, Author) commented May 9, 2026

> I just think the logic of await_executor_future is a bit complicated. If we use asyncio.Semaphore, could we eliminate this function?

Agreed that it looks complicated. I simplified await_executor_future, but I haven't found a way to remove the helper entirely; it was introduced to address the Copilot review comments.

The helper is not mainly about choosing lock vs semaphore. It handles cancellation semantics around run_in_executor: if the coroutine is cancelled while awaiting the executor future, the async context manager would otherwise release the lock/semaphore immediately, while the underlying thread work keeps running. That would allow later requests to submit more executor work and rebuild the backlog.

So even with a semaphore, we still need to shield the executor future and keep the async gate held until the executor job finishes.
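
A minimal sketch of the cancellation-safe pattern the helper implements (the name await_executor_future matches the PR; the body is illustrative, not the actual implementation):

```python
import asyncio

async def await_executor_future(fut: asyncio.Future):
    """Await an executor future while staying cancellation-safe."""
    try:
        return await asyncio.shield(fut)
    except asyncio.CancelledError:
        # The thread job cannot be interrupted, so wait for it to finish
        # before propagating; the caller's async gate stays held meanwhile.
        if not fut.done():
            await asyncio.wait([fut])
        raise
```

Called inside the async with lock: block, this keeps the gate closed for exactly as long as the executor thread is busy, which is the invariant the regression test above checks.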
