
Fix health latency under concurrent VL request preparation #4570

Open

CUHKSZzxy wants to merge 4 commits into InternLM:main from CUHKSZzxy:fix/health-event-loop-starvation

Conversation

@CUHKSZzxy (Collaborator) commented May 7, 2026

Summary

Fix slow server responsiveness under concurrent VL requests by moving CPU-heavy request preparation off the FastAPI event loop and gating executor submissions.

Root Cause

VL request preparation can spend noticeable time in synchronous Python work: chat-template rendering, tokenizer encoding, get_input_prompt(), VL preprocess, wrapping, and vision helper calls.

When many requests arrive together, this work can either block the uvicorn/FastAPI event loop directly or build a large single-worker executor backlog. A lightweight endpoint probe then waits behind request preparation and appears stuck. Once the burst drains, later probes return immediately.

Fix

  • Move text prompt rendering and tokenization into an executor.
  • Move new-style VL get_input_prompt() into an executor.
  • Add async locks around prompt and ImageEncoder executor submissions so waiters yield to the event loop instead of building a large backlog.
  • Offload synchronous to_pytorch() and to_turbomind() wrapping to the existing image encoder executor.
  • Preserve executor gating after client cancellation by waiting for protected executor futures before releasing the lock, then propagating CancelledError.

This does not add a separate health port or change endpoint routing.
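
A minimal sketch of the gating pattern described above, assuming a single-worker executor (`_executor`, `_prep_lock`, and `_render_and_tokenize` are illustrative names, not the actual LMDeploy identifiers):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)
_prep_lock = asyncio.Lock()

def _render_and_tokenize(messages):
    # stand-in for chat-template rendering + tokenizer encoding
    return messages

async def prepare_request(messages):
    loop = asyncio.get_running_loop()
    # Waiters suspend on the async lock and yield to the event loop,
    # instead of piling submissions onto the single-worker executor.
    async with _prep_lock:
        return await loop.run_in_executor(_executor, _render_and_tokenize, messages)
```

The key property is that waiters suspend at the async lock rather than queueing jobs in the executor, so a /health probe can be scheduled between submissions.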

Reproduction

A minimal FastAPI reproduction, without any real model or private request data:

```python
# repro_event_loop_block.py
import time

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get('/health')
async def health():
    return {}

@app.post('/v1/chat/completions')
async def chat():
    time.sleep(10)  # simulates sync prompt/tokenizer/VL prep on the event loop
    return {'ok': True}

if __name__ == '__main__':
    uvicorn.run(app, host='127.0.0.1', port=5678)
```

Run it, then start the blocking request and probe health while it is active:

```bash
python repro_event_loop_block.py
curl -X POST http://127.0.0.1:5678/v1/chat/completions &
time curl -i http://127.0.0.1:5678/health
```

Expected behavior before this fix: /health waits for the blocking request-prep work to finish. This mirrors the observed LMDeploy symptom: the health route itself is trivial, but it cannot run while the server event loop is occupied by synchronous request preparation.

Tests

This PR also adds a deterministic cancellation regression test for the executor-gating path:

```bash
pytest tests/test_lmdeploy/test_executor_cancellation.py
```

Before the cancellation fix, the test fails because a cancelled caller can release the async lock while protected executor work is still pending.
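
For illustration, a self-contained sketch of the invariant this test guards (assuming pytest-asyncio; the test name and structure here are hypothetical, not the contents of the actual test file):

```python
import asyncio
import threading

import pytest

@pytest.mark.asyncio
async def test_gate_held_until_executor_job_finishes():
    lock = asyncio.Lock()
    started, release = threading.Event(), threading.Event()

    def slow_job():
        started.set()
        release.wait()  # stands in for long sync prep in the executor thread

    async def gated():
        loop = asyncio.get_running_loop()
        async with lock:
            fut = loop.run_in_executor(None, slow_job)
            try:
                await asyncio.shield(fut)
            except asyncio.CancelledError:
                await fut  # keep the gate held until the thread work ends
                raise

    task = asyncio.create_task(gated())
    await asyncio.to_thread(started.wait)  # executor job is now running
    task.cancel()
    await asyncio.sleep(0.05)  # let the cancellation be delivered
    assert lock.locked()  # gate still held despite the cancellation
    release.set()
    with pytest.raises(asyncio.CancelledError):
        await task
```

Without the fix, gated() would release the lock as soon as the cancellation arrived, so the lock.locked() assertion fails while slow_job is still running in the executor thread.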

Validation

```bash
conda activate vl
pytest tests/test_lmdeploy/test_executor_cancellation.py
python -m py_compile lmdeploy/utils.py lmdeploy/serve/processors/multimodal.py lmdeploy/vl/engine.py tests/test_lmdeploy/test_executor_cancellation.py
pre-commit run --files lmdeploy/utils.py lmdeploy/serve/processors/multimodal.py lmdeploy/vl/engine.py tests/test_lmdeploy/test_executor_cancellation.py
```

Copilot AI review requested due to automatic review settings May 7, 2026 09:44

Copilot AI (Contributor) left a comment

Pull request overview

This PR targets slow /health responses observed under bursts of concurrent vision-language (VL) requests by moving CPU-heavy request preparation off the FastAPI event loop and introducing async locking to prevent executor backlogs.

Changes:

  • Add an async lock around ImageEncoder executor usage to throttle VL preprocess/forward/wrap work submissions.
  • Offload PyTorch/TurboMind wrapping (to_pytorch*, to_turbomind) into the ImageEncoder thread executor.
  • Offload chat template rendering/tokenization and new-style VL get_input_prompt() into an executor and serialize these operations with an async lock.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| lmdeploy/vl/engine.py | Serializes and offloads VL preprocess/forward/wrapping work to a single-thread executor to reduce event-loop starvation. |
| lmdeploy/serve/processors/multimodal.py | Moves prompt rendering/tokenization and get_input_prompt() into an executor and gates concurrent submissions to improve /health responsiveness. |


@lvhan028 lvhan028 added the Bug:P0 label May 8, 2026
@lvhan028 lvhan028 requested review from irexyc May 8, 2026 09:41
@irexyc irexyc force-pushed the fix/health-event-loop-starvation branch 2 times, most recently from 47238ba to 70301bd on May 8, 2026 13:40
@irexyc (Collaborator) commented May 8, 2026

What about using asyncio.Semaphore instead of asyncio.Lock()?

@CUHKSZzxy (Collaborator, Author) commented

> What about using asyncio.Semaphore instead of asyncio.Lock()?

I got a response from GPT5.5 on this, which looks reasonable to me:

I'd use an asyncio.Lock, not a wider semaphore, for this specific health-lag fix.

A lock is basically Semaphore(1), and here 1 is the value I'd recommend.

Why:

  • The problematic section is synchronous, GIL-heavy prompt rendering or preprocessing that runs before the request fully enters normal async scheduling.
  • run_in_executor moves that work off the event-loop thread, but if many requests submit CPU-heavy jobs at once, executor threads can still compete with the event loop through the GIL.
  • The async lock lets only one request submit/run that sensitive prep section at a time.
  • Other requests wait at await lock.acquire(), which yields control back to the event loop, so /health can still be scheduled quickly.
  • A semaphore like asyncio.Semaphore(2) or asyncio.Semaphore(4) might improve request-prep parallelism, but it also reintroduces the risk that multiple GIL-heavy executor jobs compete with the event loop, so I would not start there.
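
For concreteness, the call-site shape is the same either way; a rough sketch with a configurable gate width (prep_gate and gated_prep are illustrative names, not code from this PR):

```python
import asyncio

# Semaphore(1) behaves like asyncio.Lock; a wider value trades
# event-loop responsiveness for request-prep parallelism.
prep_gate = asyncio.Semaphore(1)

async def gated_prep(job, *args):
    loop = asyncio.get_running_loop()
    async with prep_gate:  # waiters suspend here and yield to the event loop
        return await loop.run_in_executor(None, job, *args)
```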

I think we can use a semaphore, but then choosing the right value for the Semaphore becomes the new issue.

What would you prefer? @lvhan028

@irexyc (Collaborator) commented May 9, 2026

I just think the logic of await_executor_future is a bit complicated. If we use asyncio.Semaphore, could we eliminate this function?

@lvhan028 (Collaborator) commented May 9, 2026

How much impact does it have on performance?
If an image URL rather than base64 is passed, will performance drop dramatically?

@CUHKSZzxy (Collaborator, Author) commented May 9, 2026

> How much impact does it have on performance? If an image URL rather than base64 is passed, will performance drop dramatically?

I ran some quick benchmarks on text-only, base64 image, and HTTP image URL workloads, with qwen3.5-35b, TP=2.

In short: text performance is unaffected, long base64 image messages improve, and HTTP image URL messages regress.

Benchmark Results

These benchmarks compare main and the current branch under text-only, base64 image, and HTTP image URL workloads.

Summary

| Workload | Branch | Requests | Output Length | Wall Time / Duration | Throughput | Mean Latency | Median Latency |
|---|---|---|---|---|---|---|---|
| Text | main | 5000 | 2048 | 668.23 s | 7.48 req/s | 402589.61 ms | 410340.81 ms |
| Text | current | 5000 | 2048 | 667.15 s | 7.49 req/s | 400743.65 ms | 399351.97 ms |
| Base64 image | main | 20 | 4096 | 277.36 s | 0.07 req/s | 273.00 s | 277.29 s |
| Base64 image | current | 20 | 4096 | 245.87 s | 0.08 req/s | 241.99 s | 245.72 s |
| HTTP image URL | main | 1000 | 256 | 29.00 s | 34.48 req/s | 27.35 s | 26.96 s |
| HTTP image URL | current | 1000 | 256 | 34.88 s | 28.67 req/s | 33.44 s | 33.32 s |

Text-Only Benchmark

Dataset: ShareGPT
Prompts: 5000
Output length: 2048
Input tokens: 1,229,468
Output tokens: 10,240,000

| Metric | main | current |
|---|---|---|
| Benchmark duration | 668.23 s | 667.15 s |
| Request throughput | 7.48 req/s | 7.49 req/s |
| Input token throughput | 1839.88 tok/s | 1842.85 tok/s |
| Output token throughput | 15324.00 tok/s | 15348.77 tok/s |
| Mean E2E latency | 402589.61 ms | 400743.65 ms |
| Median E2E latency | 410340.81 ms | 399351.97 ms |
| Mean TTFT | 274000.04 ms | 275300.25 ms |
| Median TTFT | 279548.01 ms | 277838.53 ms |
| P99 TTFT | 560014.47 ms | 559741.67 ms |
| Mean TPOT | 62.82 ms | 61.28 ms |
| Median TPOT | 63.82 ms | 60.57 ms |
| P99 TPOT | 75.43 ms | 74.52 ms |
| Mean ITL | 891.75 ms | 845.23 ms |
| Median ITL | 792.10 ms | 773.61 ms |
| P99 ITL | 2696.69 ms | 1918.19 ms |

Base64 Image Benchmark

Requests: 20
Output length: 4096

| Metric | main | current |
|---|---|---|
| Total wall time | 277.36 s | 245.87 s |
| Successful requests | 20 | 20 |
| Failed requests | 0 | 0 |
| Min latency | 247.95 s | 208.91 s |
| Median latency | 277.29 s | 245.72 s |
| Mean latency | 273.00 s | 241.99 s |
| Max latency | 277.35 s | 245.87 s |
| Latency stdev | 9.21 s | 9.33 s |
| Throughput | 0.07 req/s | 0.08 req/s |

HTTP Image URL Benchmark

Requests: 1000
Output length: 256

| Metric | main | current |
|---|---|---|
| Total wall time | 29.00 s | 34.88 s |
| Successful requests | 1000 | 1000 |
| Failed requests | 0 | 0 |
| Min latency | 25.88 s | 32.33 s |
| Median latency | 26.96 s | 33.32 s |
| Mean latency | 27.35 s | 33.44 s |
| Max latency | 28.85 s | 34.73 s |
| Latency stdev | 1.03 s | 0.68 s |
| Throughput | 34.48 req/s | 28.67 req/s |

Notes

The current branch keeps text-only throughput roughly unchanged and improves the heavy base64-image case. The HTTP image URL case is slower in this run, and it is the main tradeoff to watch: its request payload is much smaller, so it benefits less from serializing request preparation.

@CUHKSZzxy (Collaborator, Author) commented May 9, 2026

> I just think the logic of await_executor_future is a bit complicated. If we use asyncio.Semaphore, could we eliminate this function?

Agreed that it looks complicated. I simplified await_executor_future, but I haven't found a way to remove the helper entirely; it was introduced to address the Copilot review comments.

The helper is not mainly about choosing lock vs semaphore. It handles cancellation semantics around run_in_executor: if the coroutine is cancelled while awaiting the executor future, the async context manager would otherwise release the lock/semaphore immediately, while the underlying thread work keeps running. That would allow later requests to submit more executor work and rebuild the backlog.

So even with a semaphore, we still need to shield the executor future and keep the async gate held until the executor job finishes.
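
A minimal sketch of the cancellation-safe pattern the helper implements (the name await_executor_future matches the PR; the body is illustrative, not the actual implementation):

```python
import asyncio

async def await_executor_future(fut: asyncio.Future):
    """Await an executor future while staying cancellation-safe."""
    try:
        return await asyncio.shield(fut)
    except asyncio.CancelledError:
        # The thread job cannot be interrupted, so wait for it to finish
        # before propagating; the caller's async gate stays held meanwhile.
        if not fut.done():
            await asyncio.wait([fut])
        raise
```

Called inside the async with lock: block, this keeps the gate closed for exactly as long as the executor thread is busy, which is the invariant the regression test above checks.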
