Encoder-Prefill-Decode (EPD) Disaggregation #274
base: main
Conversation
> ```
> --backend vllm-chat \
> --request-rate $request_rate
> ```
>
> Mean TTFT (EPD vs colocate)
Describe the setup more precisely (e.g., how many GPUs each setting uses)
done
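To make the benchmark setup above easier to reproduce, here is a minimal sketch of a load generator that issues requests at a fixed rate against an OpenAI-compatible endpoint. The endpoint URL, model name, and payload are placeholder assumptions for illustration, not the exact benchmark script or configuration used in the blog.

```python
# Minimal sketch of a fixed-rate load generator (assumptions: an
# OpenAI-compatible /v1/chat/completions endpoint on localhost:8000 and a
# placeholder model name; not the actual benchmark script quoted above).
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder model name
REQUEST_RATE = 1.0                     # requests per second (cf. --request-rate)
NUM_REQUESTS = 32

async def send_one(session: aiohttp.ClientSession, idx: int) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Describe image #{idx}."}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(NUM_REQUESTS):
            tasks.append(asyncio.create_task(send_one(session, i)))
            # Space arrivals evenly to approximate the target request rate.
            await asyncio.sleep(1.0 / REQUEST_RATE)
        latencies = await asyncio.gather(*tasks)
    print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.3f}s")

if __name__ == "__main__":
    asyncio.run(main())
```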
blog/2025-12-16-epd.md
Outdated
> - **Flexible transfer backends**: Support for multiple transfer mechanisms (ZMQ, GPU-direct via Mooncake) allows optimization for different deployment scenarios.
> - **Vision embedding caching**: Frequently used images can be cached at encoder servers, eliminating redundant ViT computations and reducing network transfer overhead.
>
> For instance, in image-heavy scenarios, we leverage EPD to significantly reduce request TTFT under load (approximately 6–8× lower compared to the colocation approach at 1 QPS).
Need to mention when it is useful and when it is not useful.
done.
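As a hedged illustration of the "vision embedding caching" idea quoted above, here is a minimal sketch of an encoder-side cache keyed by a hash of the raw image bytes, so repeated images skip the ViT forward pass. The `run_vit_encoder` stand-in, the cache size, and the embedding shape are assumptions for illustration, not the actual implementation in this PR.

```python
# Sketch of encoder-side vision-embedding caching (illustrative only).
# `run_vit_encoder` is a stand-in for the real ViT forward pass.
import hashlib
from collections import OrderedDict

import numpy as np

def run_vit_encoder(image_bytes: bytes) -> np.ndarray:
    """Placeholder for the actual ViT encoder; returns dummy embeddings."""
    return np.zeros((256, 1024), dtype=np.float16)

class VisionEmbeddingCache:
    """LRU cache keyed by a hash of the raw image bytes."""

    def __init__(self, max_entries: int = 1024) -> None:
        self.max_entries = max_entries
        self._cache: OrderedDict[str, np.ndarray] = OrderedDict()

    def get_or_encode(self, image_bytes: bytes) -> np.ndarray:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)          # refresh LRU position
            return self._cache[key]               # cache hit: skip the ViT pass
        embedding = run_vit_encoder(image_bytes)  # cache miss: run the encoder
        self._cache[key] = embedding
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)       # evict least recently used
        return embedding
```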
> - Encoder/prefill keeps TTFT much lower under load (≈6–8x lower than colocate at 1 qps).
> - TPOT stays far below colocate (≈8–10x lower), indicating much tighter latency.
> - Throughput roughly doubles at higher QPS (≈2x at 0.8–1.0 qps vs. colocate).
It would be great to mention the trade-off here:
- We use more GPUs to reduce the TTFT.
- Sometimes the encoder GPUs might sit idle if there are not that many images.
done
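For reference, the TTFT and TPOT figures quoted above can be derived from per-request timestamps roughly as follows. This is a generic sketch of the standard definitions, not the benchmark script's actual code.

```python
# Sketch of how TTFT and TPOT are typically derived from per-request
# timestamps (generic definitions, not the benchmark script's exact code).
from dataclasses import dataclass

@dataclass
class RequestTiming:
    sent_at: float          # time the request was issued
    first_token_at: float   # time the first output token arrived
    finished_at: float      # time the last output token arrived
    num_output_tokens: int

def ttft(t: RequestTiming) -> float:
    """Time to first token."""
    return t.first_token_at - t.sent_at

def tpot(t: RequestTiming) -> float:
    """Time per output token, excluding the first token."""
    decode_tokens = max(t.num_output_tokens - 1, 1)
    return (t.finished_at - t.first_token_at) / decode_tokens

# Example with made-up timings, purely to show the arithmetic.
timings = [
    RequestTiming(sent_at=0.0, first_token_at=0.35, finished_at=2.05, num_output_tokens=64),
    RequestTiming(sent_at=0.5, first_token_at=0.92, finished_at=2.60, num_output_tokens=58),
]
print("mean TTFT:", sum(ttft(t) for t in timings) / len(timings))
print("mean TPOT:", sum(tpot(t) for t in timings) / len(timings))
```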
blog/2025-12-16-epd.md
Outdated
> - Registers embeddings in shared memory
> - High-bandwidth, low-latency transfer
>
> **Note**: The zmq_to_scheduler backend is not compatible with pipeline parallelism.
We can delete this note now, since pipeline parallelism is supported in this PR.
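For readers unfamiliar with the ZMQ path, here is a minimal sketch of how an encoder worker could hand embeddings to a scheduler over a ZeroMQ socket. The socket address, message layout, and dtype handling are illustrative assumptions, not the zmq_to_scheduler backend's actual wire format.

```python
# Illustrative ZeroMQ embedding hand-off. The address and two-frame message
# layout are assumptions, not the zmq_to_scheduler backend's wire format.
import json

import numpy as np
import zmq

def send_embeddings(request_id: str, embeddings: np.ndarray,
                    addr: str = "tcp://127.0.0.1:5555") -> None:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(addr)
    header = json.dumps({
        "request_id": request_id,
        "shape": embeddings.shape,
        "dtype": str(embeddings.dtype),
    }).encode()
    # Frame 1: metadata, frame 2: raw embedding bytes.
    sock.send_multipart([header, embeddings.tobytes()])
    sock.close()

def recv_embeddings(addr: str = "tcp://127.0.0.1:5555") -> tuple[str, np.ndarray]:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.bind(addr)
    header_bytes, payload = sock.recv_multipart()
    meta = json.loads(header_bytes)
    emb = np.frombuffer(payload, dtype=meta["dtype"]).reshape(meta["shape"])
    sock.close()
    return meta["request_id"], emb
```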
EPD disaggregation can be advantageous in single-request, multi-image scenarios. It benefits both latency and throughput.
Qwen3-VL-30B-A3B, H100 x 8
Qwen2.5-7B-VL, H100 x 4
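To illustrate the single-request multi-image scenario, below is a sketch of one OpenAI-compatible chat request carrying several images at once. The endpoint, model name, and image URLs are placeholders, not the actual benchmark inputs.

```python
# Sketch of a single chat request with multiple images, sent to an
# OpenAI-compatible server (endpoint, model name, and URLs are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

image_urls = [
    "https://example.com/image_0.jpg",
    "https://example.com/image_1.jpg",
    "https://example.com/image_2.jpg",
]

# One user message whose content mixes text with several image_url parts.
content = [{"type": "text", "text": "Compare these images."}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": content}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```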