@yhyang201

No description provided.

@yhyang201 yhyang201 changed the title [WIP] Encoder-Prefill-Decode (EPD) Disaggregation Encoder-Prefill-Decode (EPD) Disaggregation Dec 16, 2025
--backend vllm-chat \
--request-rate $request_rate
```
Mean TTFT (EPD vs colocate)
Describe the setup more precisely (e.g., how many GPUs each setting uses)

done

- **Flexible transfer backends**: Support for multiple transfer mechanisms (ZMQ, GPU-direct via Mooncake) allows optimization for different deployment scenarios.
- **Vision embedding caching**: Frequently used images can be cached at encoder servers, eliminating redundant ViT computations and reducing network transfer overhead.
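The caching idea above can be sketched as a content-addressed store keyed by a hash of the raw image bytes, so repeated images skip both the ViT forward pass and the embedding transfer. This is a minimal illustration; the names (`EmbeddingCache`, `get_or_compute`) are not the PR's actual API.

```python
import hashlib

class EmbeddingCache:
    """Content-addressed cache for vision embeddings (illustrative sketch)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, image_bytes, encode_fn):
        # Key by image content, not by request, so identical images collide.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = encode_fn(image_bytes)
        return self._store[key]

# Usage: the first lookup runs the (stand-in) encoder, the second is a cache hit.
cache = EmbeddingCache()
fake_encoder = lambda b: [len(b)] * 4   # stand-in for the ViT forward pass
emb1 = cache.get_or_compute(b"image-0", fake_encoder)
emb2 = cache.get_or_compute(b"image-0", fake_encoder)
```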

For instance, in image-heavy scenarios, EPD significantly reduces request TTFT under load (approximately 6–8× lower than the colocated approach at 1 QPS).
Need to mention when it is useful and when it is not useful.

done.

- EPD keeps TTFT much lower under load (≈6–8× lower than colocate at 1 QPS).
- TPOT stays far below colocate (≈8–10× lower), indicating much tighter decode latency.
- Throughput roughly doubles at higher QPS (≈2× at 0.8–1.0 QPS vs. colocate).

It will be great to mention the trade-off here:

- We use more GPUs to reduce the TTFT.
- Sometimes, the encoder GPUs might be idle if we do not have that many images.
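The idle-encoder concern can be made concrete with a back-of-the-envelope utilization estimate. The request rate, images per request, and per-image encode time below are assumed numbers for illustration, not measurements from this PR.

```python
def encoder_utilization(qps, images_per_request, encode_s_per_image, num_encoder_gpus):
    """Fraction of time the dedicated encoder GPUs are busy (capped at 1.0)."""
    # GPU-seconds of encode work demanded per wall-clock second.
    demand = qps * images_per_request * encode_s_per_image
    return min(1.0, demand / num_encoder_gpus)

# Assumed workload: 0.5 QPS, ~4.6 images/request, 0.4 s/image, 4 encoder GPUs.
util = encoder_utilization(0.5, 4.6, 0.4, 4)
# A low fraction means the encoder GPUs sit idle most of the time.
```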

done

- Registers embeddings in shared memory
- High-bandwidth, low-latency transfer
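A minimal sketch of the shared-memory hand-off described above, using only the standard library's `multiprocessing.shared_memory`. The real backends (e.g. the GPU-direct Mooncake path) move device tensors; the producer/consumer names here are illustrative.

```python
import struct
from multiprocessing import shared_memory

# Producer side: register a float32 embedding vector in a named shared segment.
embedding = [0.5, 1.5, 2.5, 3.5]
payload = struct.pack(f"{len(embedding)}f", *embedding)
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Consumer side: attach by name and read the bytes without a socket copy.
shm2 = shared_memory.SharedMemory(name=shm.name)
received = list(struct.unpack(f"{len(embedding)}f", bytes(shm2.buf[:len(payload)])))

# Lifecycle: each attachment closes its handle; the creator unlinks the segment.
shm2.close()
shm.close()
shm.unlink()
```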

**Note**: The zmq_to_scheduler backend is not compatible with pipeline parallelism.
We can delete it now, since we support it in this PR

@gty111 commented Dec 28, 2025
EPD disaggregation can be advantageous in single-request, multi-image scenarios; it benefits both latency and throughput. Here are some benchmark results that keep the same number of GPUs for colocate and EPD disaggregation.

Qwen3-VL-30B-A3B, H100 x 8

  • colocate TP8 x 1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.10     
Total input tokens:                      379190    
Total input text tokens:                 72502     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.89   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5333.17   
Concurrency:                             1.29      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2865.13   
Median E2E Latency (ms):                 2530.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1769.44   
Median TTFT (ms):                        1699.01   
P99 TTFT (ms):                           6265.33   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • E (TP1) x 4 + PD (TP4) x 1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.12     
Total input tokens:                      379258    
Total input text tokens:                 72570     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.67   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5332.95   
Concurrency:                             0.66      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1468.98   
Median E2E Latency (ms):                 1404.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          960.22    
Median TTFT (ms):                        1052.19   
P99 TTFT (ms):                           2907.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Qwen2.5-7B-VL, H100 x 4

  • colocate TP4 x 1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.81     
Total input tokens:                      244968    
Total input text tokens:                 67664     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.79   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.98   
Concurrency:                             0.40      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1191.23   
Median E2E Latency (ms):                 1187.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          718.54    
Median TTFT (ms):                        717.21    
P99 TTFT (ms):                           2108.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • E (TP1) x 3 + PD (TP1) x 1
#Input tokens: 244991
#Output tokens: 18
#Total images: 148
#Images per request: min=1, max=8, mean=4.62

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.82     
Total input tokens:                      244991    
Total input text tokens:                 67687     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.73   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.91   
Concurrency:                             0.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.72    
Median E2E Latency (ms):                 611.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          378.07    
Median TTFT (ms):                        465.47    
P99 TTFT (ms):                           993.11    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
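As a sanity check, the speedups implied by the runs above can be computed directly from the reported means (latencies copied from the tables; the dict layout is just for illustration).

```python
# Mean latencies (ms) copied from the benchmark tables above.
runs = {
    "Qwen3-VL-30B-A3B": {"colocate": {"ttft": 1769.44, "e2e": 2865.13},
                         "epd":      {"ttft": 960.22,  "e2e": 1468.98}},
    "Qwen2.5-7B-VL":    {"colocate": {"ttft": 718.54,  "e2e": 1191.23},
                         "epd":      {"ttft": 378.07,  "e2e": 609.72}},
}

# Speedup = colocate latency / EPD latency, per model and metric.
speedups = {
    model: {metric: round(r["colocate"][metric] / r["epd"][metric], 2)
            for metric in ("ttft", "e2e")}
    for model, r in runs.items()
}
# With the same GPU count, EPD roughly halves mean TTFT and E2E latency here.
```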
