deepseek v4 fp8_einsum enable by ganyi1996ppo · Pull Request #960 · ROCm/ATOM

ganyi1996ppo · 2026-05-28T08:54:59Z

Motivation

Test Plan

Test Result

For small concurrency such as 8, the results are basically the same:
this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  63.20     
Total input tokens:                      32768     
Total generated tokens:                  32768     
Request throughput (req/s):              0.51      
Output token throughput (tok/s):         518.49    
Peak output token throughput (tok/s):    544.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1036.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          291.49    
Median TTFT (ms):                        301.98    
P99 TTFT (ms):                           308.53    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.16     
Median TPOT (ms):                        15.15     
P99 TPOT (ms):                           15.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.14     
Median ITL (ms):                         15.16     
P99 ITL (ms):                            16.16     
==================================================

before

============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  62.68     
Total input tokens:                      32768     
Total generated tokens:                  32768     
Request throughput (req/s):              0.51      
Output token throughput (tok/s):         522.79    
Peak output token throughput (tok/s):    552.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1045.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          288.79    
Median TTFT (ms):                        277.55    
P99 TTFT (ms):                           367.73    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.03     
Median TPOT (ms):                        15.03     
P99 TPOT (ms):                           15.15     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.02     
Median ITL (ms):                         15.04     
P99 ITL (ms):                            15.84     
==================================================

For large concurrency such as 64, there is a small improvement:
this PR

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  136.14    
Total input tokens:                      2097152   
Total generated tokens:                  262144    
Request throughput (req/s):              1.88      
Output token throughput (tok/s):         1925.52   
Peak output token throughput (tok/s):    3456.00   
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          17329.64  
---------------Time to First Token----------------
Mean TTFT (ms):                          7411.25   
Median TTFT (ms):                        7397.39   
P99 TTFT (ms):                           14209.87  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.02     
Median TPOT (ms):                        25.99     
P99 TPOT (ms):                           32.28     
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.99     
Median ITL (ms):                         19.35     
P99 ITL (ms):                            21.45     
==================================================

before

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  139.50    
Total input tokens:                      2097152   
Total generated tokens:                  262144    
Request throughput (req/s):              1.84      
Output token throughput (tok/s):         1879.18   
Peak output token throughput (tok/s):    4022.00   
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          16912.66  
---------------Time to First Token----------------
Mean TTFT (ms):                          7486.73   
Median TTFT (ms):                        7466.13   
P99 TTFT (ms):                           14321.71  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.76     
Median TPOT (ms):                        26.78     
P99 TPOT (ms):                           33.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.73     
Median ITL (ms):                         20.10     
P99 ITL (ms):                            21.43     
==================================================
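
To make the comparison above easier to read, the relative change between the two runs can be worked out as in the sketch below. The numbers are copied from the benchmark output in this description; the helper name `pct_change` and the dictionary labels are hypothetical and only serve the illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


# (before, this PR) pairs copied from the benchmark output above.
runs = {
    "conc=8 output tok/s": (522.79, 518.49),
    "conc=8 mean TPOT ms": (15.03, 15.16),
    "conc=64 output tok/s": (1879.18, 1925.52),
    "conc=64 mean TPOT ms": (26.76, 26.02),
}

# Round to two decimals so the values are easy to compare by eye.
deltas = {name: round(pct_change(b, a), 2) for name, (b, a) in runs.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

At concurrency 8 both metrics move by less than 1%, while at concurrency 64 output throughput rises by roughly 2.5% and mean TPOT drops by roughly 2.8%. Each configuration was run once, so these deltas carry no error bars.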

64 decode:
bf16 einsum

fp8 einsum

prefill 16384:
bf16 einsum

fp8_einsum

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-05-28T08:55:22Z

+    cos_mask = s_mask[:, None]
+    cos = tl.load(cos_ptr + cos_offs, mask=cos_mask, other=0.0)
+    sin = tl.load(sin_ptr + cos_offs, mask=cos_mask, other=0.0)
+    even_mask = (rope_d_offs % 2 == 0)[None, :]


⚠️ [ruff] F841 (reported by reviewdog 🐶)
Local variable even_mask is assigned to but never used

Suggested change

-    even_mask = (rope_d_offs % 2 == 0)[None, :]
+    (rope_d_offs % 2 == 0)[None, :]
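
For readers unfamiliar with the flagged expression, here is a plain-Python sketch of what `(rope_d_offs % 2 == 0)[None, :]` evaluates to, assuming `rope_d_offs` is a contiguous range of offsets along the RoPE dimension (that assumption, and the helper name `even_mask_row`, are not taken from the PR). It shows only the value of the mask; it is not the Triton kernel.

```python
def even_mask_row(rope_dim: int) -> list[list[bool]]:
    """Plain-Python analogue of (rope_d_offs % 2 == 0)[None, :].

    Assumes rope_d_offs = 0, 1, ..., rope_dim - 1 (hypothetical).
    The outer list plays the role of the leading broadcast axis added by [None, :].
    """
    rope_d_offs = range(rope_dim)
    return [[d % 2 == 0 for d in rope_d_offs]]  # shape (1, rope_dim)


mask = even_mask_row(6)
print(mask)
```

The bot's suggestion drops the binding but keeps the expression as a bare statement. Since nothing reads the result, deleting the line entirely would appear to be an equivalent and tidier fix.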

Signed-off-by: ganyi <ygan@amd.com>

github-actions Bot reviewed May 28, 2026


integrate fp8 einsum to deepseekv4

7503fdd

Signed-off-by: ganyi <ygan@amd.com>

ganyi1996ppo force-pushed the ganyi/deepseek_v4_einsum branch from 17a2738 to 7503fdd Compare May 28, 2026 08:56

ganyi1996ppo added 2 commits May 29, 2026 10:04

add dispatch for inv_rope and einsum

2d5d5eb

Signed-off-by: ganyi <ygan@amd.com>

format

46d7e8b

Signed-off-by: ganyi <ygan@amd.com>

This was referenced May 30, 2026

ATOM Development Roadmap (2026 Q2) sunway513/ATOM#63

Closed

ATOM Development Roadmap (2026 Q2) #988

Open

ganyi1996ppo wants to merge 3 commits into main from ganyi/deepseek_v4_einsum
