[Perf] Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement #29558
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request re-enables CUDA graph support for the deepep_high_throughput backend when data parallelism is used (dp_size > 1), which was previously disabled due to a cache hit issue. The change introduces logic to conditionally add MoE-related operations (vllm::moe_forward, vllm::moe_forward_shared) to the list of splitting_ops in the compilation configuration. Excluding these operations from the main CUDA graph lets the rest of the model be captured while the DeepEP MoE kernels run outside the graph. The changes are well-targeted, guarded by appropriate conditions, and include the removal of the old code that disabled this feature. The implementation appears correct and aligns with the goal of improving performance, as demonstrated by the benchmark results in the description.
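Based on that description, a minimal sketch of the conditional split it refers to; the config objects and attribute names here are assumptions drawn from the review text, not the actual PR diff:

# Sketch only: config/attribute names are assumptions, not the PR's code.
_MOE_SPLITTING_OPS = ["vllm::moe_forward", "vllm::moe_forward_shared"]

def maybe_split_moe_ops(compilation_config, parallel_config) -> None:
    """Exclude DeepEP-HT MoE ops from the main CUDA graph when DP > 1."""
    if (
        parallel_config.data_parallel_size > 1
        and parallel_config.all2all_backend == "deepep_high_throughput"
    ):
        if compilation_config.splitting_ops is None:
            compilation_config.splitting_ops = []
        for op in _MOE_SPLITTING_OPS:
            # A splitting op becomes a graph boundary: the op itself runs
            # outside the captured CUDA graph, the pieces around it inside.
            if op not in compilation_config.splitting_ops:
                compilation_config.splitting_ops.append(op)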
💡 Codex Review
vllm/vllm/config/compilation.py, lines 825 to 827 in 0c58593:

if self.use_inductor_graph_partition:
    self.set_splitting_ops_for_inductor_graph_partition()
    return
For the DeepEP high-throughput backend with data-parallel >1, CUDA graphs are now enabled after removing the guard in vllm/platforms/cuda.py, but this early return prevents the new MoE split logic below from running when use_inductor_graph_partition is set. In that configuration the DeepEP MoE kernels remain inside captured CUDA graphs, recreating the incompatibility that the MoE splits were meant to avoid. The inductor-partition path should also mark the MoE ops as splitting ops before returning.
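One way to address this, sketched here with a hypothetical helper that encapsulates the MoE-split check (not code from the PR), is to register the MoE splitting ops before the early return:

if self.use_inductor_graph_partition:
    # Hypothetical fix sketch: also mark the DeepEP MoE ops as splitting
    # ops on this path so they are not captured inside CUDA graphs.
    self._maybe_add_moe_splitting_ops()  # hypothetical helper
    self.set_splitting_ops_for_inductor_graph_partition()
    return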
ProExpertProg left a comment:
Thanks for implementing this, just make sure to handle the inductor partition case! Also could you add unit tests for the different cases?
self.cudagraph_mode = CUDAGraphMode.FULL
self.splitting_ops = []

# split moe op for cudagraph
This will not apply to the cases above (inductor partition or attn fusion)
Purpose
We canceled the update in #25093 because of the cache hit issue; that issue seems to have been fixed in main, so we can now enable cudagraph by default for deepEP HT with the MoE ops split out of the graph.
Test
vllm serve deepseek-ai/DeepSeek-V3.1 -dp 8 --enable-expert-parallel --port 9256

Acc
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-V3.1,num_concurrent=1024" --tasks gsm8k

Perf
vllm bench serve --model deepseek-ai/DeepSeek-V3.1 --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 256 --request-rate inf --num-prompts 1024