
Conversation

@yewentao256 (Member) commented Nov 27, 2025

Purpose

The update was cancelled in #25093 because of the cache-hit issue. That issue now appears to be fixed on main, so we can enable CUDA graph as the default for the DeepEP high-throughput (deepEPHT) backend by splitting out the MoE ops.

Test

vllm serve deepseek-ai/DeepSeek-V3.1 -dp 8 --enable-expert-parallel --port 9256

Acc

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-V3.1,num_concurrent=1024" --tasks gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match||0.9575|±  |0.0056|

Perf

vllm bench serve --model deepseek-ai/DeepSeek-V3.1 --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 256 --request-rate inf --num-prompts 1024

# this PR (CUDA graph enabled)
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  34.20     
Total input tokens:                      1024      
Total generated tokens:                  262144    
Request throughput (req/s):              29.94     
Output token throughput (tok/s):         7665.18   
Peak output token throughput (tok/s):    8192.00   
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          7695.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          1195.77   
Median TTFT (ms):                        1248.90   
P99 TTFT (ms):                           1272.42   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.00    
Median TPOT (ms):                        128.99    
P99 TPOT (ms):                           129.19    
---------------Inter-token Latency----------------
Mean ITL (ms):                           129.00    
Median ITL (ms):                         129.15    
P99 ITL (ms):                            167.09    
==================================================

# main (CUDA graph set to None)
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  36.03     
Total input tokens:                      1024      
Total generated tokens:                  262144    
Request throughput (req/s):              28.42     
Output token throughput (tok/s):         7275.19   
Peak output token throughput (tok/s):    8192.00   
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          7303.61   
---------------Time to First Token----------------
Mean TTFT (ms):                          1248.72   
Median TTFT (ms):                        1256.96   
P99 TTFT (ms):                           1411.12   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          135.76    
Median TPOT (ms):                        135.77    
P99 TPOT (ms):                           136.01    
---------------Inter-token Latency----------------
Mean ITL (ms):                           135.76    
Median ITL (ms):                         135.62    
P99 ITL (ms):                            172.30    
==================================================

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request re-enables CUDA graph support for the deepep_high_throughput backend when data parallelism is used (dp_size > 1); it had previously been disabled due to a cache-hit issue. The change introduces logic to conditionally add the MoE-related operations (vllm::moe_forward, vllm::moe_forward_shared) to the list of splitting_ops in the compilation configuration, excluding them from the main CUDA graph so that the rest of the model can still be captured. The changes are well-targeted, guarded by appropriate conditions, and include the removal of the old code that disabled this feature. The implementation appears correct and aligns with the goal of improving performance, as demonstrated by the benchmark results in the description.
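Conceptually, the logic the review describes could look something like the sketch below. This is a rough illustration rather than the PR's actual code: the function name, the moe_backend string check, and the dp_size argument are hypothetical stand-ins; only the two op names and the splitting_ops field come from the review above.

```python
# Rough sketch of the conditional MoE-op split described in the review.
# All names except the two op strings and `splitting_ops` are illustrative.
MOE_SPLITTING_OPS = ["vllm::moe_forward", "vllm::moe_forward_shared"]


def maybe_split_moe_ops(compilation_config, moe_backend: str, dp_size: int) -> None:
    """Keep the DeepEP MoE kernels out of the captured CUDA graph."""
    # Only the DeepEP high-throughput backend with data parallelism needs
    # the split; other configurations can stay inside the full CUDA graph.
    if moe_backend != "deepep_high_throughput" or dp_size <= 1:
        return
    for op in MOE_SPLITTING_OPS:
        if op not in compilation_config.splitting_ops:
            compilation_config.splitting_ops.append(op)
```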

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 27, 2025
@chatgpt-codex-connector bot left a comment


💡 Codex Review

```python
if self.use_inductor_graph_partition:
    self.set_splitting_ops_for_inductor_graph_partition()
    return
```

P1: Split DeepEP MoE ops when using inductor partition

For the DeepEP high-throughput backend with data-parallel >1, CUDA graphs are now enabled after removing the guard in vllm/platforms/cuda.py, but this early return prevents the new MoE split logic below from running when use_inductor_graph_partition is set. In that configuration the DeepEP MoE kernels remain inside captured CUDA graphs, recreating the incompatibility that the MoE splits were meant to avoid. The inductor-partition path should also mark the MoE ops as splitting ops before returning.
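One possible shape of the fix Codex suggests, sketched against the quoted snippet (a sketch only: the `_should_split_moe_ops` helper is purely illustrative, and only `use_inductor_graph_partition`, `set_splitting_ops_for_inductor_graph_partition`, `splitting_ops`, and the op names appear in this thread):

```python
if self.use_inductor_graph_partition:
    self.set_splitting_ops_for_inductor_graph_partition()
    # Hypothetical addition: also split the DeepEP MoE ops on this path so
    # they stay outside captured CUDA graphs (the helper is illustrative).
    if self._should_split_moe_ops():
        self.splitting_ops += [
            "vllm::moe_forward",
            "vllm::moe_forward_shared",
        ]
    return
```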


@ProExpertProg (Collaborator) left a comment


Thanks for implementing this, just make sure to handle the inductor partition case! Also could you add unit tests for the different cases?

```python
self.cudagraph_mode = CUDAGraphMode.FULL
self.splitting_ops = []

# split moe op for cudagraph
```
Collaborator (inline comment):

This will not apply to the cases above (inductor partition or attn fusion)
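Regarding the unit-test request in the review above, here is a minimal pytest sketch of what parametrized coverage for the different cases might look like. Everything in it is a stand-in: `FakeCompilationConfig` and `maybe_split_moe_ops` are not vLLM APIs; only the attribute names mirror the snippets quoted in this thread.

```python
# Hypothetical pytest sketch for the requested cases; all names below are
# stand-ins, not the vLLM implementation.
import pytest

MOE_OPS = ["vllm::moe_forward", "vllm::moe_forward_shared"]


class FakeCompilationConfig:
    """Minimal stand-in with the attributes the splitting logic touches."""

    def __init__(self, use_inductor_graph_partition: bool) -> None:
        self.use_inductor_graph_partition = use_inductor_graph_partition
        self.splitting_ops: list[str] = []


def maybe_split_moe_ops(config, moe_backend: str, dp_size: int) -> None:
    # Stand-in for the PR's logic: DeepEP HT with dp > 1 should split the
    # MoE ops on both the default and the inductor-partition paths.
    if moe_backend == "deepep_high_throughput" and dp_size > 1:
        config.splitting_ops += [op for op in MOE_OPS
                                 if op not in config.splitting_ops]


@pytest.mark.parametrize("use_inductor_graph_partition", [False, True])
@pytest.mark.parametrize("dp_size,should_split", [(1, False), (8, True)])
def test_moe_ops_split_for_deepep_ht(use_inductor_graph_partition, dp_size,
                                      should_split):
    config = FakeCompilationConfig(use_inductor_graph_partition)
    maybe_split_moe_ops(config, "deepep_high_throughput", dp_size)
    assert all(op in config.splitting_ops for op in MOE_OPS) == should_split
```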


Labels: nvidia, ready
