[Perf] Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement #29558
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request re-enables CUDA graph support for the deepep_high_throughput backend when data parallelism is used (dp_size > 1), which was previously disabled due to a cache hit issue. The change introduces logic to conditionally add MoE-related operations (vllm::moe_forward, vllm::moe_forward_shared) to the list of splitting_ops in the compilation configuration. Excluding these operations from the main CUDA graph lets the rest of the model be captured while the DeepEP MoE kernels run outside the graph. The changes are well-targeted, guarded by appropriate conditions, and include the removal of the old code that disabled this feature. The implementation appears correct and aligns with the goal of improving performance, as demonstrated by the benchmark results in the description.
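Based on that description, a minimal sketch of the conditional split it refers to; the config objects and attribute names here are assumptions drawn from the review text, not the actual PR diff:

# Sketch only: config/attribute names are assumptions, not the PR's code.
_MOE_SPLITTING_OPS = ["vllm::moe_forward", "vllm::moe_forward_shared"]

def maybe_split_moe_ops(compilation_config, parallel_config) -> None:
    """Exclude DeepEP-HT MoE ops from the main CUDA graph when DP > 1."""
    if (
        parallel_config.data_parallel_size > 1
        and parallel_config.all2all_backend == "deepep_high_throughput"
    ):
        if compilation_config.splitting_ops is None:
            compilation_config.splitting_ops = []
        for op in _MOE_SPLITTING_OPS:
            # A splitting op becomes a graph boundary: the op itself runs
            # outside the captured CUDA graph, the pieces around it inside.
            if op not in compilation_config.splitting_ops:
                compilation_config.splitting_ops.append(op)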
💡 Codex Review
vllm/vllm/config/compilation.py, lines 825 to 827 in 0c58593:

if self.use_inductor_graph_partition:
    self.set_splitting_ops_for_inductor_graph_partition()
    return
For the DeepEP high-throughput backend with data-parallel >1, CUDA graphs are now enabled after removing the guard in vllm/platforms/cuda.py, but this early return prevents the new MoE split logic below from running when use_inductor_graph_partition is set. In that configuration the DeepEP MoE kernels remain inside captured CUDA graphs, recreating the incompatibility that the MoE splits were meant to avoid. The inductor-partition path should also mark the MoE ops as splitting ops before returning.
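One way to address this, sketched here with a hypothetical helper that encapsulates the MoE-split check (not code from the PR), is to register the MoE splitting ops before the early return:

if self.use_inductor_graph_partition:
    # Hypothetical fix sketch: also mark the DeepEP MoE ops as splitting
    # ops on this path so they are not captured inside CUDA graphs.
    self._maybe_add_moe_splitting_ops()  # hypothetical helper
    self.set_splitting_ops_for_inductor_graph_partition()
    return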
ProExpertProg left a comment:
Thanks for implementing this, just make sure to handle the inductor partition case! Also could you add unit tests for the different cases?
self.cudagraph_mode = CUDAGraphMode.FULL
self.splitting_ops = []

# split moe op for cudagraph
This will not apply to the cases above (inductor partition or attn fusion)
Purpose
We canceled the update in #25093 because of the cache hit issue; that issue seems to have been fixed in main, so we can now enable cudagraph by default for deepEP HT with the MoE ops split out of the graph.
Test
vllm serve deepseek-ai/DeepSeek-V3.1 -dp 8 --enable-expert-parallel --port 9256

Acc
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-V3.1,num_concurrent=1024" --tasks gsm8k

Perf
vllm bench serve --model deepseek-ai/DeepSeek-V3.1 --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 256 --request-rate inf --num-prompts 1024