[0.11.0][BugFix] Improve the performance of prefixcache features #4021
Conversation
Signed-off-by: underfituu <hzhucong@163.com>
Code Review
This pull request aims to improve performance by avoiding unnecessary device synchronization in prefix cache handling for Ascend NPUs. The changes primarily involve adjusting the device placement of tensors before they are passed to NPU kernels.
The modifications in vllm_ascend/attention/mla_v1.py correctly move a tensor to the NPU earlier to prevent a blocking transfer. The changes in vllm_ascend/torchair/torchair_mla.py remove an explicit tensor transfer, relying on the kernel to handle it, which also addresses the synchronization issue.
My review identifies a potential missed optimization in vllm_ascend/torchair/torchair_mla.py where a tensor is still being created on the CPU and passed to an NPU kernel, which could cause similar performance issues. I've provided a suggestion to align it with the approach taken in mla_v1.py.
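For readers unfamiliar with the pattern, the sketch below illustrates the general idea behind the change; the helper name `stage_seq_lens` and the argument `seq_lens_cpu` are illustrative, not taken from the PR. The point is to copy length tensors to the NPU asynchronously before a kernel needs them, so the kernel call does not trigger an implicit, blocking host-to-device transfer.

```python
import torch

# Minimal sketch of the device-placement pattern described above. The helper
# name and its arguments are illustrative; they do not exist in vllm-ascend.
def stage_seq_lens(seq_lens_cpu: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a host-side length tensor to the accelerator ahead of the kernel call."""
    if seq_lens_cpu.device == device:
        return seq_lens_cpu
    # non_blocking=True requests an asynchronous copy so the CPU can keep
    # enqueuing kernels; it genuinely overlaps only when the source tensor
    # sits in pinned (page-locked) host memory.
    return seq_lens_cpu.to(device, non_blocking=True)
```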
vllm_ascend/torchair/torchair_mla.py
Outdated
```python
seq_len_chunk = prefill_metadata.chunked_context.chunk_seq_lens[i]
seq_len = torch.stack([seq_len_base, seq_len_chunk])
```
The tensor seq_len is constructed on the CPU and subsequently passed to the torch_npu.atb.npu_ring_mla kernel at line 824. This might lead to implicit synchronization and performance degradation, which is similar to the issue this PR aims to fix for npu_paged_cache_load.
In vllm_ascend/attention/mla_v1.py, a similar pattern was addressed by ensuring the tensor is on the NPU device before being used in the kernel. To maintain consistency and prevent potential performance bottlenecks, seq_len should be moved to the NPU.
I suggest moving the tensor to the correct device during its creation.
Suggested change:

```diff
  seq_len_chunk = prefill_metadata.chunked_context.chunk_seq_lens[i]
- seq_len = torch.stack([seq_len_base, seq_len_chunk])
+ seq_len = torch.stack([seq_len_base, seq_len_chunk]).to(q_nope.device, non_blocking=True)
```
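A note on the suggestion: `non_blocking=True` is only a hint. The copy overlaps with subsequent CPU-side kernel launches when the source tensor is in pinned host memory, and as long as the kernel is enqueued on the same stream as the copy it will still observe the completed transfer, so the change should not affect correctness.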
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: underfituu <hzhucong@163.com>
Force-pushed: b5b624e → 8269e8e → 1a01fc0 → 57de640
Cherry-pick of #4022 (main).
What this PR does / why we need it?
A code bug caused an empty bubble in the device pipeline: when the npu_paged_cache_load operator was called, it forcibly transferred seq_len2 to the device, which triggered a synchronization and interrupted the CPU's operator-launch stream.
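As a hedged illustration of that bubble, the sketch below uses `fake_paged_cache_load` as a stand-in for the real NPU kernel (the actual torch_npu signature is not reproduced here): if the length tensor arrives on the host, the runtime must copy it over and wait, stalling the thread that is otherwise busy enqueuing kernels; staging the tensor on the device first keeps the launch stream flowing.

```python
import torch

# Stand-in for the real NPU kernel; the actual torch_npu signature is not
# reproduced here. It only checks that its length argument already lives on
# the same device as the cache, mirroring the fix described in this PR.
def fake_paged_cache_load(kv_cache: torch.Tensor, seq_lens: torch.Tensor) -> None:
    assert seq_lens.device == kv_cache.device, (
        "seq_lens arriving on the host would force a blocking copy here")

device = torch.device("cpu")  # use the NPU device in a real vllm-ascend setup
kv_cache = torch.zeros(4, 16, device=device)

# Before the fix (conceptually): seq_len2 stayed on the CPU, and the kernel's
# implicit host-to-device transfer synchronized the launch stream.
# After the fix: the tensor is staged on the device asynchronously, up front.
seq_len2 = torch.tensor([8, 8, 4, 2], dtype=torch.int32)
seq_len2 = seq_len2.to(device, non_blocking=True)
fake_paged_cache_load(kv_cache, seq_len2)
```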
Does this PR introduce any user-facing change?
How was this patch tested?