[Feat]Support broadcast on ascend #503

harrisonyhq · 2025-12-10T09:03:20Z

Purpose

Support broadcast on Ascend, because the K cache of MLA models has two components (nope, rope) with different shapes;
Avoid creating empty tensors every time a broadcast is needed; reuse a buffer to reduce allocation time.
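
The buffer reuse described above can be sketched as a small cache keyed by shape and dtype, so the nope and rope parts each get their own receive buffer and it is allocated only once. This is an illustrative sketch, not the code in this PR: `BroadcastBufferCache`, `allocate_bytes`, and the shapes are hypothetical, and a `bytearray` stands in for a device tensor. In a PyTorch setting the allocator would presumably be something like `torch.empty`, with the cached tensor passed to `torch.distributed.broadcast`; that mapping is an assumption.

```python
from typing import Callable, Dict, Hashable, Tuple

Shape = Tuple[int, ...]


class BroadcastBufferCache:
    """Hypothetical cache of receive buffers, one per (shape, dtype) pair."""

    def __init__(self, allocate: Callable[[Shape, Hashable], object]):
        self._allocate = allocate
        self._buffers: Dict[Tuple[Shape, Hashable], object] = {}
        self.allocations = 0  # counts real allocations, for illustration

    def get(self, shape, dtype):
        key = (tuple(shape), dtype)
        buf = self._buffers.get(key)
        if buf is None:
            # First request for this shape/dtype: allocate once and keep it.
            buf = self._allocate(key[0], dtype)
            self._buffers[key] = buf
            self.allocations += 1
        return buf


def allocate_bytes(shape, dtype):
    """Stand-in allocator: a bytearray with an assumed 2 bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return bytearray(count * 2)


cache = BroadcastBufferCache(allocate_bytes)

# Assumed, illustrative shapes for the two K cache parts.
for _ in range(3):
    nope_buf = cache.get((16, 512), "float16")
    rope_buf = cache.get((16, 64), "float16")
```

One design consequence: a cached buffer is overwritten by the next broadcast of the same shape, so a caller has to finish using (or copy out) the received data before requesting that buffer again.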

See file changes.

Model	Platform	TP	Input length	Concurrency	Broadcast	Odirect	TTFT
DeepSeek-V2-Lite	CUDA	4	4K	50	False	False	855 ms
DeepSeek-V2-Lite	CUDA	4	4K	50	True	False	523 ms

[Feat]Support broadcast on ascend

7836ab7

harrisonyhq requested review from hek14, mag1c-h, qyh111 and ygwpz as code owners December 10, 2025 09:03