Labels: RFC (Request For Comments)
Description
Motivation.
The existing end-to-end (E2E) and nightly tests take extremely long to run (almost 139 min for E2E and 4 h 51 min for the nightly tests). We need to find ways to reduce test duration and improve the developer experience.
Proposed Change.
- Speed up weight loading:
As vllm#24469 ([Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre) shows, loading models with --safetensors-load-strategy eager can greatly improve loading speed when the model is stored on network storage such as NFS.
I also ran a local experiment loading the Qwen3-32B model from sfs_turbo, with the following results:
# without eager mode (default: lazy)
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:03<00:50, 3.13s/it]
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:08<01:09, 4.63s/it]
Loading safetensors checkpoint shards: 18% Completed | 3/17 [00:14<01:11, 5.09s/it]
Loading safetensors checkpoint shards: 24% Completed | 4/17 [00:20<01:09, 5.38s/it]
Loading safetensors checkpoint shards: 29% Completed | 5/17 [00:24<00:58, 4.87s/it]
Loading safetensors checkpoint shards: 35% Completed | 6/17 [00:30<00:57, 5.21s/it]
Loading safetensors checkpoint shards: 41% Completed | 7/17 [00:36<00:55, 5.58s/it]
Loading safetensors checkpoint shards: 47% Completed | 8/17 [00:42<00:51, 5.67s/it]
Loading safetensors checkpoint shards: 53% Completed | 9/17 [00:48<00:46, 5.77s/it]
Loading safetensors checkpoint shards: 59% Completed | 10/17 [00:54<00:40, 5.82s/it]
Loading safetensors checkpoint shards: 65% Completed | 11/17 [01:00<00:34, 5.82s/it]
Loading safetensors checkpoint shards: 71% Completed | 12/17 [01:05<00:29, 5.82s/it]
Loading safetensors checkpoint shards: 76% Completed | 13/17 [01:10<00:21, 5.49s/it]
Loading safetensors checkpoint shards: 82% Completed | 14/17 [01:16<00:16, 5.60s/it]
Loading safetensors checkpoint shards: 88% Completed | 15/17 [01:21<00:11, 5.54s/it]
Loading safetensors checkpoint shards: 94% Completed | 16/17 [01:27<00:05, 5.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [01:33<00:00, 5.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [01:33<00:00, 5.51s/it]
(Worker_TP0 pid=1199)
(Worker_TP0 pid=1199) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 93.78 seconds
(Worker_TP3 pid=1202) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 95.21 seconds
(Worker_TP2 pid=1201) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 97.60 seconds
(Worker_TP1 pid=1200) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 96.35 seconds
# with eager mode (--safetensors-load-strategy eager)
loading safetensors checkpoint shards (eager): 0% Completed | 0/17 [00:00<?, ?it/s]
(Worker_TP1 pid=2940) Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B
Loading safetensors checkpoint shards (eager): 6% Completed | 1/17 [00:01<00:17, 1.11s/it]
Loading safetensors checkpoint shards (eager): 12% Completed | 2/17 [00:02<00:17, 1.17s/it]
Loading safetensors checkpoint shards (eager): 18% Completed | 3/17 [00:03<00:16, 1.17s/it]
Loading safetensors checkpoint shards (eager): 24% Completed | 4/17 [00:04<00:15, 1.16s/it]
Loading safetensors checkpoint shards (eager): 29% Completed | 5/17 [00:05<00:12, 1.07s/it]
Loading safetensors checkpoint shards (eager): 35% Completed | 6/17 [00:06<00:12, 1.11s/it]
Loading safetensors checkpoint shards (eager): 41% Completed | 7/17 [00:07<00:11, 1.15s/it]
Loading safetensors checkpoint shards (eager): 47% Completed | 8/17 [00:09<00:10, 1.16s/it]
Loading safetensors checkpoint shards (eager): 53% Completed | 9/17 [00:10<00:09, 1.18s/it]
Loading safetensors checkpoint shards (eager): 59% Completed | 10/17 [00:11<00:08, 1.21s/it]
Loading safetensors checkpoint shards (eager): 65% Completed | 11/17 [00:12<00:07, 1.21s/it]
Loading safetensors checkpoint shards (eager): 71% Completed | 12/17 [00:14<00:06, 1.21s/it]
Loading safetensors checkpoint shards (eager): 76% Completed | 13/17 [00:15<00:04, 1.20s/it]
Loading safetensors checkpoint shards (eager): 82% Completed | 14/17 [00:16<00:03, 1.21s/it]
(Worker_TP2 pid=2941) INFO 11-12 06:59:25 [default_loader.py:267] Loading weights took 19.97 seconds
Loading safetensors checkpoint shards (eager): 88% Completed | 15/17 [00:17<00:02, 1.21s/it]
(Worker_TP3 pid=2942) INFO 11-12 06:59:26 [default_loader.py:267] Loading weights took 19.76 seconds
Loading safetensors checkpoint shards (eager): 94% Completed | 16/17 [00:18<00:01, 1.16s/it]
Loading safetensors checkpoint shards (eager): 100% Completed | 17/17 [00:19<00:00, 1.12s/it]
Loading safetensors checkpoint shards (eager): 100% Completed | 17/17 [00:19<00:00, 1.16s/it]
(Worker_TP0 pid=2939)
(Worker_TP0 pid=2939) INFO 11-12 06:59:27 [default_loader.py:267] Loading weights took 19.78 seconds
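(Worker_TP1 pid=2940) INFO 11-12 06:59:28 [default_loader.py:267] Loading weights took 19.29 seconds
As the results show, eager loading is nearly five times faster: per-worker weight loading drops from roughly 94 to 98 seconds (lazy) to roughly 19 to 20 seconds (eager).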
- Use the quay.io/ascend/vllm-ascend:main image for nightly tests:
Since nightly tests are less time-sensitive, we can consider using the latest image built from the main branch directly. See [CI] Add daily images build for nightly ci #3989 and [CI] Integrate mooncake to vllm-ascend base image #4062.
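To illustrate the eager-loading proposal, here is a minimal, stdlib-only sketch of how the two strategies differ. It assumes, based on the flag's description in vllm#24469 and not on vLLM's source, that eager loading reads each safetensors shard into memory with one sequential read before parsing, while lazy loading memory-maps the file and fetches pages from storage on demand. On network storage, many small on-demand page reads are typically much slower than one large sequential read. The names load_eager, load_lazy, and write_shard are hypothetical illustrations, not vLLM APIs. The parser follows the safetensors file layout: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes.

```python
import json
import mmap
import struct
from typing import Dict, Tuple

# Hypothetical result type: tensor name -> (dtype, shape, raw bytes).
Tensors = Dict[str, Tuple[str, list, bytes]]


def _parse_safetensors(buf) -> Tensors:
    """Parse a safetensors buffer: u64 header length, JSON header, tensor data."""
    (header_len,) = struct.unpack("<Q", buf[:8])
    header = json.loads(bytes(buf[8:8 + header_len]))
    data_start = 8 + header_len
    tensors: Tensors = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue  # optional free-form metadata, not a tensor
        begin, end = info["data_offsets"]
        tensors[name] = (info["dtype"], info["shape"],
                         bytes(buf[data_start + begin:data_start + end]))
    return tensors


def load_eager(path: str) -> Tensors:
    # Eager (hypothetical sketch): one sequential read pulls the whole shard
    # into memory, then all parsing happens in RAM.
    with open(path, "rb") as f:
        return _parse_safetensors(f.read())


def load_lazy(path: str) -> Tensors:
    # Lazy (hypothetical sketch): memory-map the file so pages are fetched
    # from storage only when a slice of the mapping is touched.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return _parse_safetensors(mm)


def write_shard(path: str, name: str, dtype: str, shape: list, raw: bytes) -> None:
    """Write a single-tensor safetensors file (demonstration helper only)."""
    header = json.dumps({name: {"dtype": dtype, "shape": shape,
                                "data_offsets": [0, len(raw)]}}).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)) + header + raw)
```

Both loaders return the same tensors; only the I/O pattern differs. In this toy version the lazy path copies every slice immediately so the mapping can be closed, whereas a real lazy loader would keep tensors backed by the mapping and read them as they are used.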
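To quantify the Qwen3-32B comparison from the first proposal, the following sketch parses the per-worker "Loading weights took N seconds" lines from the lazy and eager runs and computes the speedup. The log lines are copied from those runs; load_times is a hypothetical helper. Because model initialization waits for every tensor-parallel worker to finish loading, the slowest worker in each run is a reasonable proxy for wall-clock loading time.

```python
import re
from statistics import mean
from typing import Dict

# Matches vLLM log lines such as:
# (Worker_TP0 pid=1199) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 93.78 seconds
_PATTERN = re.compile(r"\((Worker_TP\d+) pid=\d+\).*?Loading weights took ([\d.]+) seconds")

LAZY_LOG = """\
(Worker_TP0 pid=1199) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 93.78 seconds
(Worker_TP3 pid=1202) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 95.21 seconds
(Worker_TP2 pid=1201) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 97.60 seconds
(Worker_TP1 pid=1200) INFO 11-12 06:57:28 [default_loader.py:267] Loading weights took 96.35 seconds
"""

EAGER_LOG = """\
(Worker_TP2 pid=2941) INFO 11-12 06:59:25 [default_loader.py:267] Loading weights took 19.97 seconds
(Worker_TP3 pid=2942) INFO 11-12 06:59:26 [default_loader.py:267] Loading weights took 19.76 seconds
(Worker_TP0 pid=2939) INFO 11-12 06:59:27 [default_loader.py:267] Loading weights took 19.78 seconds
(Worker_TP1 pid=2940) INFO 11-12 06:59:28 [default_loader.py:267] Loading weights took 19.29 seconds
"""


def load_times(log: str) -> Dict[str, float]:
    """Map each tensor-parallel worker to its reported weight-loading time in seconds."""
    return {worker: float(secs) for worker, secs in _PATTERN.findall(log)}


lazy = load_times(LAZY_LOG)
eager = load_times(EAGER_LOG)

# Per-worker speedup of eager over lazy loading.
per_worker = {w: lazy[w] / eager[w] for w in sorted(lazy)}
# Wall-clock proxy: compare the slowest worker of each run.
wall_clock = max(lazy.values()) / max(eager.values())
# Ratio of the mean per-worker loading times.
average = mean(lazy.values()) / mean(eager.values())

for w, s in per_worker.items():
    print(f"{w}: {s:.2f}x")
print(f"slowest worker: {wall_clock:.2f}x, mean: {average:.2f}x")
```

With these numbers, the per-worker speedup ranges from about 4.7x to 5.0x, and both the slowest-worker and mean comparisons come out at roughly 4.9x.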
Feedback Period.
No response
CC List.
@Yikun @zhangxinyuehfad @leo-pony @wangxiyuan
Any Other Things.
No response