Description
Proposal to improve performance
No response
Report of performance regression
I compared the performance of UCM against the vLLM baseline using the benchmark provided in the repo, and found that throughput dropped significantly (roughly a 10× drop):
Baseline: 18246.54 tokens/s
UCM: 1745.75 tokens/s
Hardware: 2× H100 GPUs
Is my configuration correct?
Commit: 28f6f35
Prefill kv transfer config:
{
  "kv_connector": "UnifiedCacheConnectorV1",
  "kv_connector_module_path": "ucm.integration.vllm.uc_connector",
  "kv_role": "kv_producer",
  "kv_connector_extra_config": {
    "ucm_connector_name": "UcmDramStore",
    "ucm_connector_config": {
      "max_cache_size": 5368709120,
      "kv_block_size": 262144
    }
  }
}
Decode kv transfer config:
{
  "kv_connector": "UnifiedCacheConnectorV1",
  "kv_connector_module_path": "ucm.integration.vllm.uc_connector",
  "kv_role": "kv_consumer",
  "kv_connector_extra_config": {
    "ucm_connector_name": "UcmDramStore",
    "ucm_connector_config": {
      "max_cache_size": 5368709120,
      "kv_block_size": 262144
    }
  },
  "ucm_sparse_config": {
    "GSA": {}
  }
}
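For readability, the cache sizes used in both configs above in human units (assuming both fields are byte counts, which I am inferring rather than certain of):
# Assumption: max_cache_size and kv_block_size are byte counts.
echo $((5368709120 / 1024 / 1024 / 1024))  # 5   -> ~5 GiB DRAM cache (max_cache_size)
echo $((262144 / 1024))                    # 256 -> 256 KiB per KV block (kv_block_size)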
vLLM server args (prefill and decode use the same server args):
--model /models/Qwen3-30B-A3B-Instruct-2507 --max-model-len 80000 --trust-remote-code --gpu_memory_utilization 0.9 --enforce-eager --no-enable-prefix-caching --block-size 128 --dtype bfloat16 --tensor-parallel-size 1
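For reference, a sketch of how I combine the server args and the configs above into the two launch commands, assuming the JSON is saved to prefill_kv_config.json / decode_kv_config.json (hypothetical filenames) and passed inline via vLLM's --kv-transfer-config flag, with each instance pinned to one H100:
COMMON_ARGS="--model /models/Qwen3-30B-A3B-Instruct-2507 --max-model-len 80000 \
  --trust-remote-code --gpu_memory_utilization 0.9 --enforce-eager \
  --no-enable-prefix-caching --block-size 128 --dtype bfloat16 --tensor-parallel-size 1"

# Prefill instance (kv_producer); port matches --prefiller-port below
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server $COMMON_ARGS \
  --port 43210 --kv-transfer-config "$(cat prefill_kv_config.json)" &

# Decode instance (kv_consumer); port matches --decoder-port below
CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server $COMMON_ARGS \
  --port 43211 --kv-transfer-config "$(cat decode_kv_config.json)" &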
proxy server command:
python3 toy_proxy_server.py --host localhost --port 43215 --prefiller-host localhost --prefiller-port 43210 --pd-disaggregation --decoder-host localhost --decoder-port 43211
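Before replaying the full trace, I verify that a single request flows through the proxy (a sketch, assuming the toy proxy forwards the OpenAI-compatible /v1/completions endpoint):
curl -s http://localhost:43215/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-30B-A3B-Instruct-2507", "prompt": "Hello", "max_tokens": 16}'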
Benchmark launching command:
python3 trace_replay.py \
  --model "/models/Qwen3-30B-A3B-Instruct-2507" \
  --backend vllm \
  --trace-path FAST25-release/traces/conversation_trace.jsonl \
  --trace-mode trace \
  --host 127.0.0.1 \
  --port 43215 \
  --save-result \
  --save-prompts
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`