
Commit 783ea0f

Authored by Lijiachen1018, ygwpz, harrisonyhq, qyh111, and lijiachen19
[rebase]Dev-ucm-v1 rebase to develop (#453)
* [opt] refactor uc connector (#364): refactor ucm_connector
* [Feat] Implement kv cache broadcast in MLA (#367)
  * Implement kv cache broadcast in MLA in ucm_connector
  * [Style] Change wait for broadcast into a single task method
* [feature] add ucm mock connector (#375)
  * add ucm mock connector
  * fix chunk prefill bug
* [Feat] Support getting the launch config from yaml (#377)
  * Support launch from config file
  * [Docs] Update documents for launch with yaml
  * [Fix] Change load-only-on-first-rank into a configuration option
  * [Feat] Add support for hit ratio in yaml
  * [Fix] Fix load-only-on-first-rank in the non-MLA scene
* [fix] refuse monkey patch (#383)
* [bugfix] fix GQA bug (#384)
* [bugfix] fix end == 0 bug (#385)
* [feature] optimize generate_tensor (#396)
* [Fix] fix MLA bug when there is no broadcast in wait for save (#398)
* [feat] adapt GQA & modify config.yaml (#407)
  * adapt GQA & modify config.yaml
  * move process to UCMDirectConnector
  * fix comment
  * modify hash function
  * fix style
  * code style and modify hash
  * init parent_block_hash_value
* [feat] Adapt vllm_ascend_0110 and add configurable options (#415)
  * avoid type conversion in init kvcache
* [patch] separate sparse patch (#417)
* [bugfix] Support tensor parallelism across servers (#420)
* [Feat] UCM supports metrics display online via Grafana and Prometheus (#414)
  * Build metrics frame
  * add metrics (ucm_obser.py + metrics_configs.yaml)
  * Implementation of metrics logger on the C++ side for storing and retrieving stats
  * Provide simple Grafana dashboard and fix bugs
  * change the log position of UCM metrics
  * modify grafana.json
  * Remove configs to examples and add license
* [feat] Merge develop to dev-ucm-v1 and fix code style (#428)
  * [fix] fix sparse attention (#397): fix Ascend attention
  * [opt] Share Infra implementation and unify status codes (#399): share infra module
  * [bugfix] Fix ESA to be compatible with the latest NFSStore (#401)
  * release v0.1.0rc4 (#402)
  * [opt] Remove unused C++ implementation of dramstore (#406)
  * [Fix] remove DRAM docs and modify the quick-start doc (#411); modify index.md
  * [Feature] Added a performance testing tool based on the PyTest testing framework (#295)
  * [Misc] Add cpp-linter.yml (#422)
  * [docs] add metrics doc (#416); modify metrics.md
  * [perf] Modify CUDA SIMD and add a Triton hash encoder (#408)
  * fix cpp code style
* add env variable ENABLE_SPARSE (#430)
* Fix(patch): fix patch for vllm-ascend (#433) (see volcengine/verl#2564)
* [bugfix] fix accuracy problem with chunked prefill (#438)
* [bugfix] fix num_schedule-tokens=1 (#442); simplify the code
* [fix] Fix sparse patch (#444)
* [bugfix] fix the Metrics module using a non-existent variable self.rank (#445)
* [Feature] Add an access bandwidth test script for ucm_connector (#418)
* [bugfix] adapt vllm 0.9.1 (#446)
* [Fix] Set the multiprocessing start method of the test tool to 'spawn' and add NPU cleanup (#447)
* [fix] Adapt all sparse-attention methods to the new connector (#441); adapt the YAML configuration
* [docs] renew docs for v1 (#448)
* set version to 0.1.0 (#450)
* [Feature] GSA adapts nfsStore (#451): adapt nfsstore, fix codestyle

Co-authored-by: ygwpz <543529648@qq.com>
Co-authored-by: harrisonyhq <harrisonyhq@gmail.com>
Co-authored-by: qyh111 <qiuyuhao1@huawei.com>
Co-authored-by: lijiachen19 <lijiachen19@huawei.com>
Co-authored-by: sumingZero <58885253+sumingZero@users.noreply.github.com>
Co-authored-by: flesher0813 <1208954694@qq.com>
Co-authored-by: Mag1c.H <hemajun815@163.com>
Co-authored-by: Fang Run <Fang_Run@126.com>
Co-authored-by: MaxWang <wangwenxin21@huawei.com>
Co-authored-by: hero0307 <tianxuehan0307@163.com>
Co-authored-by: t00939662 <tianxuehan@huawei.com>
Co-authored-by: ML <85485147+Menglths@users.noreply.github.com>
Co-authored-by: ShiXiaolei <indirashi@163.com>
Co-authored-by: zhou-haitao <74044944+zhou-haitao@users.noreply.github.com>
Co-authored-by: zbb200819 <1130072360@qq.com>
1 parent 52fe5a7 · commit 783ea0f


60 files changed: +5,226 additions, −2,598 deletions

docs/source/getting-started/installation_npu.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -57,7 +57,7 @@ cd ..
 ```
 
 ## Setup from docker
-Download the pre-built docker image provided or build unified-cache-management docker image by commands below:
+Download the pre-built `vllm-ascend` docker image or build unified-cache-management docker image by commands below:
 ```bash
 # Build docker image using source code, replace <branch_or_tag_name> with the branch or tag name needed
 git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
````

docs/source/getting-started/quick_start.md

Lines changed: 14 additions & 7 deletions
````diff
@@ -59,7 +59,17 @@ First, specify the python hash seed by:
 export PYTHONHASHSEED=123456
 ```
 
-Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
+Create a config yaml like following and save it to your own directory:
+```yaml
+# UCM Configuration File Example
+# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
+ucm_connector_name: "UcmNfsStore"
+
+ucm_connector_config:
+  storage_backends: "/mnt/test"
+```
+
+Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model and your config file path:
 
 ```bash
 # Change the model path to your own model path
@@ -73,14 +83,11 @@ vllm serve ${MODEL_PATH} \
   --port 7800 \
   --kv-transfer-config \
   '{
-    "kv_connector": "UnifiedCacheConnectorV1",
-    "kv_connector_module_path": "ucm.integration.vllm.uc_connector",
+    "kv_connector": "UCMConnector",
+    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
     "kv_role": "kv_both",
     "kv_connector_extra_config": {
-      "ucm_connector_name": "UcmNfsStore",
-      "ucm_connector_config": {
-        "storage_backends": "/home/test"
-      }
+      "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
     }
   }'
 ```
````
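Once the server is up, it exposes vLLM's standard OpenAI-compatible HTTP API on the chosen port. Below is a minimal smoke test as a sketch; it assumes port 7800 from the example above and that the model is registered under the same path passed to `vllm serve` (no `--served-model-name` is shown), so adjust both to your setup.

```python
# Minimal smoke test for the server launched above (standard library only).
# Assumptions: port 7800 as in the example; the model name equals the path
# passed to `vllm serve`. Adjust both as needed.
import json
import urllib.request

payload = {
    "model": "/home/models/Qwen2.5-14B-Instruct",  # placeholder for ${MODEL_PATH}
    "prompt": "Explain what a KV cache is in one sentence.",
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:7800/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])
```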

docs/source/user-guide/prefix-cache/nfs_store.md

Lines changed: 11 additions & 10 deletions
````diff
@@ -87,8 +87,15 @@ To use the NFS connector, you need to configure the `connector_config` dictionary
 
 ### Example:
 
-```python
-kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "transferStreamNumber": 32}}
+Create a config yaml like following and save it to your own directory:
+```yaml
+# UCM Configuration File Example
+# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
+ucm_connector_name: "UcmNfsStore"
+
+ucm_connector_config:
+  storage_backends: "/mnt/test"
+  transferStreamNumber: 32
 ```
 
 ## Launching Inference
````
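For reference, the YAML above parses into the same nested mapping that used to be passed inline via `kv_connector_extra_config`. A quick sketch with PyYAML; the config path is a placeholder:

```python
# Sketch: inspect what the UCM config file parses to (requires PyYAML).
import yaml

with open("/workspace/unified-cache-management/examples/ucm_config_example.yaml") as f:
    cfg = yaml.safe_load(f)

# Expected shape, mirroring the old inline kv_connector_extra_config:
# {'ucm_connector_name': 'UcmNfsStore',
#  'ucm_connector_config': {'storage_backends': '/mnt/test',
#                           'transferStreamNumber': 32}}
print(cfg["ucm_connector_name"], cfg["ucm_connector_config"])
```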
````diff
@@ -101,7 +108,7 @@ To start **offline inference** with the NFS connector, modify the script `examples/offline_inference.py`
 # In examples/offline_inference.py
 ktc = KVTransferConfig(
     ...
-    kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "transferStreamNumber": 32}}
+    kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
 )
 ```
````
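Put together, a self-contained offline run might look like the sketch below. The connector class and module path are copied from the online-serving example later in this same file; the model path is a placeholder, and `LLM(..., kv_transfer_config=...)` follows vLLM's standard offline API.

```python
# Sketch of a complete offline run using the NFS connector via a UCM config file.
# Connector name/module mirror the online example in this doc; paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="UnifiedCacheConnectorV1",
    kv_connector_module_path="ucm.integration.vllm.uc_connector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
    },
)

llm = LLM(model="/home/models/Qwen2.5-14B-Instruct", kv_transfer_config=ktc)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```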
````diff
@@ -131,13 +138,7 @@ vllm serve /home/models/Qwen2.5-14B-Instruct \
     "kv_connector": "UnifiedCacheConnectorV1",
     "kv_connector_module_path": "ucm.integration.vllm.uc_connector",
     "kv_role": "kv_both",
-    "kv_connector_extra_config": {
-      "ucm_connector_name": "UcmNfsStore",
-      "ucm_connector_config": {
-        "storage_backends": "/mnt/test",
-        "transferStreamNumber": 32
-      }
-    }
+    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
 }'
 ```
````

docs/source/user-guide/sparse-attention/esa.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -9,6 +9,7 @@ ESA provides developers with an intuitive example of how to implement their own
 ### Basic Usage
 ESA can be launched using the following command:
 ```shell
+export ENABLE_SPARSE=TRUE
 export MODEL_PATH="/path/to/model" # For example: /home/models/Qwen2.5-14B-Instruct
 export DATASET_PATH="/path/to/longbench/multifieldqa_zh.jsonl" # For example: /home/data/Longbench/data/multifieldqa_zh.jsonl
 python examples/offline_inference_esa.py
````
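Note the new `export ENABLE_SPARSE=TRUE` line: per the commit message, sparse attention is now gated behind this environment variable (#430), and the same export is added to the GSA, KVComp, and KVstar docs below. How UCM reads the flag is not part of this diff; the snippet below only illustrates the usual truthy-env-gate pattern, with a hypothetical `sparse_enabled()` helper.

```python
# Illustration only: a typical truthy env-flag gate. UCM's actual handling of
# ENABLE_SPARSE is not shown in this diff; the helper name is hypothetical.
import os

def sparse_enabled() -> bool:
    return os.environ.get("ENABLE_SPARSE", "").strip().upper() in {"1", "TRUE", "YES"}

if sparse_enabled():
    print("sparse-attention patches would be applied here")
```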

docs/source/user-guide/sparse-attention/gsa.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -107,6 +107,8 @@ ktc = KVTransferConfig(
 Thus, an example command for launching the online LLM service is as follows:
 
 ```shell
+export ENABLE_SPARSE=TRUE
+
 vllm serve /home/models/DeepSeek-R1-Distill-Qwen-32B \
   --served-model-name DeepSeek-R1-Distill-Qwen-32B \
   --max-model-len 131000 \
````
docs/source/user-guide/sparse-attention/kvcomp.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -97,6 +97,7 @@ This design ensures both **efficiency** and **accuracy** by preserving essential
 KVComp is part of the UCM Sparse Attention module. For installation instructions, please refer to the [UCM's top-level README](https://github.com/ModelEngine-Group/unified-cache-management). Once UCM is installed, KVComp is naturally supported by running the following example python scripts.
 
 ```bash
+export ENABLE_SPARSE=TRUE
 python ucm/sandbox/sparse/kvcomp/offline_inference_kvcomp.py
 ```
````

docs/source/user-guide/sparse-attention/kvstar.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -32,6 +32,7 @@ For long-sequence inference, KVstar achieves the following with minimal accuracy
 ### Basic Usage
 KVstar can be launched using the following command:
 ```shell
+export ENABLE_SPARSE=TRUE
 export MODEL_PATH="/path/to/model" # For example: /home/models/Qwen2.5-14B-Instruct
 export DATASET_PATH="/path/to/longbench/multifieldqa_zh.jsonl" # For example: /home/data/Longbench/data/multifieldqa_zh.jsonl
 export DATA_DIR="/path/to/data"
````
