
Commit 0fcae8d (release 0.2.0rc1)

2 parents: aa31619 + 319421f

File tree: 150 files changed, +8486 -3332 lines

.clang-format

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-BasedOnStyle: LLVM
+BasedOnStyle: Google
 IndentWidth: 4
 ColumnLimit: 100
 AccessModifierOffset: -4
```

.github/workflows/cpp-linter.yml

Lines changed: 2 additions & 2 deletions

```diff
@@ -4,7 +4,7 @@ on:
   push:
     branches: [ "*" ]
   pull_request:
-    branches: [ "dev*", "main", "*release" ]
+    branches: [ "dev*", "main", "*release", "feature*" ]


 jobs:
@@ -25,7 +25,7 @@ jobs:
       files-changed-only: true
       lines-changed-only: diff
       format-review: true
-      thread-comments: ${{ github.event_name == 'pull_request' && 'update' }}
+      version: 20

     - name: Fail fast?!
       if: steps.linter.outputs.checks-failed != 0
```

.github/workflows/ucmstore.yml

Lines changed: 4 additions & 12 deletions

```diff
@@ -6,14 +6,14 @@ on:
   push:
     branches: [ "*" ]
   pull_request:
-    branches: [ "dev*", "main", "*release" ]
+    branches: [ "dev*", "main", "*release", "feature*" ]

 env:
   # Customize the CMake build type here (Release, Debug, RelWithDebInfo, etc.)
   BUILD_TYPE: Debug

 jobs:
-  ci:
+  cc_gtest:
     # The CMake configure and build commands are platform agnostic and should work equally well on Windows or Mac.
     # You can convert this to a matrix build if you need cross-platform coverage.
     # See: https://docs.github.com/en/free-pro-team@latest/actions/learn-github-actions/managing-complex-workflows#using-a-build-matrix
@@ -24,28 +24,20 @@ jobs:

     - name: Install googletest
       run: |
-        git clone https://github.com/google/googletest.git --depth=1 --branch=v1.12.0
+        git clone https://github.com/google/googletest.git --depth=1 --branch=v1.17.0
         cd googletest
         mkdir build && cd build
         cmake -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_C_FLAGS="-fPIC" -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_STANDARD_REQUIRED=True ..
         sudo make install -j

-    - name: Install mockcpp
-      run: |
-        git clone https://github.com/sinojelly/mockcpp.git --depth=1
-        cd mockcpp
-        mkdir build && cd build
-        cmake -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_C_FLAGS="-fPIC" -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_STANDARD_REQUIRED=True -DMOCKCPP_XUNIT="gtest" -DMOCKCPP_XUNIT_HOME=/usr/local/ ..
-        sudo make install -j
-
     - name: Configure CMake
       # Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only required if you are using a single-configuration generator such as make.
       # See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html?highlight=cmake_build_type
       run: cmake -B ${{github.workspace}}/build -DCMAKE_BUILD_TYPE=${{env.BUILD_TYPE}} -DBUILD_UCM_SPARSE=OFF -DBUILD_UNIT_TESTS=ON -DRUNTIME_ENVIRONMENT=simu

     - name: Build
       # Build your program with the given configuration
-      run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}}
+      run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}} -j

     - name: Test
       working-directory: ${{github.workspace}}/build
```

.github/workflows/unifiedcache_test.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -6,11 +6,13 @@ on:
       - 'main'
       - 'dev*'
       - '*release'
+      - 'feature*'
   pull_request:
     branches:
       - 'main'
       - 'dev*'
       - '*release'
+      - 'feature*'

 jobs:
   # gpu-test:
```

MANIFEST.in

Lines changed: 3 additions & 7 deletions

```diff
@@ -1,8 +1,4 @@
-include LICENSE
-include pyproject.toml
 include CMakeLists.txt
-include requirements.txt
-include setup.py
-
-recursive-include examples *
-recursive-include benchmarks *
+graft ucm
+graft examples
+graft benchmarks
```
Lines changed: 109 additions & 0 deletions
# CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

<div align="center">

![blend_scheme.jpg](../../_static/images/blend_scheme.jpg)

**🚀 Cached Knowledge Fusion Algorithm | 📄 EuroSys 2025 Paper**

[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)

</div>

## 🌟 What is CacheBlend?

**CacheBlend** is a cache-fusion system that combines multiple pre-computed KV caches when their corresponding texts are concatenated in the LLM input. By selectively recomputing the KV cache values of a small fraction of tokens, CacheBlend reduces TTFT by 2.2 ~ 3.3× and increases throughput by 2.8 ~ 5× with a negligible quality drop.
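Conceptually, that "small fraction" is the set of tokens whose loaded KV values deviate most from what full recomputation would produce (the paper's HKVD tokens). Below is a minimal sketch of the selection step only; the tensor names, shapes, and the bare top-k selection are illustrative assumptions, not UCM's actual API:

```python
import torch

def select_hkvd_tokens(kv_loaded: torch.Tensor,
                       kv_full: torch.Tensor,
                       ratio: float = 0.2) -> torch.Tensor:
    """Pick the tokens whose loaded KV deviates most from freshly
    recomputed KV on an early layer, returning their indices.

    kv_loaded / kv_full: [num_tokens, hidden] (illustrative shapes).
    """
    # Per-token squared L2 deviation between loaded and recomputed KV.
    deviation = (kv_loaded - kv_full).float().pow(2).sum(dim=-1)
    k = max(1, int(ratio * deviation.numel()))
    return torch.topk(deviation, k).indices
```

Only these selected tokens (plus cache-miss and suffix tokens) are recomputed; the rest of the KV cache is reused as loaded, which is where the TTFT savings come from.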

### 🎯 Key Components

- **🔍 Loading Controller**: orchestrates which KV caches to load, where to load them from, and how much recomputation is needed.
- **⚡ KV Cache Store**: manages persistent storage, lookup, and eviction of precomputed KV caches keyed by text-chunk identity.
- **🎛️ Cache Fusor**: merges multiple chunk-level caches into one coherent, cross-attention-correct KV cache using minimal recomputation (see the interface sketch below).
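To make the division of labor concrete, here is a hypothetical interface sketch of the three components; all names and signatures are assumptions for illustration, not UCM's real classes:

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

@dataclass
class LoadPlan:
    """What the Loading Controller decides for one request."""
    chunk_ids: Sequence[str]            # which precomputed chunk caches to load
    source: str                         # where to load them from (e.g. an NFS path)
    recompute_token_idx: Sequence[int]  # tokens that still need recomputation

class KVCacheStore(Protocol):
    """Persistent store of chunk-level KV caches, keyed by chunk hash."""
    def lookup(self, chunk_hash: str) -> bool: ...
    def load(self, chunk_hash: str) -> object: ...
    def evict(self, chunk_hash: str) -> None: ...

class CacheFusor(Protocol):
    """Merges chunk-level caches into one coherent request-level KV cache."""
    def fuse(self, chunk_caches: Sequence[object], plan: LoadPlan) -> object: ...
```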

### 🔥 Key Results

- **2.2 ~ 3.3× lower TTFT** and **2.8 ~ 5× higher throughput** for long sequences
- **High quality preserved**: no more than a 1% ~ 3% quality drop compared to full KV recompute

## 🧠 UCM Implementation

### Native Block-Wise Chunk KV Cache Dump, Load, PostProcess and Recompute

1. **🔐 Chunk Hash Encoding**: Similar to the prefix hash encoder, all blocks in each chunk are hashed starting from the same initial hash meta.
2. **⚡ Combine Prefix Cache and Chunk Cache**: Since the chunk cache and the native prefix cache share the same hash space, UCM first performs a prefix-cache lookup to fetch fully reusable cache, then a chunk-cache lookup to fetch candidate cache for blending.
3. **🎯 Delta-RoPE PostProcess**: Rectify the loaded chunk caches according to their positions in the new request (see the sketch after this list).
4. **🔍 Integrate Cache Blend and First Token Generation**: Construct the compute mask and attention metadata from the HKVD tokens, cache-miss tokens, and suffix tokens, then compute their KV cache in a single model forward pass.
5. **🚀 Comprehensive Hook for the LLM Forward Pipeline**: Built on the UCM sparse module, the Blend module sparsifies the prefill tokens not only in the attention stage but also in the FFN and per-layer stages.
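Because RoPE rotations compose, a key cached at position `p_old` can be rectified for position `p_new` by rotating it through the delta `p_new - p_old`. Below is a minimal, self-contained sketch of that idea (step 3 above); it assumes the interleaved-pair RoPE convention, and the function names and shapes are illustrative rather than UCM's actual kernels:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor,
                theta: float = 10000.0) -> torch.Tensor:
    """Apply rotary embedding at the given positions.
    x: [num_tokens, head_dim]; positions: [num_tokens]."""
    d = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = positions[:, None].float() * inv_freq[None, :]  # [T, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def delta_rope(k_cached: torch.Tensor, old_pos: torch.Tensor,
               new_pos: torch.Tensor) -> torch.Tensor:
    # Rotating by (new_pos - old_pos) moves a key that was cached at
    # old_pos to new_pos, since successive RoPE rotations add angles.
    return rope_rotate(k_cached, new_pos - old_pos)
```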

## 🚀 Quick Start

### Installation

Blend is part of the UCM Sparse Attention module. For installation instructions, please refer to the [UCM top-level README](https://github.com/ModelEngine-Group/unified-cache-management). Once UCM is installed, Blend is supported out of the box by running the following example script:

```bash
export ENABLE_SPARSE=TRUE
export DATA_DIR=/home/data/kv_cache
export MODEL_PATH=/home/models/mistralai/Mistral-7B-Instruct-v0.2
export BLEND_DATASET_PATH=/home/datasets/LongBench/data/2wikimqa.jsonl
python <ucm-repo>/examples/offline_inference_blend.py
```
### Basic Usage

Usage is similar to UCM's `offline_inference_esa.py` example; we only need to set `ucm_sparse_method` to `Blend` and specify its meta config, as shown below.

```python
...
ktc = KVTransferConfig(
    kv_connector=name,
    kv_connector_module_path=module_path,
    kv_role="kv_both",
    kv_connector_extra_config={
        "ucm_connectors": [
            {
                "ucm_connector_name": "UcmNfsStore",
                "ucm_connector_config": {
                    "storage_backends": data_dir,
                    "kv_block_size": 33554432,
                },
            }
        ],
        "load_only_first_rank": False,
        "ucm_sparse_config": {
            "Blend": {
                "chunk_end_token_id": chunk_end_token_id,
                "compute_meta": {
                    "model.layers.1.self_attn.attn": {
                        "ratio": 0.2,
                    },
                },
            }
        },
        "use_layerwise": True,
    },
)
...
```
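The config above assumes a `chunk_end_token_id` that marks chunk boundaries in the prompt. One possible way to derive it, assuming a Hugging Face tokenizer is available and that the chosen delimiter string encodes to exactly one token (both are assumptions, not requirements documented here):

```python
from transformers import AutoTokenizer

# Hypothetical helper: turn a chosen delimiter string into a token id.
tokenizer = AutoTokenizer.from_pretrained(
    "/home/models/mistralai/Mistral-7B-Instruct-v0.2")
ids = tokenizer.encode(" #", add_special_tokens=False)
assert len(ids) == 1, "pick a delimiter that encodes to exactly one token"
chunk_end_token_id = ids[0]
```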
## 📊 Supported Models

Llama-based and Qwen-based models are currently supported.
## 🎓 Citation

```bibtex
@inproceedings{yao2025cacheblend,
  title={CacheBlend: Fast large language model serving for RAG with cached knowledge fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  booktitle={Proceedings of the Twentieth European Conference on Computer Systems},
  pages={94--109},
  year={2025}
}
```

---

<div align="center">

**🌟 Star the [UCM](https://github.com/ModelEngine-Group/unified-cache-management) repository if you find CacheBlend useful!**

</div>
