# CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
<div align="center">

**🚀 Cached Knowledge Fusion Algorithm | 📄 EuroSys 2025 Paper**

[License](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
[Python](https://python.org)

</div>

## 🌟 What is CacheBlend?

**CacheBlend** is a KV-cache fusion system that combines multiple pre-computed KV caches when their corresponding texts
are concatenated in the LLM input. By selectively recomputing the KV cache values of a small fraction of tokens,
CacheBlend reduces TTFT by 2.2~3.3× and increases throughput by 2.8~5× with a negligible quality drop.

### 🎯 Key Components

- **🔍 Loading Controller**: The Loading Controller orchestrates which KV caches to load, where to load them from, and how much recomputation is needed.
- **⚡ KV Cache Store**: The KV Cache Store manages persistent storage, lookup, and eviction of precomputed KV caches keyed by text-chunk identity.
- **🎛️ Cache Fusor**: The Fusor merges multiple chunk-level caches into one coherent, cross-attention-correct KV cache using minimal recomputation (see the sketch below).
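
The sketch below gives a purely illustrative, simplified view of how these three components interact; the object and method names (`controller.plan`, `store.lookup_prefix`, `fusor.fuse`, etc.) are hypothetical and do not correspond to the actual UCM/CacheBlend API.

```python
# Illustrative sketch only: all names below are hypothetical, not the UCM API.
def build_blended_kv_cache(request_tokens, controller, store, fusor):
    # Loading Controller: decide which caches to load and how much to recompute.
    plan = controller.plan(request_tokens)

    # KV Cache Store: fetch the fully reusable prefix cache and the per-chunk
    # candidate caches, keyed by text-chunk identity.
    prefix_kv = store.lookup_prefix(plan.prefix_blocks)
    chunk_kvs = [store.lookup_chunk(chunk) for chunk in plan.chunks]

    # Cache Fusor: merge the chunk caches into one cross-attention-correct cache,
    # recomputing only the small fraction of tokens selected by the controller.
    return fusor.fuse(prefix_kv, chunk_kvs, recompute_ratio=plan.recompute_ratio)
```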

### 🔥 Key Results
- **2.2~3.3× lower TTFT** and **2.8~5× higher throughput** for long sequences
- **High quality preserved**: no more than a 1%~3% quality drop compared to full KV recompute

## 🧠 UCM Implementation

### Native Block-Wise Chunk KV Cache Dump, Load, Post-Process and Recompute
1. **🔐 Chunk Hash Encoding**: Similar to the prefix hash encoder, all blocks within a chunk are hashed starting from the same hash meta.
2. **⚡ Combine Prefix Cache and Chunk Cache**: Since the chunk cache and the native prefix cache share the same hash space, UCM first performs a prefix-cache lookup to fetch fully reusable caches, then a chunk-cache lookup to fetch the candidate caches for blending.
3. **🎯 Delta-RoPE Post-Process**: Rectify the loaded chunk caches according to their positions in the new request (see the sketch after this list).
4. **🔍 Integrate Cache Blend and First Token Generation**: Construct the compute mask and attention metadata from the HKVD tokens, cache-miss tokens, and suffix tokens, then compute their KV caches in a single model forward pass.
5. **🚀 Comprehensive Hook for the LLM Forward Pipeline**: Built on the UCM sparse module, the Blend module sparsifies prefill tokens not only in the attention stage but also in the FFN and layer stages.
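
To make the Delta-RoPE step concrete, the sketch below re-rotates a loaded chunk's keys by the offset between the position at which the chunk was cached and its position in the new request. This is a minimal sketch assuming an interleaved rotary-embedding layout; the function name and tensor layout are illustrative and not taken from the UCM code.

```python
import torch

def delta_rope_rectify(keys, cached_start, new_start, rope_theta=10000.0):
    """Re-rotate cached keys by the position delta (new_start - cached_start).

    keys: [num_tokens, num_heads, head_dim] keys loaded from a chunk cache,
          originally rotated for positions starting at `cached_start`.
    Returns keys rotated as if computed at positions starting at `new_start`.
    """
    head_dim = keys.shape[-1]
    delta = new_start - cached_start

    # Standard RoPE frequencies over half the head dimension.
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = delta * inv_freq                 # the same extra rotation for every token
    cos, sin = angles.cos(), angles.sin()

    k1, k2 = keys[..., 0::2], keys[..., 1::2]  # interleaved even/odd dims
    rotated = torch.empty_like(keys)
    rotated[..., 0::2] = k1 * cos - k2 * sin
    rotated[..., 1::2] = k1 * sin + k2 * cos
    return rotated
```

Because rotary embeddings compose additively, applying this extra rotation by the position delta is equivalent to having computed the keys at their new positions in the first place.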

## 🚀 Quick Start

### Installation

Blend is part of the UCM Sparse Attention module. For installation instructions, please refer to the [UCM top-level README](https://github.com/ModelEngine-Group/unified-cache-management). Once UCM is installed, Blend works out of the box; run the following example Python script:

```bash
export ENABLE_SPARSE=TRUE
export DATA_DIR=/home/data/kv_cache
export MODEL_PATH=/home/models/mistralai/Mistral-7B-Instruct-v0.2
export BLEND_DATASET_PATH=/home/datasets/LongBench/data/2wikimqa.jsonl
python <ucm-repo>/examples/offline_inference_blend.py
```

### Basic Usage
Usage is similar to UCM's `offline_inference_esa.py` example: we only need to specify the `ucm_sparse_method` as `Blend` and provide its meta config, as shown below.

```python
...
ktc = KVTransferConfig(
    kv_connector=name,
    kv_connector_module_path=module_path,
    kv_role="kv_both",
    kv_connector_extra_config={
        "ucm_connectors": [
            {
                "ucm_connector_name": "UcmNfsStore",
                "ucm_connector_config": {
                    "storage_backends": data_dir,
                    "kv_block_size": 33554432,
                },
            }
        ],
        "load_only_first_rank": False,
        "ucm_sparse_config": {
            "Blend": {
                "chunk_end_token_id": chunk_end_token_id,
                "compute_meta": {
                    "model.layers.1.self_attn.attn": {
                        "ratio": 0.2,
                    },
                },
            }
        },
        "use_layerwise": True,
    },
)
...
```

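For completeness, the resulting `KVTransferConfig` is then passed to the vLLM engine, as in the other UCM example scripts. The snippet below is a minimal sketch; `model_path` and the prompt are placeholders, and the full argument set used by `offline_inference_blend.py` may differ.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model=model_path,           # placeholder: e.g. the Mistral-7B-Instruct-v0.2 path above
    kv_transfer_config=ktc,     # the config constructed above
    enforce_eager=True,
)
outputs = llm.generate(["<your RAG prompt here>"],
                       SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```
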
## 📊 Supported Models
Llama-based and Qwen-based models are currently supported.

## 🎓 Citation

```bibtex
@inproceedings{yao2025cacheblend,
  title={CacheBlend: Fast large language model serving for RAG with cached knowledge fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  booktitle={Proceedings of the Twentieth European Conference on Computer Systems},
  pages={94--109},
  year={2025}
}
```

---

<div align="center">

**🌟 Star the [UCM](https://github.com/ModelEngine-Group/unified-cache-management) repository if you find CacheBlend useful!**

</div>