
Commit d4f82a5

Merge branch 'develop' into dev_1211

2 parents: de4a445 + 010844e

Note: large commits hide some content by default, so only a subset of the changed files is shown below.

44 files changed (+4499 −174 lines)

MANIFEST.in

Lines changed: 3 additions & 6 deletions
@@ -1,7 +1,4 @@
-include LICENSE
-include pyproject.toml
 include CMakeLists.txt
-include requirements.txt
-
-recursive-include examples *
-recursive-include benchmarks *
+graft ucm
+graft examples
+graft benchmarks

README.md

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ in either a local filesystem for single-machine scenarios or through NFS mount p
 
 ## Quick Start
 
-please refer to [Quick Start](https://ucm.readthedocs.io/en/latest/getting-started/quick_start.html).
+please refer to [Quick Start for vLLM](https://ucm.readthedocs.io/en/latest/getting-started/quickstart_vllm.html) and [Quick Start for vLLM-Ascend](https://ucm.readthedocs.io/en/latest/getting-started/quickstart_vllm_ascend.html).
 
 ---
 

docs/source/getting-started/quickstart_vllm.md

Lines changed: 5 additions & 4 deletions
@@ -135,7 +135,6 @@ Then run following commands:
 ```bash
 cd examples/
 # Change the model path to your own model path
-export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
 python offline_inference.py
 ```
 
@@ -163,12 +162,14 @@ vllm serve Qwen/Qwen2.5-14B-Instruct \
 --kv-transfer-config \
 '{
 "kv_connector": "UCMConnector",
+"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
 "kv_role": "kv_both",
-"kv_connector_extra_config": {"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"}
+"kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
 }'
 ```
+**⚠️ The parameter `--no-enable-prefix-caching` is for SSD performance testing, please remove it for production.**
 
-**⚠️ Make sure to replace `"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**
+**⚠️ Make sure to replace `"/workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**
 
 
 If you see log as below:
@@ -187,7 +188,7 @@ After successfully started the vLLM server,You can interact with the API as fo
 curl http://localhost:7800/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
-"model": "/home/models/Qwen2.5-14B-Instruct",
+"model": "Qwen/Qwen2.5-14B-Instruct",
 "prompt": "You are a highly specialized assistant whose mission is to faithfully reproduce English literary texts verbatim, without any deviation, paraphrasing, or omission. Your primary responsibility is accuracy: every word, every punctuation mark, and every line must appear exactly as in the original source. Core Principles: Verbatim Reproduction: If the user asks for a passage, you must output the text word-for-word. Do not alter spelling, punctuation, capitalization, or line breaks. Do not paraphrase, summarize, modernize, or \"improve\" the language. Consistency: The same input must always yield the same output. Do not generate alternative versions or interpretations. Clarity of Scope: Your role is not to explain, interpret, or critique. You are not a storyteller or commentator, but a faithful copyist of English literary and cultural texts. Recognizability: Because texts must be reproduced exactly, they will carry their own cultural recognition. You should not add labels, introductions, or explanations before or after the text. Coverage: You must handle passages from classic literature, poetry, speeches, or cultural texts. Regardless of tone—solemn, visionary, poetic, persuasive—you must preserve the original form, structure, and rhythm by reproducing it precisely. Success Criteria: A human reader should be able to compare your output directly with the original and find zero differences. The measure of success is absolute textual fidelity. Your function can be summarized as follows: verbatim reproduction only, no paraphrase, no commentary, no embellishment, no omission. Please reproduce verbatim the opening sentence of the United States Declaration of Independence (1776), starting with \"When in the Course of human events\" and continuing word-for-word without paraphrasing.",
 "max_tokens": 100,
 "temperature": 0

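The hunk above adds `kv_connector_module_path` to the serve-time `--kv-transfer-config` (the same change recurs in the vLLM-Ascend quickstart and the NFS store guide below). For reference, here is a minimal offline sketch of the same connector settings; it is not part of this commit, and it assumes a recent vLLM whose `KVTransferConfig` exposes `kv_connector_module_path`. The model name and `UCM_CONFIG_FILE` path are placeholders to adjust for your environment.

```python
# Hedged sketch (not from this commit): offline equivalent of the --kv-transfer-config
# shown in the quickstart diff above. Assumes vllm.config.KVTransferConfig accepts
# kv_connector_module_path; the model and config path are placeholders.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="UCMConnector",
    kv_connector_module_path="ucm.integration.vllm.ucm_connector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
    },
)

# block_size=128 follows the recommendation in the NFS store guide changed below.
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct", kv_transfer_config=ktc, block_size=128)
outputs = llm.generate(["Hello, UCM!"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)
```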
docs/source/getting-started/quickstart_vllm_ascend.md

Lines changed: 5 additions & 4 deletions
@@ -103,7 +103,6 @@ Then run following commands:
 ```bash
 cd examples/
 # Change the model path to your own model path
-export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
 python offline_inference.py
 ```
 
@@ -131,12 +130,14 @@ vllm serve Qwen/Qwen2.5-14B-Instruct \
 --kv-transfer-config \
 '{
 "kv_connector": "UCMConnector",
+"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
 "kv_role": "kv_both",
-"kv_connector_extra_config": {"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"}
+"kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
 }'
 ```
+**⚠️ The parameter `--no-enable-prefix-caching` is for SSD performance testing, please remove it for production.**
 
-**⚠️ Make sure to replace `"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**
+**⚠️ Make sure to replace `"/workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**
 
 
 If you see log as below:
@@ -155,7 +156,7 @@ After successfully started the vLLM server,You can interact with the API as fo
 curl http://localhost:7800/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
-"model": "/home/models/Qwen2.5-14B-Instruct",
+"model": "Qwen/Qwen2.5-14B-Instruct",
 "prompt": "You are a highly specialized assistant whose mission is to faithfully reproduce English literary texts verbatim, without any deviation, paraphrasing, or omission. Your primary responsibility is accuracy: every word, every punctuation mark, and every line must appear exactly as in the original source. Core Principles: Verbatim Reproduction: If the user asks for a passage, you must output the text word-for-word. Do not alter spelling, punctuation, capitalization, or line breaks. Do not paraphrase, summarize, modernize, or \"improve\" the language. Consistency: The same input must always yield the same output. Do not generate alternative versions or interpretations. Clarity of Scope: Your role is not to explain, interpret, or critique. You are not a storyteller or commentator, but a faithful copyist of English literary and cultural texts. Recognizability: Because texts must be reproduced exactly, they will carry their own cultural recognition. You should not add labels, introductions, or explanations before or after the text. Coverage: You must handle passages from classic literature, poetry, speeches, or cultural texts. Regardless of tone—solemn, visionary, poetic, persuasive—you must preserve the original form, structure, and rhythm by reproducing it precisely. Success Criteria: A human reader should be able to compare your output directly with the original and find zero differences. The measure of success is absolute textual fidelity. Your function can be summarized as follows: verbatim reproduction only, no paraphrase, no commentary, no embellishment, no omission. Please reproduce verbatim the opening sentence of the United States Declaration of Independence (1776), starting with \"When in the Course of human events\" and continuing word-for-word without paraphrasing.",
 "max_tokens": 100,
 "temperature": 0

docs/source/user-guide/prefix-cache/nfs_store.md

Lines changed: 1 addition & 2 deletions
@@ -109,8 +109,6 @@ Explanation:
 
 ## Launching Inference
 
-### Offline Inference
-
 In this guide, we describe **online inference** using vLLM with the UCM connector, deployed as an OpenAI-compatible server. For best performance with UCM, it is recommended to set `block_size` to 128.
 
 To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:
@@ -129,6 +127,7 @@ vllm serve Qwen/Qwen2.5-14B-Instruct \
 '{
 "kv_connector": "UCMConnector",
 "kv_role": "kv_both",
+"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
 "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"}
 }'
 ```

pyproject.toml

Lines changed: 7 additions & 4 deletions
@@ -1,12 +1,15 @@
 [build-system]
-requires = ["setuptools>=45", "wheel", "cmake", "torch", "pybind11"]
+requires = [
+    "setuptools>=64",
+    "cmake>=3.18",
+    "wheel",
+]
 build-backend = "setuptools.build_meta"
 
 [project]
 name = "uc-manager"
-authors = [{name = "UCM Team"}]
-license = "MIT"
-license-files = ["LICENSE"]
+authors = [{name = "Unified Cache Team"}]
+license = { file="LICENSE" }
 readme = "README.md"
 description = "Persist and reuse KV Cache to speedup your LLM."
 requires-python = ">=3.10"

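With `torch` and `pybind11` dropped from `[build-system].requires` (and no longer imported by `setup.py`, next diff), the isolated build environment only needs setuptools, cmake, and wheel, while the target platform is still selected through environment variables. Below is a hedged sketch of driving a platform-specific wheel build under the new layout; the choice of `pip wheel` and the output directory are assumptions, not part of this commit.

```python
# Hypothetical build driver (not part of this commit). PLATFORM and ENABLE_SPARSE
# are read by setup.py via os.getenv(); environment variables are inherited by
# pip's isolated build subprocess.
import os
import subprocess
import sys

env = dict(os.environ)
env["PLATFORM"] = "cuda"        # or "ascend" / "musa" / "maca"; unset falls back to "simu"
env["ENABLE_SPARSE"] = "false"  # "true" adds -DBUILD_UCM_SPARSE=ON

subprocess.check_call(
    [sys.executable, "-m", "pip", "wheel", ".", "--no-deps", "-w", "dist/"],
    env=env,
)
```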
setup.py

Lines changed: 34 additions & 101 deletions
@@ -25,150 +25,83 @@
 import os
 import subprocess
 import sys
-import sysconfig
-from glob import glob
 
-import pybind11
-import torch
-import torch.utils.cpp_extension
 from setuptools import Extension, find_packages, setup
 from setuptools.command.build_ext import build_ext
 
 ROOT_DIR = os.path.abspath(os.path.dirname(__file__))
 PLATFORM = os.getenv("PLATFORM")
-
 ENABLE_SPARSE = os.getenv("ENABLE_SPARSE")
 
 
 def _enable_sparse() -> bool:
     return ENABLE_SPARSE is not None and ENABLE_SPARSE.lower() == "true"
 
 
-def _is_cuda() -> bool:
-    return PLATFORM == "cuda"
-
-
-def _is_npu() -> bool:
-    return PLATFORM == "ascend"
-
-
-def _is_musa() -> bool:
-    return PLATFORM == "musa"
-
-
-def _is_maca() -> bool:
-    return PLATFORM == "maca"
-
-
 class CMakeExtension(Extension):
-    def __init__(self, name: str, sourcedir: str = ""):
+    def __init__(self, name: str, source_dir: str = ""):
         super().__init__(name, sources=[])
-        self.sourcedir = os.path.abspath(sourcedir)
+        self.cmake_file_path = os.path.abspath(source_dir)
 
 
 class CMakeBuild(build_ext):
     def run(self):
+        build_dir = os.path.abspath(self.build_temp)
+        os.makedirs(build_dir, exist_ok=True)
+
         for ext in self.extensions:
            self.build_cmake(ext)
 
     def build_cmake(self, ext: CMakeExtension):
-        build_dir = self.build_temp
-        os.makedirs(build_dir, exist_ok=True)
+        build_dir = os.path.abspath(self.build_temp)
+        install_dir = os.path.abspath(self.build_lib)
 
         cmake_args = [
-            "cmake",
             "-DCMAKE_BUILD_TYPE=Release",
             f"-DPYTHON_EXECUTABLE={sys.executable}",
+            f"-DCMAKE_INSTALL_PREFIX={install_dir}",
         ]
 
-        torch_cmake_prefix = torch.utils.cmake_prefix_path
-        pybind11_cmake_dir = pybind11.get_cmake_dir()
-
-        cmake_prefix_paths = [torch_cmake_prefix, pybind11_cmake_dir]
-        cmake_args.append(f"-DCMAKE_PREFIX_PATH={';'.join(cmake_prefix_paths)}")
-
-        torch_includes = torch.utils.cpp_extension.include_paths()
-        python_include = sysconfig.get_path("include")
-        pybind11_include = pybind11.get_include()
-
-        all_includes = torch_includes + [python_include, pybind11_include]
-        cmake_include_string = ";".join(all_includes)
-        cmake_args.append(f"-DEXTERNAL_INCLUDE_DIRS={cmake_include_string}")
-
-        if _is_cuda():
-            cmake_args.append("-DRUNTIME_ENVIRONMENT=cuda")
-        elif _is_npu():
-            cmake_args.append("-DRUNTIME_ENVIRONMENT=ascend")
-        elif _is_musa():
-            cmake_args.append("-DRUNTIME_ENVIRONMENT=musa")
-        elif _is_maca():
-            cmake_args.append("-DRUNTIME_ENVIRONMENT=maca")
-            cmake_args.append("-DBUILD_UCM_SPARSE=OFF")
-        else:
-            raise RuntimeError(
-                "No supported accelerator found. "
-                "Please ensure either CUDA/MUSA or NPU is available."
-            )
-
         if _enable_sparse():
-            cmake_args.append("-DBUILD_UCM_SPARSE=ON")
-
-        cmake_args.append(ext.sourcedir)
+            cmake_args += ["-DBUILD_UCM_SPARSE=ON"]
+
+        match PLATFORM:
+            case "cuda":
+                cmake_args += ["-DRUNTIME_ENVIRONMENT=cuda"]
+            case "ascend":
+                cmake_args += ["-DRUNTIME_ENVIRONMENT=ascend"]
+            case "musa":
+                cmake_args += ["-DRUNTIME_ENVIRONMENT=musa"]
+            case "maca":
+                cmake_args += ["-DRUNTIME_ENVIRONMENT=maca"]
+                cmake_args += ["-DBUILD_UCM_SPARSE=OFF"]
+            case _:
+                cmake_args += ["-DRUNTIME_ENVIRONMENT=simu"]
+                cmake_args += ["-DBUILD_UCM_SPARSE=OFF"]
 
-        print(f"[INFO] Building {ext.name} module with CMake")
-        print(f"[INFO] Source directory: {ext.sourcedir}")
-        print(f"[INFO] Build directory: {build_dir}")
-        print(f"[INFO] CMake command: {' '.join(cmake_args)}")
-
-        subprocess.check_call(cmake_args, cwd=build_dir)
+        subprocess.check_call(
+            ["cmake", *cmake_args, ext.cmake_file_path], cwd=build_dir
+        )
         subprocess.check_call(
             ["cmake", "--build", ".", "--config", "Release", "--", "-j8"],
             cwd=build_dir,
         )
 
+        subprocess.check_call(
+            ["cmake", "--install", ".", "--config", "Release", "--component", "ucm"],
+            cwd=build_dir,
+        )
 
-def _get_packages():
-    """Discover Python packages, optionally filtering out sparse-related ones."""
-    packages = find_packages()
-    if not _enable_sparse():
-        packages = [pkg for pkg in packages if not pkg.startswith("ucm.sparse")]
-    return packages
-
-
-def _get_package_data_with_so(packages=None):
-    """Automatically discover all packages and include .so files."""
-    if packages is None:
-        packages = _get_packages()
-    package_data = {}
-
-    for package in packages:
-        # Convert package name to directory path
-        package_dir = os.path.join(ROOT_DIR, package.replace(".", os.sep))
-
-        # Check if this package directory contains .so files
-        so_files = glob(os.path.join(package_dir, "*.so"))
-        if so_files:
-            package_data[package] = ["*.so"]
-            print(f"[INFO] Including .so files for package: {package}")
-
-    print(f"[INFO] Package data: {package_data}")
-    return package_data
-
-
-ext_modules = []
-ext_modules.append(CMakeExtension(name="ucm", sourcedir=ROOT_DIR))
-
-packages = _get_packages()
 
 setup(
     name="uc-manager",
-    version="0.1.1",
+    version="0.1.2",
     description="Unified Cache Management",
     author="Unified Cache Team",
-    packages=packages,
+    packages=find_packages(),
     python_requires=">=3.10",
-    ext_modules=ext_modules,
+    ext_modules=[CMakeExtension(name="ucm", source_dir=ROOT_DIR)],
     cmdclass={"build_ext": CMakeBuild},
-    package_data=_get_package_data_with_so(packages),
     zip_safe=False,
+    include_package_data=False,
 )

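The rewritten `build_cmake` replaces the accelerator `if/elif` chain, which raised a `RuntimeError` on unknown platforms, with a `match` statement that falls back to a `simu` runtime with sparse support disabled; for `maca`, `-DBUILD_UCM_SPARSE=OFF` is appended even when `ENABLE_SPARSE=true`, and the later CMake definition takes precedence. A standalone sketch of that mapping, mirroring the diff (the helper name `runtime_flags` is hypothetical):

```python
# Standalone sketch mirroring the new flag selection in setup.py's CMakeBuild.
# The helper name is hypothetical; the flag values and ordering come from the diff.
import os


def runtime_flags(platform: str | None, enable_sparse: bool) -> list[str]:
    flags = ["-DBUILD_UCM_SPARSE=ON"] if enable_sparse else []
    match platform:
        case "cuda":
            flags += ["-DRUNTIME_ENVIRONMENT=cuda"]
        case "ascend":
            flags += ["-DRUNTIME_ENVIRONMENT=ascend"]
        case "musa":
            flags += ["-DRUNTIME_ENVIRONMENT=musa"]
        case "maca":
            # sparse is always disabled on maca; the later -D flag overrides the earlier one
            flags += ["-DRUNTIME_ENVIRONMENT=maca", "-DBUILD_UCM_SPARSE=OFF"]
        case _:
            # new default: simulation runtime instead of raising RuntimeError
            flags += ["-DRUNTIME_ENVIRONMENT=simu", "-DBUILD_UCM_SPARSE=OFF"]
    return flags


if __name__ == "__main__":
    print(runtime_flags(os.getenv("PLATFORM"), os.getenv("ENABLE_SPARSE", "").lower() == "true"))
```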