
Commit 96de58b

docs: add benchmark on readme.
1 parent cb4287f commit 96de58b

9 files changed, +156 -27 lines

README.md

Lines changed: 82 additions & 9 deletions
@@ -13,6 +13,13 @@
</div>

## News

- [2024/12] 🔥 DashInfer: Announcing the release of v2.0, now with enhanced GPU (CUDA) support! This version adds prefix caching (with GPU & CPU swapping), guided decoding, optimized attention for GQA, a lockless reactor engine, and newly added support for the VLM model (Qwen-VL) and MoE models. For more details, please refer to the [release notes](https://dashinfer.readthedocs.io/en/latest/index.html#v2-0-0).
- [2024/06] DashInfer: v1.0 release with x86 & ARMv9 CPU support and CPU flash attention.
# Introduction
With a runtime written in C++, DashInfer aims to deliver production-level implementations highly optimized for various hardware architectures, including CUDA, x86, and ARMv9.
@@ -55,9 +62,9 @@ DashInfer is a highly optimized LLM inference engine with the following core fea
- **ARMv9 CPU**: Hardware support for the SVE instruction set is required. DashInfer supports ARMv9 processors such as Yitian710, corresponding to Aliyun's 8th-generation ECS instances (e.g. g8y), and uses SVE instructions to accelerate computation.

## Data Types
-- **CUDA GPUs**: FP16, BF16, FP32, Int8(InstantQuant), Int4(InstantQuant)
- **CUDA GPUs**: FP16, BF16, FP8, FP32, Int8(InstantQuant), Int4(InstantQuant)
- **x86 CPU**: FP32, BF16
-- **ARM Yitian710 CPU**: FP32, BF16, InstantQuant
- **ARM Yitian710 CPU**: FP32, BF16, Int8(InstantQuant)

### Quantization
DashInfer provides a variety of quantization techniques for LLM weights, such as int{8,4} weight-only quantization and int8 activation quantization, along with many customized fused kernels to deliver the best performance on the target device.
@@ -66,10 +73,10 @@ To put it simply, models fine-tuned with GPTQ will provide better accuracy, but
which does not require fine-tuning, can offer a faster deployment experience.
Detailed explanations of IQ quantization can be found at the end of this article.

-In terms of supported quantization algorithms, AllSpark supports models fine-tuned with GPTQ and dynamic quantization
In terms of supported quantization algorithms, DashInfer supports models fine-tuned with GPTQ and dynamic quantization
using the IQ quantization technique in two ways:

-- **IntantQuant(IQ)**: AllSpark provides the InstantQuant (IQ) dynamic quantization technique, which does not require fine-tuning and can offer a faster deployment experience. Detailed explanations of IQ quantization can be found at the end of this article.
- **InstantQuant (IQ)**: DashInfer provides the InstantQuant (IQ) dynamic quantization technique, which does not require fine-tuning and can offer a faster deployment experience. Detailed explanations of IQ quantization can be found at the end of this article.
- **GPTQ**: Models fine-tuned with GPTQ will provide better accuracy, but it requires a fine-tuning step.

The quantization strategies introduced here can be broadly divided into two categories:
@@ -82,10 +89,35 @@ The quantization strategies introduced here can be broadly divided into two cate

In terms of quantization granularity, there are two types:

-- **Per-Channel**: AllSpark's quantization techniques at least adopt the Per-Channel (also known as Per-Token) quantization granularity, and some also provide Sub-Channel quantization granularity. Generally speaking, Per-Channel quantization can meet most accuracy requirements due to its simple implementation and optimal performance. Only when the accuracy of Per-Channel quantization is insufficient should the Sub-Channel quantization strategy be considered.
-- **Sub-Channel**: Compared to Per-Channel quantization, Sub-Channel refers to dividing a channel into N groups, and calculating quantization parameters within each group. This quantization granularity typically provides better accuracy, but due to increased implementation complexity, it comes with many limitations. For example, performance may be slightly slower than Per-Channel quantization, and Activation quantization is difficult to implement Sub-Channel quantization due to computational formula constraints (AllSpark's Activation quantization is all Per-Channel).
- **Per-Channel**: DashInfer's quantization techniques adopt at least the Per-Channel (also known as Per-Token) quantization granularity, and some also provide Sub-Channel granularity. Generally speaking, Per-Channel quantization meets most accuracy requirements thanks to its simple implementation and optimal performance. Only when Per-Channel accuracy is insufficient should the Sub-Channel strategy be considered.
- **Sub-Channel**: Compared to Per-Channel quantization, Sub-Channel divides a channel into N groups and calculates quantization parameters within each group. This granularity typically provides better accuracy, but the increased implementation complexity brings many limitations: for example, performance may be slightly slower than Per-Channel quantization, and activation quantization is difficult to implement at Sub-Channel granularity due to constraints of the computation formula (DashInfer's activation quantization is all Per-Channel). An illustrative sketch of Per-Channel weight quantization follows this list.
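To make the granularity discussion concrete, here is a small illustrative sketch of Per-Channel, symmetric int8 weight-only quantization in NumPy. It is not DashInfer's kernel code; the shapes and scaling scheme are assumptions chosen only to show the idea of one scale per output channel.

```python
# Illustrative Per-Channel (one scale per output channel) symmetric int8
# weight-only quantization. NOT DashInfer's implementation; explanation only.
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """w: [out_channels, in_features]; returns int8 weights plus one fp32 scale per channel."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0         # per-row (per-channel) scale
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Weight-only quantization: weights are stored as int8 but expanded back to
    # floating point before the actual matmul.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, scales = quantize_per_channel_int8(w)
print("max abs reconstruction error:", float(np.abs(w - dequantize(q, scales)).max()))
```

A Sub-Channel variant would instead split each row into groups of, say, 64 or 128 columns and keep one scale per group, trading extra metadata and kernel complexity for accuracy.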
# Documentation and Example Code

## Documentation

For the detailed user manual, please refer to the documentation: [Documentation Link](https://dashinfer.readthedocs.io/en/latest/).

### Quick Start:

-# Examples
1. Using the Python API: [Python Quick Start](https://dashinfer.readthedocs.io/en/latest/get_started/quick_start_api_py_en.html)
2. LLM OpenAI Server: [Quick Start Guide for OpenAI API Server](https://dashinfer.readthedocs.io/en/latest/get_started/quick_start_api_server_en.html) (a minimal client sketch follows this list)
3. VLM OpenAI Server: [VLM Support](https://dashinfer.readthedocs.io/en/latest/vlm/vlm_offline_inference_en.html)
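As a minimal sketch of item 2 above: once a DashInfer OpenAI-compatible server is running, any standard OpenAI client can talk to it. The base URL, port, and model name below are placeholders rather than values taken from this README; see the linked quick-start guide for the actual launch command and parameters.

```python
# Illustrative sketch: query a locally running DashInfer OpenAI-compatible server.
# base_url, port, and model are placeholders; adjust to how the server was launched.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen2-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Briefly introduce DashInfer."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```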
### Feature Introduction:

1. [Prefix Cache](https://dashinfer.readthedocs.io/en/latest/llm/prefix_caching.html)
2. [Guided Decoding](https://dashinfer.readthedocs.io/en/latest/llm/guided_decoding.html)
3. [Engine Config](https://dashinfer.readthedocs.io/en/latest/llm/runtime_config.html)

### Development:

1. [Development Guide](https://dashinfer.readthedocs.io/en/latest/devel/source_code_build_en.html#)
2. [Build From Source](https://dashinfer.readthedocs.io/en/latest/devel/source_code_build_en.html#build-from-source-code)
3. [OP Profiling](https://dashinfer.readthedocs.io/en/latest/devel/source_code_build_en.html#profiling)
4. [Environment Variable](https://dashinfer.readthedocs.io/en/latest/get_started/env_var_options_en.html)
## Code Examples

In `<path_to_dashinfer>/examples` there are examples for the C++ and Python interfaces; please refer to the documentation in `<path_to_dashinfer>/documents/EN` to run the examples.

@@ -97,6 +129,44 @@ In `<path_to_dashinfer>/examples` there are examples for C++ and Python interfac

VLM support lives in the [multimodal](multimodal/) folder: a toolkit for Vision Language Model (VLM) inference built on the DashInfer engine. It is compatible with the OpenAI Chat Completion API and supports text and image/video inputs.
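As an illustration of that OpenAI-compatible multimodal interface, the sketch below sends a text-plus-image request with the standard OpenAI Python client. The endpoint, model name, and image URL are placeholders rather than values documented here; see the [multimodal](multimodal/) README for the request formats actually supported.

```python
# Illustrative sketch: text + image request to a DashInfer VLM server running
# in OpenAI-compatible mode. base_url, model, and the image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",  # placeholder VLM model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```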
## Performance

We have conducted several benchmarks to compare the performance of mainstream LLM inference engines.

### Multi-Modal Models (VLMs)

We compared the performance of DashInfer and vLLM on Qwen-VL across various model sizes:

![img_1.png](docs/resources/image/dashinfer-benchmark-vl.png)

Benchmarks were conducted on a single A100-80G for the 2B and 7B models, and on 4x A100-80G for the 72B model. For more details, please refer to the [benchmark documentation](https://github.com/modelscope/dash-infer/blob/main/multimodal/tests/README.md).

### Prefix Cache

We evaluated the performance of the prefix cache at different cache hit rates:

![dahsinfer-benchmark-prefix-cache.png](docs/resources/image/dahsinfer-benchmark-prefix-cache.png)

The chart above shows the reduction in TTFT (Time To First Token) at varying prefix-cache hit rates in DashInfer.

![dashinfer-prefix-effect.png](docs/resources/image/dashinfer-prefix-effect.png)

**Test Setup:**
- **Model:** Qwen2-72B-Instruct
- **GPU:** 4x A100
- **Runs:** 20
- **Batch Size:** 1
- **Input Tokens:** 4000
- **Output Tokens:** 1
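One back-of-the-envelope way to read the charts above: with a prefix cache, only the uncached portion of the prompt has to be prefilled, so TTFT shrinks roughly in proportion to the hit rate. The sketch below is an illustrative model only, assuming prefill time is roughly linear in the number of uncached tokens and using a made-up per-token cost; it is not DashInfer's measured behavior.

```python
# Illustrative TTFT estimate: assumes prefill cost is roughly linear in the
# number of prompt tokens NOT already covered by the prefix cache.
def estimated_ttft_ms(prompt_tokens: int, hit_rate: float,
                      prefill_ms_per_token: float, overhead_ms: float = 0.0) -> float:
    uncached_tokens = prompt_tokens * (1.0 - hit_rate)
    return overhead_ms + uncached_tokens * prefill_ms_per_token

# Example: a 4000-token prompt with a hypothetical 0.25 ms/token prefill cost.
for hit in (0.0, 0.5, 0.9):
    print(f"hit rate {hit:.0%}: ~{estimated_ttft_ms(4000, hit, 0.25):.0f} ms TTFT")
```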
### Guided Decoding (JSON Mode)

We compared guided output (in JSON format) across engines using the same request with a customized JSON schema (context length: 45, generated length: 63):

![dashinfer-benchmark-json-mode.png](docs/resources/image/dashinfer-benchmark-json-mode.png)
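For readers who want to try JSON-constrained output themselves, the sketch below requests a JSON object through an OpenAI-compatible endpoint. The `response_format` field, base URL, and model name are assumptions rather than API details documented in this README; see the Guided Decoding documentation linked above for the request format DashInfer actually supports.

```python
# Illustrative sketch: ask for JSON-constrained output via an OpenAI-compatible
# endpoint. response_format, base_url, and model are assumptions, not documented here.
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-7B-Instruct",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Reply only with a JSON object containing the keys 'city' and 'weather'."},
        {"role": "user", "content": "What's the weather like in Hangzhou?"},
    ],
    response_format={"type": "json_object"},  # assumed to be supported
)
print(json.loads(response.choices[0].message.content))
```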
# Future Plans
- [x] GPU Support
- [x] Multi Modal Model support
@@ -107,8 +177,11 @@ The VLM Support in [multimodal](multimodal/) folder, it's a toolkit to support V
- [x] Support MoE architecture
- [x] Guided output: JSON Mode
- [x] Prefix Cache: Support GPU Prefix Cache and CPU Swap
-- [ ] Quantization: Fp8 support on CUDA.
-- [ ] LORA: Continues Batch LORA Optimization.
- [x] Quantization: FP8 (A8W8) activation quantization support on CUDA.
- [x] LoRA: continuous-batching LoRA optimization.
- [ ] Parallelize the context (prefill) phase and the generation phase within the engine.
- [ ] More efficient MoE operators on GPU.
- [ ] Porting to the AMD (ROCm) platform.

# License

README_CN.md

Lines changed: 73 additions & 17 deletions
@@ -52,43 +52,99 @@ DashInfer is a highly optimized LLM inference engine with the following core features
DashInfer provides a variety of quantization techniques for LLM weights, such as int{8,4} weight-only quantization and int8 activation quantization, along with many customized fused kernels to deliver the best performance on the target device. In short, models fine-tuned with GPTQ provide better accuracy, while our InstantQuant (IQ) technique, which requires no fine-tuning, offers a faster deployment experience. A detailed explanation of IQ quantization can be found at the end of this document.

In terms of supported quantization algorithms, AllSpark supports models fine-tuned with GPTQ and dynamic quantization using the IQ technique, in two ways:
-- **InstantQuant (IQ)**: AllSpark provides the InstantQuant (IQ) dynamic quantization technique, which requires no fine-tuning and offers a faster deployment experience. A detailed explanation of IQ quantization can be found at the end of this document.
- **InstantQuant (IQ)**: DashInfer provides the InstantQuant (IQ) dynamic quantization technique, which requires no fine-tuning and offers a faster deployment experience. A detailed explanation of IQ quantization can be found at the end of this document.
- **GPTQ**: Models fine-tuned with GPTQ provide better accuracy, but they require a fine-tuning step.

The quantization strategies introduced here fall broadly into two categories:
- **Weight-only quantization**: This technique only quantizes and compresses the weights (e.g. storing them in int8), but computation is still performed in bf16/fp16. It only reduces memory-access requirements and brings no computational speedup over BF16.
- **Activation quantization**: This technique not only stores the weights in int8 but also performs low-precision (e.g. int8) computation. Since only int8 Tensor Cores on Nvidia GPUs easily preserve accuracy, this technique both reduces memory access and improves computational performance, making it an ideal quantization method. Accuracy may drop slightly compared with weight-only quantization, so accuracy testing on business data is required.

In terms of quantization granularity, there are two types:
-- **Per-Channel**: AllSpark's quantization techniques adopt at least Per-Channel (also known as Per-Token) granularity, and some also provide Sub-Channel granularity. Generally, Per-Channel quantization meets most accuracy requirements thanks to its simple implementation and optimal performance; only when its accuracy is insufficient should the Sub-Channel strategy be considered.
-- **Sub-Channel**: Compared with Per-Channel quantization, Sub-Channel quantization divides a channel into N groups and computes quantization parameters within each group. This granularity usually provides better accuracy, but the increased implementation complexity brings many limitations, e.g. performance may be slightly slower than Per-Channel, and activation quantization is hard to implement at Sub-Channel granularity due to constraints of the computation formula (AllSpark's activation quantization is all Per-Channel).
- **Per-Channel**: DashInfer's quantization techniques adopt at least Per-Channel (also known as Per-Token) granularity, and some also provide Sub-Channel granularity. Generally, Per-Channel quantization meets most accuracy requirements thanks to its simple implementation and optimal performance; only when its accuracy is insufficient should the Sub-Channel strategy be considered.
- **Sub-Channel**: Compared with Per-Channel quantization, Sub-Channel quantization divides a channel into N groups and computes quantization parameters within each group. This granularity usually provides better accuracy, but the increased implementation complexity brings many limitations, e.g. performance may be slightly slower than Per-Channel, and activation quantization is hard to implement at Sub-Channel granularity due to constraints of the computation formula (DashInfer's activation quantization is all Per-Channel).

-# Example Code
# Dependencies
1. Python: the DashInfer Python package currently only depends on PyTorch and Hugging Face libraries (used to load safetensors model weights), but because runtime conversion calls the HF interface to load model weights, each model may bring its own additional dependencies.
2. C++: the C++ package currently links all third-party dependencies statically and hides their symbols, so it has no runtime dependency on any third-party library.

# Documentation and Example Code

## Documentation

For the detailed user manual, please refer to the documentation: [Documentation Link](https://dashinfer.readthedocs.io/en/latest/)

### Quick Start:

1. Using the Python API: [Python Quick Start](https://dashinfer.readthedocs.io/en/latest/get_started/quick_start_api_py_en.html)
2. LLM OpenAI Server: [Quick Start Guide for OpenAI API Server](https://dashinfer.readthedocs.io/en/latest/get_started/quick_start_api_server_en.html)
3. VLM OpenAI Server: [VLM Support](https://dashinfer.readthedocs.io/en/latest/vlm/vlm_offline_inference_en.html)

### Feature Introduction:

1. [Prefix Cache](https://dashinfer.readthedocs.io/en/latest/llm/prefix_caching.html)
2. [Guided Decoding](https://dashinfer.readthedocs.io/en/latest/llm/guided_decoding.html)
3. [Engine Config](https://dashinfer.readthedocs.io/en/latest/llm/runtime_config.html)

### Development:

1. [Development Guide](https://dashinfer.readthedocs.io/en/latest/devel/source_code_build_en.html#)
2. [Build From Source](https://dashinfer.readthedocs.io/en/latest/devel/source_code_build_en.html#build-from-source-code)
3. [OP Profiling](https://dashinfer.readthedocs.io/en/latest/devel/source_code_build_en.html#profiling)
4. [Environment Variable](https://dashinfer.readthedocs.io/en/latest/get_started/env_var_options_en.html)

## Code Examples
The `<path_to_dashinfer>/examples` directory provides examples of calling the C++ and Python interfaces; please refer to the documents under `<path_to_dashinfer>/documents/CN` to run them.

-- [Basic Python example](examples/python/0_basic/basic_example_qwen_v10_io.ipynb) [![Open In PAI-DSW](https://modelscope.oss-cn-beijing.aliyuncs.com/resource/Open-in-DSW20px.svg)](https://gallery.pai-ml.com/#/import/https://github.com/modelscope/dash-infer/blob/main/examples/python/0_basic/basic_example_qwen_v10_io.ipynb)
- [All Python example docs](docs/CN/examples_python.md)
- [C++ example docs](docs/CN/examples_cpp.md)
- [Python Benchmark](https://github.com/modelscope/dash-infer/tree/main/examples/benchmark)
## Multi-Modal Model Support

The [multimodal](multimodal/) directory contains a multimodal model inference toolkit built on DashInfer, compatible with the OpenAI Chat Completion API and supporting text, image, and video inputs.

-# Future Plans
# Performance
We have run a series of benchmarks to compare the performance of mainstream LLM inference engines.

### Multi-Modal Models (VLMs)

We compared the performance of DashInfer and vLLM on Qwen-VL across various model sizes:

![img_1.png](docs/resources/image/dashinfer-benchmark-vl.png)

The benchmarks used a single A100-80G for the 2B and 7B models and 4x A100-80G for the 72B model. For more details, please refer to the [benchmark documentation](https://github.com/modelscope/dash-infer/blob/main/multimodal/tests/README.md)

### Prefix Cache

We evaluated prefix-cache performance at different cache hit rates:

![dahsinfer-benchmark-prefix-cache.png](docs/resources/image/dahsinfer-benchmark-prefix-cache.png)

The chart above shows how TTFT (Time To First Token) in DashInfer decreases at varying prefix-cache hit rates.

![dashinfer-prefix-effect.png](docs/resources/image/dashinfer-prefix-effect.png)

**Test setup:**
- **Model:** Qwen2-72B-Instruct
- **GPU:** 4x A100
- **Runs:** 20
- **Batch size:** 1
- **Input tokens:** 4000
- **Output tokens:** 1

### Guided Decoding

Using the same request with a customized JSON schema (1x A100, 7B Qwen, context length: 45, generated length: 63), we compared guided-decoding (JSON mode) performance across engines; the figures in the chart are end-to-end response times (RT):

![dashinfer-benchmark-json-mode.png](docs/resources/image/dashinfer-benchmark-json-mode.png)

-- [x] GPU support
-- [x] Multi-modal model support
-- [x] Accelerate attention with Flash-Attention
-- [x] Extend context length beyond 32k
-- [x] Support 4-bit quantization
-- [x] Support quantized models fine-tuned with GPTQ
-- [x] Support MoE architecture
-- [x] Guided output: JSON mode
-- [x] Prefix cache: support GPU prefix cache and CPU swap
-- [ ] Quantization: FP8 support on CUDA
-- [ ] LoRA: continuous-batching LoRA optimization

# Sub-projects
1. [HIE-DNN](https://github.com/modelscope/dash-infer/tree/main/HIE-DNN): the compute library used by DashInfer.
2. [Span Attention](https://github.com/modelscope/dash-infer/tree/main/span-attention): the GPU PageAttention implementation for DashInfer's GPU backend.

# License

6 image files changed (binary, not shown): 379 KB, 194 KB, 426 KB, 529 KB, 519 KB, 122 KB

docs/sphinx/llm/llm_offline_inference_en.rst

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
=====================================
-aaa Offline Inference with Python API
Offline Inference with Python API
=====================================

We have presented a quick start example of LLM inference with Python API in
