</div>
## News
- [2024/12] 🔥 DashInfer: Announcing the release of v2.0, now with enhanced GPU (CUDA) support! This version includes features such as prefix caching (with GPU & CPU swapping), guided decoding, optimized attention for GQA, a lockless reactor engine, and newly added support for VLM models (Qwen-VL) and MoE models. For more details, please refer to the [release notes](https://dashinfer.readthedocs.io/en/latest/index.html#v2-0-0).
- [2024/06] DashInfer: v1.0 release with support for x86 & ARMv9 CPUs and CPU flash attention.
# Introduction
With its runtime written in C++, DashInfer aims to deliver production-level implementations highly optimized for various hardware architectures, including CUDA, x86, and ARMv9.
DashInfer is a highly optimized LLM inference engine with the following core features:
- **ARMv9 CPU**: Hardware support for the SVE instruction set is required. DashInfer supports ARMv9-architecture processors such as the Yitian710, which corresponds to Aliyun's 8th-generation ECS instances (e.g. g8y), and uses SVE instructions to accelerate computation.
DashInfer provides a variety of quantization technologies for LLM weights, such as int{8,4} weight-only quantization and int8 activation quantization, along with many customized fused kernels to deliver the best performance on the target device.
To put it simply, models fine-tuned with GPTQ will provide better accuracy but require a fine-tuning step, while the IQ quantization technique, which does not require fine-tuning, can offer a faster deployment experience. Detailed explanations of IQ quantization can be found at the end of this article.
In terms of supported quantization algorithms, DashInfer supports models fine-tuned with GPTQ and dynamic quantization
using the IQ quantization technique in two ways:
- **InstantQuant (IQ)**: DashInfer provides the InstantQuant (IQ) dynamic quantization technique, which does not require fine-tuning and can offer a faster deployment experience. Detailed explanations of IQ quantization can be found at the end of this article.
- **GPTQ**: Models fine-tuned with GPTQ provide better accuracy, but require a fine-tuning step.
The quantization strategies introduced here can be broadly divided into two categories:
In terms of quantization granularity, there are two types:
- **Per-Channel**: DashInfer's quantization techniques adopt at least Per-Channel (also known as Per-Token) quantization granularity, and some also provide Sub-Channel granularity. Generally speaking, Per-Channel quantization meets most accuracy requirements thanks to its simple implementation and optimal performance. Only when Per-Channel accuracy is insufficient should the Sub-Channel strategy be considered.
- **Sub-Channel**: Compared to Per-Channel quantization, Sub-Channel quantization divides a channel into N groups and computes quantization parameters within each group. This granularity typically provides better accuracy, but the increased implementation complexity comes with limitations: performance may be slightly slower than Per-Channel quantization, and activation quantization is difficult to implement at Sub-Channel granularity due to constraints in the computation formula (DashInfer's activation quantization is all Per-Channel).
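
To make the two granularities concrete, here is a small NumPy sketch (illustrative only, not DashInfer code) that computes symmetric int8 scales once per output channel and once per group within a channel; the group size of 128 is an assumed value for illustration:

```python
# Illustrative sketch of per-channel vs. sub-channel (grouped) int8 weight
# quantization. This mirrors the idea described above; it is NOT DashInfer code.
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """One symmetric int8 scale per output channel (row)."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0            # [out, 1]
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def quantize_sub_channel(w: np.ndarray, group_size: int = 128):
    """Split each channel into groups of `group_size`, one scale per group."""
    out, cin = w.shape
    w_groups = w.reshape(out, cin // group_size, group_size)
    scales = np.abs(w_groups).max(axis=2, keepdims=True) / 127.0     # [out, groups, 1]
    q = np.clip(np.round(w_groups / scales), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(8, 256).astype(np.float32)

q_pc, s_pc = quantize_per_channel(w)
q_sc, s_sc = quantize_sub_channel(w, group_size=128)

# Sub-channel scales track local value ranges, so dequantization error is usually lower.
deq_pc = q_pc * s_pc
deq_sc = (q_sc * s_sc).reshape(w.shape)
print(f"per-channel MAE: {np.abs(w - deq_pc).mean():.5f}")
print(f"sub-channel MAE: {np.abs(w - deq_sc).mean():.5f}")
```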
# Documentation and Example Code
## Documentation
For the detailed user manual, please refer to the documentation: [Documentation Link](https://dashinfer.readthedocs.io/en/latest/).
### Quick Start:
1. Using the API: [Python Quick Start](https://dashinfer.readthedocs.io/en/latest/get_started/quick_start_api_py_en.html)
2. LLM OpenAI Server: [Quick Start Guide for OpenAI API Server](https://dashinfer.readthedocs.io/en/latest/get_started/quick_start_api_server_en.html)
3. VLM OpenAI Server: [VLM Support](https://dashinfer.readthedocs.io/en/latest/vlm/vlm_offline_inference_en.html)

There are examples for the C++ and Python interfaces in `<path_to_dashinfer>/examples`; please refer to the documentation in `<path_to_dashinfer>/documents/EN` to run them.
The [multimodal](multimodal/) folder contains the VLM support toolkit for Vision Language Model (VLM) inference based on the DashInfer engine. It is compatible with the OpenAI Chat Completion API and supports text and image/video inputs.
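
As a hedged illustration of that compatibility, the snippet below sends a chat request containing an image URL through the standard `openai` Python client. The base URL, API key, model name, and image URL are placeholder assumptions for a locally running DashInfer OpenAI-compatible server; take the actual launch parameters from the quick-start guides above.

```python
# Minimal sketch: querying a DashInfer OpenAI-compatible server with the
# standard `openai` client. base_url, api_key, model, and the image URL are
# placeholders; substitute the values used when the server was launched.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen-vl",  # hypothetical model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```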
## Performance
We have conducted several benchmarks to compare the performance of mainstream LLM inference engines.
### Multi-Modal Models (VLMs)
We compared the performance of Qwen-VL running on DashInfer and vLLM across various model sizes:
Benchmarks were conducted using an A100-80Gx1 for 2B and 7B sizes, and an A100-80Gx4 for the 72B model. For more details, please refer to the [benchmark documentation](https://github.com/modelscope/dash-infer/blob/main/multimodal/tests/README.md).
### Prefix Cache
We evaluated the performance of the prefix cache at different cache hit rates:
### Guided Decoding

We compared guided output (in JSON format) across different engines, using the same request with a customized JSON schema (Context Length: 45, Generated Length: 63):
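
For reference, here is a hedged sketch of what such a schema-guided request could look like through an OpenAI-style client. Whether DashInfer's server consumes the schema via the OpenAI `response_format` field is an assumption here, and the schema, model name, and server address are illustrative; consult the guided decoding documentation for the exact parameters.

```python
# Hedged sketch of a JSON-schema-guided request. The response_format field,
# model name, and base_url are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="qwen",  # hypothetical model name
    messages=[{"role": "user", "content": "Extract the person mentioned: Alice is 30."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": person_schema},
    },
)
print(response.choices[0].message.content)  # expected to be JSON conforming to the schema
```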