## README.md
For the detailed user manual, please refer to the documentation.
The `<path_to_dashinfer>/examples` directory contains examples for the C++ and Python interfaces; refer to the documentation in `<path_to_dashinfer>/documents/EN` to run them.

- [Basic Python Example](examples/python/0_basic/basic_example_qwen_v10_io.ipynb) ([open in PAI Gallery](https://gallery.pai-ml.com/#/import/https://github.com/modelscope/dash-infer/blob/main/examples/python/0_basic/basic_example_qwen_v10_io.ipynb))
- [Base GPU Python Example](examples/python/0_basic/cuda/demo_dashinfer_2_0_gpu_example.ipynb) ([open in Colab](https://colab.research.google.com/github/modelscope/dash-infer/blob/main/examples/python/0_basic/cuda/demo_dashinfer_2_0_gpu_example.ipynb))
- [Documentation for All Python Examples](docs/EN/examples_python.md)
- [Documentation for C++ Examples](docs/EN/examples_cpp.md)
Below is an example of how to quickly serialize a Hugging Face model and perform inference.
### Inference Python Example
This example uses the asynchronous interface to obtain output, with bfloat16 precision, in-memory model serialization, and asynchronous output processing. The model is downloaded from ModelScope. Initiating requests and receiving outputs are both asynchronous and can be handled according to your application's needs.
19
19
20
-
```py
21
-
import os
22
-
import modelscope
23
-
from modelscope.utils.constant importDEFAULT_MODEL_REVISION
24
-
25
-
from dashinfer import allspark
26
-
from dashinfer.allspark import*
27
-
from dashinfer.allspark.engine import*
28
-
from dashinfer.allspark.prompt_utils import PromptTemplate
29
-
30
-
# if use in memory serialize, change this flag to True
In this example, the `HuggingFaceModel` class (`dashinfer.allspark.model_loader.HuggingFaceModel`) is used to download and convert the Hugging Face model.
If you want to convert only once, pass `skip_if_exists=True`; if existing files are found, the model conversion step will be skipped. The converted model resides in the `{output_base_folder}` directory as two files: `{safe_model_name}.asparam` and `{safe_model_name}.asmodel`. The `free_model()` function releases the Hugging Face model files to save memory.
In this code section, inference is conducted using a single CUDA card.

If using in-memory serialization, you can release the memory file after `install_model`, since it is no longer needed.

```python
if in_memory:
    model_loader.free_memory_serialize_file()
```
Upon calling `start_model`, the engine performs a warm-up step that simulates a run with the maximum length and maximum batch size set in the runtime parameters, so that no new resources need to be requested during subsequent runs, ensuring stability. If the warm-up fails, reduce the length settings in the runtime configuration to lower resource demand. After the warm-up completes, the engine is ready to accept requests.

```python
# change generation config based on this generation config, e.g. set top_k = 1
gen_cfg.update({"top_k": 1})
gen_cfg.update({"repetition_penalty": 1.1})
```
This code takes the recommended generation parameters from Hugging Face's `generation_config.json` and makes optional modifications. It then asynchronously initiates model inference, where `status` indicates whether the API call succeeded. On success, `handle` and `queue` are used for subsequent operations: `handle` is the request handle, and `queue` is the output queue. Each request has its own output queue, which continuously accumulates generated tokens and is only released after `release_request` is invoked.
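
The handle/queue lifecycle described here can be mimicked with a small pure-Python mock. This illustrates the contract only; `MockEngine`, `MockOutputQueue`, and their methods are hypothetical, not the DashInfer API:

```python
from collections import deque

class MockOutputQueue:
    """Per-request queue that accumulates generated tokens until released."""
    def __init__(self):
        self._batches = deque()

    def put(self, ids):
        self._batches.append(ids)

    def Get(self):
        return self._batches.popleft() if self._batches else None

class MockEngine:
    def __init__(self):
        self._queues = {}
        self._next = 0

    def start_request(self, input_ids):
        # mirrors the (status, handle, queue) triple described in the text
        handle, self._next = self._next, self._next + 1
        queue = MockOutputQueue()
        self._queues[handle] = queue
        return True, handle, queue

    def release_request(self, handle):
        # the per-request queue is only freed here
        del self._queues[handle]
```

The key point the mock captures: every request owns its own output queue, and that queue lives until `release_request` is called for its handle.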
DashInfer prioritizes asynchronous APIs for optimal performance and to match the inherent nature of LLM generation; sending and receiving requests is primarily designed to be asynchronous. However, for users accustomed to synchronous calls, we provide `engine.sync_request()`, which blocks until the generation request completes.
##### 5.1 Asynchronous Processing
Asynchronous processing differs in that it requires repeatedly polling the queue until the status changes to `GenerateRequestStatus.GenerateFinished`. A normal state machine transition goes:

`Init` (initial state) -> `ContextFinished` (prefill completed and first token generated) -> `Generating` (in progress) -> `GenerateFinished` (completed).
During this normal state transition, an exceptional state can occur: `GenerateInterrupted`, which indicates resource shortages, causing the request to pause while its resources are temporarily released for others. This often happens under heavy loads.

```python
generated_ids = []
while True:
    elements = queue.Get()
    if elements:
        generated_ids += elements.ids_from_generate
    status = queue.GenerateStatus()
    if status in [GenerateRequestStatus.GenerateFinished, GenerateRequestStatus.GenerateInterrupted]:
        break
```
##### 5.2 Synchronous Processing
The subsequent call to `sync_request` blocks until generation finishes, simulating a synchronous call. Without this invocation, operations on the queue still work but require polling. The following code synchronously fetches all currently generated IDs from the queue, blocking if some IDs are yet to be generated, until generation completes or an error occurs.
Synchronous processing is not shown in this example code; you can modify the example following the code above.
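
The drain-after-completion pattern a synchronous variant would use can be sketched with a stand-in queue. Note the simplification: in the real API each `Get()` result carries tokens in `ids_from_generate`, whereas `FakeQueue` here returns token lists directly, and `drain_generated_ids` is a hypothetical helper:

```python
class FakeQueue:
    """Stand-in for a finished request's output queue; Get() returns None when empty."""
    def __init__(self, batches):
        self._batches = list(batches)

    def Get(self):
        return self._batches.pop(0) if self._batches else None

def drain_generated_ids(queue):
    """After a blocking call such as sync_request, collect every generated id."""
    generated_ids = []
    while True:
        elements = queue.Get()
        if elements is None:
            break
        generated_ids += elements  # real code would read elements.ids_from_generate
    return generated_ids
```

Because generation has already finished when the blocking call returns, the drain loop never waits; it simply empties whatever the queue accumulated.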