This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.
@@ -17,7 +18,7 @@ If you are using the Vector cluster environment, and you don't need any customization:

```bash
pip install vec-inf
```

-Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package
+Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
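If you go the container route, the workflow is roughly as follows. This is a sketch only; the image tag, GPU flag, and mount path are illustrative and not prescribed by the repository:

```bash
# Build the image from the repository root (tag is a placeholder)
docker build -t vec-inf:latest .

# Start an interactive shell with GPU access; the weights mount is illustrative
docker run --gpus all --rm -it -v /model-weights:/model-weights vec-inf:latest bash
```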
## Usage
@@ -85,7 +86,7 @@ models:
  vllm_args:
    --max-model-len: 1010000
    --max-num-seqs: 256
-    --compilation-confi: 3
+    --compilation-config: 3
```
You would then set the `VEC_INF_CONFIG` path using:
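For example, a minimal sketch (the config path and model name below are placeholders, not values shipped with the repository):

```bash
# Point vec-inf at your custom model config (path is a placeholder)
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml

# Launch a model defined in that config (model name is a placeholder)
vec-inf launch Qwen2.5-7B-Instruct
```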
@@ -94,7 +95,11 @@ You would then set the `VEC_INF_CONFIG` path using:
-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](vec_inf/client/config.py) for details.
+**NOTE**
+* There are other parameters that can also be added to the config but are not shown in this example; check the [`ModelConfig`](vec_inf/client/config.py) for details.
+* Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments. The parallel size for any parallelization defaults to 1, so none of the sizes are set explicitly in this example.
+* For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that have BF16 as the default dtype, use FP16 on these GPUs instead, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.
#### Other commands
@@ -161,8 +166,9 @@ Once the inference server is ready, you can start sending in inference requests.
    },
    "prompt_logprobs":null
}
```

-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. Mistral family. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.
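For a model that does follow the standard chat template, a request sketch looks like this (the host, port, and model name are placeholders; use the base URL reported for your own server):

```bash
curl -s http://gpu001:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of Canada?"}]
      }'
```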
## SSH tunnel from your local device
If you want to run inference from your local device, you can open an SSH tunnel to your cluster environment like the following:
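A sketch, assuming the server runs on a node called `gpu001` on port `8080` and `cluster.example.com` is your login host (all placeholders; use the values reported for your own server):

```bash
# Forward local port 8080 through the login node to the GPU node running the server
ssh -N -L 8080:gpu001:8080 username@cluster.example.com
```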
`docs/index.md` (1 addition & 1 deletion):
@@ -10,4 +10,4 @@ If you are using the Vector cluster environment, and you don't need any customization:
pip install vec-inf
```

-Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package.
+Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/config.py) for details.
+**NOTE**
+* There are other parameters that can also be added to the config but are not shown in this example; check the [`ModelConfig`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/config.py) for details.
+* Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments. The parallel size for any parallelization defaults to 1, so none of the sizes are set explicitly in this example.
+* For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that have BF16 as the default dtype, use FP16 on these GPUs instead, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.
### `status` command
@@ -254,8 +258,7 @@ Once the inference server is ready, you can start sending in inference requests.
}
```

-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. Mistral family. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.
`examples/README.md` (1 addition & 0 deletions):
@@ -9,3 +9,4 @@
- [`logits.py`](logits/logits.py): Python example of getting logits from hosted model.
- [`api`](api): Examples for using the Python API
- [`basic_usage.py`](api/basic_usage.py): Basic Python example demonstrating the Vector Inference API
+- [`slurm_dependency`](slurm_dependency): Example of launching a model with `vec-inf` and running a downstream SLURM job that waits for the server to be ready before sending a request.

`examples/slurm_dependency/README.md`:
This example demonstrates how to launch a model server using `vec-inf`, and run a downstream SLURM job that waits for the server to become ready before querying it.
## Files
This directory contains the following:
1. [`run_workflow.sh`](run_workflow.sh)
   Launches the model server and submits the downstream job with a dependency, so it starts only after the server job begins running (see the sketch below).

2. [`downstream_job.sbatch`](downstream_job.sbatch)
   A SLURM job script that runs the downstream logic (e.g., prompting the model).

3. [`run_downstream.py`](run_downstream.py)
   A Python script that waits until the inference server is ready, then sends a request using the OpenAI-compatible API.
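A hedged sketch of how these pieces fit together; the model name, job id, and endpoint below are placeholders, and the repository's `run_downstream.py` does the waiting and querying in Python rather than shell:

```bash
# --- run_workflow.sh, roughly --------------------------------------------
# Launch the inference server (model name is a placeholder)
vec-inf launch Meta-Llama-3.1-8B-Instruct
SERVER_JOB_ID=12345678   # placeholder: use the SLURM job id reported by launch

# Chain the downstream job so it starts only once the server job is running
sbatch --dependency=after:"${SERVER_JOB_ID}" downstream_job.sbatch

# --- inside the downstream job, roughly -----------------------------------
BASE_URL=http://gpu001:8080/v1   # placeholder: use the URL reported for your server

# Wait until the OpenAI-compatible endpoint responds, then send a request
until curl -sf "${BASE_URL}/models" > /dev/null; do
    sleep 30
done
curl -s "${BASE_URL}/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```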
## What to update
Before running this example, update the following in [downstream_job.sbatch](downstream_job.sbatch):