
Commit 759aa41: Update docs
Parent: fb6e205

File tree: 3 files changed, +75 / -44 lines


README.md

Lines changed: 43 additions & 16 deletions
@@ -8,7 +8,7 @@

[![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

- This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec_inf/launch_server.sh), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.csv`](vec_inf/config/models.yaml) accordingly.
+ This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.

## Installation

If you are using the Vector cluster environment and you don't need any customization to the inference server environment, run the following to install the package:
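A minimal sketch of that install step, assuming the package is published on PyPI under the name `vec-inf`:

```bash
# Install the vec-inf CLI into the current environment
pip install vec-inf
```
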
@@ -22,8 +22,7 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up

### `launch` command

- The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
- the user to send requests for inference.
+ The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
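For reference, the basic invocation is just the model name passed to `launch` (the same command appears later with a `--qos` override):

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct
```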

@@ -36,7 +35,7 @@ You should see an output like the following:

#### Overrides

- Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+ Models that are already supported by `vec-inf` will be launched using the cached configuration or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
overridden. For example, if `qos` is to be overridden:

```bash
@@ -46,11 +45,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>

#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
- * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+ * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
* Your model weights directory should contain HuggingFace format weights.
- * You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
- Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
- should be specified in that config file.
+ * You should specify your model configuration by:
+   * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+   * Using launch command options to specify your model setup.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -64,7 +63,7 @@ models:
model_family: Qwen2.5
model_variant: 7B-Instruct-1M
model_type: LLM
- num_gpus: 2
+ gpus_per_node: 1
num_nodes: 1
vocab_size: 152064
max_model_len: 1010000
@@ -74,9 +73,6 @@ models:
qos: m2
time: 08:00:00
partition: a40
- data_type: auto
- venv: singularity
- log_dir: default
model_weights_parent_dir: /h/<username>/model-weights
```
@@ -86,7 +82,7 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

- Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+ Note that other parameters, such as `data_type` and `log_dir`, can also be added to the config but are not shown in this example.
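With the config in place, a sketch of the end-to-end custom launch, assuming the model is launched under its `$MODEL_FAMILY-$MODEL_VARIANT` name:

```bash
# Point vec-inf at the custom model configuration, then launch by name
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
vec-inf launch Qwen2.5-7B-Instruct-1M
```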

### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
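As a sketch, assuming the Slurm job ID printed by `launch` is passed straight to `status`:

```bash
# Replace <slurm_job_id> with the job ID reported by the launch command
vec-inf status <slurm_job_id>
```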
@@ -133,6 +129,7 @@ vec-inf list
```
<img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">

+ NOTE: The above screenshot does not represent the full list of models supported.

You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:

```bash
@@ -143,9 +140,39 @@ vec-inf list Meta-Llama-3.1-70B-Instruct
The `launch`, `list`, and `status` commands support `--json-mode`, where the command output is structured as a JSON string.
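For example, a minimal sketch of capturing the `list` output as JSON:

```bash
vec-inf list --json-mode
```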

## Send inference requests
- Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
- > {"id":"cmpl-c08d8946224747af9cce9f4d9f36ceb3","object":"text_completion","created":1725394970,"model":"Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" is a question that many people may wonder. The answer is, of course, Ottawa. But if","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
-
+ Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
+
+ ```json
+ {
+   "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
+   "choices": [
+     {
+       "finish_reason":"stop",
+       "index":0,
+       "logprobs":null,
+       "message": {
+         "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+         "role":"assistant",
+         "function_call":null,
+         "tool_calls":[],
+         "reasoning_content":null
+       },
+       "stop_reason":null
+     }
+   ],
+   "created":1742496683,
+   "model":"Meta-Llama-3.1-8B-Instruct",
+   "object":"chat.completion",
+   "system_fingerprint":null,
+   "usage": {
+     "completion_tokens":66,
+     "prompt_tokens":32,
+     "total_tokens":98,
+     "prompt_tokens_details":null
+   },
+   "prompt_logprobs":null
+ }
+ ```
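As a rough sketch of the request such a script sends, assuming the server address is stored in a placeholder `SERVER_URL` variable and the server exposes the standard OpenAI-compatible `/v1/chat/completions` route:

```bash
# Send a single chat completion request to the running server
curl "$SERVER_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```
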
**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
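A sketch of a single-image request under the same assumptions, using the OpenAI-style `image_url` content part (the model name and image URL are placeholders):

```bash
# One image per prompt; pair it with a text part in the same message
curl "$SERVER_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<multimodal_model_name>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
          ]
        }]
      }'
```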

## SSH tunnel from your local device

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ user_guide
```

- This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/launch_server.sh), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm) and [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+ This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_helper.py), [`cli/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm), and model configurations in [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.

## Installation

docs/source/user_guide.md

Lines changed: 31 additions & 27 deletions
@@ -4,8 +4,7 @@

### `launch` command

- The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
- the user to send requests for inference.
+ The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:

@@ -18,8 +17,7 @@ You should see an output like the following:

#### Overrides

- Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
- overriden. For example, if `qos` is to be overriden:
+ Models that are already supported by `vec-inf` will be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be overridden. For example, if `qos` is to be overridden:

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
@@ -28,11 +26,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>

#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
- * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+ * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
* Your model weights directory should contain HuggingFace format weights.
- * You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
- Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
- should be specified in that config file.
+ * You should specify your model configuration by:
+   * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+   * Using launch command options to specify your model setup.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -46,7 +44,7 @@ models:
model_family: Qwen2.5
model_variant: 7B-Instruct-1M
model_type: LLM
- num_gpus: 2
+ num_gpus: 1
num_nodes: 1
vocab_size: 152064
max_model_len: 1010000
@@ -56,9 +54,6 @@ models:
qos: m2
time: 08:00:00
partition: a40
- data_type: auto
- venv: singularity
- log_dir: default
model_weights_parent_dir: /h/<username>/model-weights
```
@@ -68,7 +63,7 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

- Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+ Note that other parameters, such as `data_type` and `log_dir`, can also be added to the config but are not shown in this example.

### `status` command

@@ -136,28 +131,37 @@ vec-inf list Meta-Llama-3.1-70B-Instruct

## Send inference requests

- Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
+ Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:

```json
{
-   "id": "cmpl-c08d8946224747af9cce9f4d9f36ceb3",
-   "object": "text_completion",
-   "created": 1725394970,
-   "model": "Meta-Llama-3.1-8B-Instruct",
+   "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
  "choices": [
    {
-       "index": 0,
-       "text": " is a question that many people may wonder. The answer is, of course, Ottawa. But if",
-       "logprobs": null,
-       "finish_reason": "length",
-       "stop_reason": null
+       "finish_reason":"stop",
+       "index":0,
+       "logprobs":null,
+       "message": {
+         "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+         "role":"assistant",
+         "function_call":null,
+         "tool_calls":[],
+         "reasoning_content":null
+       },
+       "stop_reason":null
    }
  ],
+   "created":1742496683,
+   "model":"Meta-Llama-3.1-8B-Instruct",
+   "object":"chat.completion",
+   "system_fingerprint":null,
  "usage": {
-     "prompt_tokens": 8,
-     "total_tokens": 28,
-     "completion_tokens": 20
-   }
+     "completion_tokens":66,
+     "prompt_tokens":32,
+     "total_tokens":98,
+     "prompt_tokens_details":null
+   },
+   "prompt_logprobs":null
}
```
