
Commit afa664f

Merge pull request #100 from VectorInstitute/feature/decouple_from_v
* Decouple VI from Vaughan cluster: all cluster-related (Slurm) variables have been decoupled from the code base into a dedicated file
* Update dynamic Slurm script generation to use a template instead of hard-coded strings, add Slurm configuration to generated script to enhance reusability
* Parse vLLM engine args dynamically instead of processing selected arguments
* Updated module and function docstrings
* Updated docs, bumped version
2 parents f438208 + 900f14b commit afa664f

31 files changed: +2700 −1382 lines
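
The two headline changes in the commit message, templated Slurm script generation and pass-through vLLM engine arguments, can be pictured with a small Python sketch. This is purely illustrative: the template, the `render_slurm_script` helper, the field names, and the `vllm serve` invocation below are hypothetical stand-ins, not the repository's actual implementation, which keeps its cluster variables and Slurm template in dedicated files.

```python
# Hypothetical sketch of template-based Slurm script generation with
# pass-through vLLM engine args; not vec-inf's actual code.
from string import Template

SLURM_TEMPLATE = Template("""\
#!/bin/bash
#SBATCH --job-name=$model_name
#SBATCH --partition=$partition
#SBATCH --qos=$qos
#SBATCH --time=$time
#SBATCH --nodes=$num_nodes
#SBATCH --gpus-per-node=$gpus_per_node

vllm serve $model_weights $engine_args
""")


def render_slurm_script(config: dict, vllm_args: str) -> str:
    """Render a launch script; '--a=1,--b=2' is forwarded as '--a=1 --b=2'."""
    engine_args = " ".join(a.strip() for a in vllm_args.split(",") if a.strip())
    return SLURM_TEMPLATE.substitute(config, engine_args=engine_args)


print(
    render_slurm_script(
        {
            "model_name": "Meta-Llama-3.1-8B-Instruct",
            "partition": "a40",
            "qos": "m2",
            "time": "08:00:00",
            "num_nodes": 1,
            "gpus_per_node": 1,
            "model_weights": "/model-weights/Meta-Llama-3.1-8B-Instruct",
        },
        "--max-model-len=65536,--compilation-config=3",
    )
)
```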

README.md

Lines changed: 41 additions & 58 deletions
@@ -3,12 +3,13 @@
----------------------------------------------------

[![PyPI](https://img.shields.io/pypi/v/vec-inf)](https://pypi.org/project/vec-inf)
+[![downloads](https://img.shields.io/pypi/dm/vec-inf)]
[![code checks](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml)
[![docs](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml)
[![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/main/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/main)
![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/_vars.py`](vec_inf/client/_vars.py), [`vec_inf/client/_config.py`](vec_inf/client/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.

## Installation
If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install the package:
@@ -20,7 +21,9 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up

## Usage

-### `launch` command
+Vector Inference provides two user interfaces: a CLI and an API.
+
+### CLI

The `launch` command allows users to deploy a model as a Slurm job. If the job launches successfully, a URL endpoint is exposed for the user to send inference requests.

@@ -31,18 +34,26 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:

-<img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/883e6a5b-8016-4837-8fdf-39097dfb18bf">
+<img width="600" alt="launch_image" src="https://github.com/user-attachments/assets/a72a99fd-4bf2-408e-8850-359761d96c4f">


#### Overrides

-Models that are already supported by `vec-inf` would be launched using the cached configuration or [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+Models that are already supported by `vec-inf` are launched using the cached configuration (set in [slurm_vars.py](vec_inf/client/slurm_vars.py)) or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
overridden. For example, if `qos` is to be overridden:

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
```

+To override the default vLLM engine arguments, you can specify the engine arguments in a comma-separated string:
+
+```bash
+vec-inf launch Meta-Llama-3.1-8B-Instruct --vllm-args '--max-model-len=65536,--compilation-config=3'
+```
+
+The full list of vLLM engine arguments can be found [here](https://docs.vllm.ai/en/stable/serving/engine_args.html); make sure you select the correct vLLM version.
+
#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
@@ -67,14 +78,14 @@ models:
    gpus_per_node: 1
    num_nodes: 1
    vocab_size: 152064
-   max_model_len: 1010000
-   max_num_seqs: 256
-   pipeline_parallelism: true
-   enforce_eager: false
    qos: m2
    time: 08:00:00
    partition: a40
    model_weights_parent_dir: /h/<username>/model-weights
+   vllm_args:
+     --max-model-len: 1010000
+     --max-num-seqs: 256
+     --compilation-config: 3
```

You would then set the `VEC_INF_CONFIG` path using:
@@ -83,68 +94,40 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

-Note that there are other parameters that can also be added to the config but not shown in this example, such as `data_type` and `log_dir`.
-
-### `status` command
-You can check the inference server status by providing the Slurm job ID to the `status` command:
-```bash
-vec-inf status 15373800
-```
-
-If the server is pending for resources, you should see an output like this:
-
-<img width="400" alt="status_pending_img" src="https://github.com/user-attachments/assets/b659c302-eae1-4560-b7a9-14eb3a822a2f">
-
-When the server is ready, you should see an output like this:
+Note that there are other parameters that can also be added to the config but not shown in this example; check the [`ModelConfig`](vec_inf/client/config.py) for details.

-<img width="400" alt="status_ready_img" src="https://github.com/user-attachments/assets/672986c2-736c-41ce-ac7c-1fb585cdcb0d">
+#### Other commands

-There are 5 possible states:
+* `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
+* `metrics`: Streams performance metrics to the console.
+* `shutdown`: Shut down a model by providing its Slurm job ID.
+* `list`: List all available model names, or view the default/cached configuration of a specific model, `--json-mode` supported.

-* **PENDING**: Job submitted to Slurm, but not executed yet. Job pending reason will be shown.
-* **LAUNCHING**: Job is running but the server is not ready yet.
-* **READY**: Inference server running and ready to take requests.
-* **FAILED**: Inference server in an unhealthy state. Job failed reason will be shown.
-* **SHUTDOWN**: Inference server is shutdown/cancelled.
+For more details on the usage of these commands, refer to the [User Guide](https://vectorinstitute.github.io/vector-inference/user_guide/).

-Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
+### API

-### `metrics` command
-Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
-```bash
-vec-inf metrics 15373800
-```
-
-And you will see the performance metrics streamed to your console, note that the metrics are updated with a 2-second interval.
-
-<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/3ee143d0-1a71-4944-bbd7-4c3299bf0339">
-
-### `shutdown` command
-Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
-```bash
-vec-inf shutdown 15373800
+Example:

-> Shutting down model with Slurm Job ID: 15373800
+```python
+>>> from vec_inf.api import VecInfClient
+>>> client = VecInfClient()
+>>> response = client.launch_model("Meta-Llama-3.1-8B-Instruct")
+>>> job_id = response.slurm_job_id
+>>> status = client.get_status(job_id)
+>>> if status.status == ModelStatus.READY:
+...     print(f"Model is ready at {status.base_url}")
+>>> client.shutdown_model(job_id)
```

-### `list` command
-You call view the full list of available models by running the `list` command:
-```bash
-vec-inf list
-```
-<img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">
-
-NOTE: The above screenshot does not represent the full list of models supported.
+For details on the usage of the API, refer to the [API Reference](https://vectorinstitute.github.io/vector-inference/api/).

-You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
-```bash
-vec-inf list Meta-Llama-3.1-70B-Instruct
-```
-<img width="500" alt="list_model_img" src="https://github.com/user-attachments/assets/34e53937-2d86-443e-85f6-34e408653ddb">
+## Check Job Configuration

-`launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.
+With every model launch, a Slurm script is generated dynamically based on the job and model configuration. Once the Slurm job is queued, the generated Slurm script is moved to the log directory for reproducibility, located at `$log_dir/$model_family/$model_name.$slurm_job_id/$model_name.$slurm_job_id.slurm`. In the same directory you can also find a JSON file with the same name that captures the launch configuration and will have an entry for the server URL once the server is ready.

## Send inference requests
+
Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:

```json

docs/api.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ This section documents the Python API for vector-inference.

## Data Models

-::: vec_inf.client._models
+::: vec_inf.client.models
    options:
      show_root_heading: true
      members: true

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Vector Inference: Easy inference on Slurm clusters

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/_vars.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/_vars.py), [`vec_inf/client/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm) and [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.

## Installation
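
The README's "Send inference requests" section above points at the scripts under `examples/`. As a rough illustration of the kind of request those scripts send, here is a minimal sketch that assumes the launched server exposes vLLM's OpenAI-compatible API and that the `openai` Python package is installed; the base URL (reported by `vec-inf status` once the model is `READY`) and the model identifier below are placeholders.

```python
# Hypothetical client-side sketch; the URL and model name are placeholders,
# not values produced by this commit.
from openai import OpenAI

# Point the client at the inference server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://<server-node>:<port>/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about Slurm."}],
)
print(completion.choices[0].message.content)
```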
