
Commit df4edb8

Merge branch 'main' into clean_logs

2 parents df4f14e + 645223b

File tree: 21 files changed, +990 -46 lines changed

.github/workflows/docker.yml

Lines changed: 1 addition & 1 deletion

@@ -45,7 +45,7 @@ jobs:
           images: vectorinstitute/vector-inference

       - name: Build and push Docker image
-        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1
+        uses: docker/build-push-action@1dc73863535b631f98b2378be8619f83b136f4a0
         with:
           context: .
           file: ./Dockerfile

.github/workflows/unit_tests.yml

Lines changed: 1 addition & 1 deletion

@@ -72,7 +72,7 @@ jobs:
           uv run pytest tests/test_imports.py

       - name: Upload coverage to Codecov
-        uses: codecov/codecov-action@v5.4.2
+        uses: codecov/codecov-action@v5.4.3
         with:
           token: ${{ secrets.CODECOV_TOKEN }}
           file: ./coverage.xml

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ repos:
       - id: check-toml

   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: 'v0.11.8'
+    rev: 'v0.11.11'
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]

README.md

Lines changed: 10 additions & 4 deletions

@@ -7,6 +7,7 @@
 [![code checks](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml)
 [![docs](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml)
 [![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/main/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/main)
+[![vLLM](https://img.shields.io/badge/vllm-0.8.5.post1-blue)](https://docs.vllm.ai/en/v0.8.5.post1/index.html)
 ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

 This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.

@@ -17,7 +18,7 @@ If you are using the Vector cluster environment, and you don't need any customiz
 ```bash
 pip install vec-inf
 ```
-Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package
+Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.

 ## Usage

@@ -85,7 +86,7 @@ models:
     vllm_args:
       --max-model-len: 1010000
       --max-num-seqs: 256
-      --compilation-confi: 3
+      --compilation-config: 3
 ```

 You would then set the `VEC_INF_CONFIG` path using:
@@ -94,7 +95,11 @@ You would then set the `VEC_INF_CONFIG` path using:
 export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 ```

-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](vec_inf/client/config.py) for details.
+**NOTE**
+* There are other parameters that can also be added to the config but are not shown in this example; check the [`ModelConfig`](vec_inf/client/config.py) for details.
+* Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments. The default parallel size for any parallelization defaults to 1, so none of the sizes were set specifically in this example.
+* For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that have BF16 as the default type, when using a non-Ampere GPU, use FP16 instead, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.

 #### Other commands

@@ -161,8 +166,9 @@ Once the inference server is ready, you can start sending in inference requests.
     },
     "prompt_logprobs":null
 }
+
 ```
-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. Mistral family. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.

 ## SSH tunnel from your local device
 If you want to run inference from your local device, you can open a SSH tunnel to your cluster environment like the following:
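To put the non-Ampere note above into practice, here is a minimal sketch of a custom config. The model name and values are illustrative and not part of this commit; the full set of supported fields is defined in `ModelConfig`.

```bash
# Hypothetical example only: writes a custom config that applies the FP16 note
# above for a non-Ampere partition (e.g. rtx6000), then points vec-inf at it.
# The model name and values are placeholders; see ModelConfig for all fields.
cat > "$HOME/my-model-config.yaml" <<'EOF'
models:
  Qwen2.5-7B-Instruct:
    vllm_args:
      --max-num-seqs: 256
      --dtype: float16   # BF16 isn't supported on non-Ampere GPUs
EOF

export VEC_INF_CONFIG="$HOME/my-model-config.yaml"
```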

docs/index.md

Lines changed: 1 addition & 1 deletion

@@ -10,4 +10,4 @@ If you are using the Vector cluster environment, and you don't need any customiz
 pip install vec-inf
 ```

-Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package.
+Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.

docs/user_guide.md

Lines changed: 6 additions & 3 deletions

@@ -91,7 +91,11 @@ You would then set the `VEC_INF_CONFIG` path using:
 export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 ```

-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/config.py) for details.
+**NOTE**
+* There are other parameters that can also be added to the config but are not shown in this example; check the [`ModelConfig`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/config.py) for details.
+* Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments. The default parallel size for any parallelization defaults to 1, so none of the sizes were set specifically in this example.
+* For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that have BF16 as the default type, when using a non-Ampere GPU, use FP16 instead, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.

 ### `status` command

@@ -254,8 +258,7 @@ Once the inference server is ready, you can start sending in inference requests.
 }
 ```

-
-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. Mistral family. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.

 ## SSH tunnel from your local device
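As a companion to the chat-template note, here is a hedged example of a `ChatCompletion` request against a launched server's OpenAI-compatible endpoint. The base URL placeholder comes from the `status` command; the model name and prompt are illustrative only.

```bash
# Hedged example: ChatCompletion request to a launched server.
# <BASE_URL> is the server's base URL as reported by `vec-inf status`;
# the model name and prompt are placeholders.
curl <BASE_URL>/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of Canada?"}],
        "max_tokens": 20
      }'
```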

examples/README.md

Lines changed: 1 addition & 0 deletions

@@ -9,3 +9,4 @@
 - [`logits.py`](logits/logits.py): Python example of getting logits from hosted model.
 - [`api`](api): Examples for using the Python API
   - [`basic_usage.py`](api/basic_usage.py): Basic Python example demonstrating the Vector Inference API
+- [`slurm_dependency`](slurm_dependency): Example of launching a model with `vec-inf` and running a downstream SLURM job that waits for the server to be ready before sending a request.

examples/slurm_dependency/README.md (new file)

Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
# SLURM Dependency Workflow Example

This example demonstrates how to launch a model server using `vec-inf`, and run a downstream SLURM job that waits for the server to become ready before querying it.

## Files

This directory contains the following:

1. [run_workflow.sh](run_workflow.sh)
   Launches the model server and submits the downstream job with a dependency, so it starts only after the server job begins running.

2. [downstream_job.sbatch](downstream_job.sbatch)
   A SLURM job script that runs the downstream logic (e.g., prompting the model).

3. [run_downstream.py](run_downstream.py)
   A Python script that waits until the inference server is ready, then sends a request using the OpenAI-compatible API.

## What to update

Before running this example, update the following in [downstream_job.sbatch](downstream_job.sbatch):

- `--job-name`, `--output`, and `--error` paths
- Virtual environment path in the `source` line
- SLURM resource configuration (e.g., partition, memory, GPU)

Also update the model name in [run_downstream.py](run_downstream.py) to match what you're launching.

## Running the example

First, activate a virtual environment where `vec-inf` is installed. Then, from this directory, run:

```bash
bash run_workflow.sh
```
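`run_workflow.sh` itself is not included in this excerpt of the diff. Below is a minimal sketch of what such a script could look like, assuming the server's Slurm job ID can be parsed from the `vec-inf launch` output (the parsing and model name are assumptions, not the commit's actual script) and is forwarded to the downstream job via `sbatch --export`.

```bash
#!/bin/bash
# Hypothetical sketch of run_workflow.sh; the actual script in the commit may differ.

# Launch the model server. How the job ID appears in the output depends on the
# vec-inf CLI output format, so the parsing below is an assumption.
LAUNCH_OUTPUT=$(vec-inf launch Meta-Llama-3.1-8B-Instruct)
SERVER_JOB_ID=$(echo "$LAUNCH_OUTPUT" | grep -oE '[0-9]+' | head -n 1)
echo "Server job ID: $SERVER_JOB_ID"

# Submit the downstream job: --dependency=after makes it start only once the
# server job begins running, and --export passes the job ID into the sbatch script.
sbatch --dependency=after:"$SERVER_JOB_ID" \
       --export=ALL,SERVER_JOB_ID="$SERVER_JOB_ID" \
       downstream_job.sbatch
```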

examples/slurm_dependency/downstream_job.sbatch (new file)

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
#!/bin/bash
#SBATCH --job-name=Meta-Llama-3.1-8B-Instruct-downstream
#SBATCH --partition=a40
#SBATCH --qos=m2
#SBATCH --time=08:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=$HOME/.vec-inf-logs/Meta-Llama-3.1-8B-Instruct-downstream.%j.out
#SBATCH --error=$HOME/.vec-inf-logs/Meta-Llama-3.1-8B-Instruct-downstream.%j.err

# Activate your environment
# TODO: update this path to match your venv location
source $HOME/vector-inference/.venv/bin/activate

# Wait for the server to be ready using the job ID passed as CLI arg
python run_downstream.py "$SERVER_JOB_ID"

examples/slurm_dependency/run_downstream.py (new file)

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
"""Example script to query a launched model via the OpenAI-compatible API."""

import sys

from openai import OpenAI

from vec_inf.client import VecInfClient


if len(sys.argv) < 2:
    raise ValueError("Expected server job ID as the first argument.")
job_id = int(sys.argv[1])

vi_client = VecInfClient()
print(f"Waiting for SLURM job {job_id} to be ready...")
status = vi_client.wait_until_ready(slurm_job_id=job_id)
print(f"Server is ready at {status.base_url}")

api_client = OpenAI(base_url=status.base_url, api_key="EMPTY")
resp = api_client.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    prompt="Where is the capital of Canada?",
    max_tokens=20,
)

print(resp)
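The script is not tied to the sbatch wrapper; it can also be run directly once a server job has been launched. A usage sketch with a placeholder job ID:

```bash
# Assumes vec-inf is installed in the active environment and a server job
# was already launched; 1234567 is a placeholder Slurm job ID.
python run_downstream.py 1234567
```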
