----------------------------------------------------

[](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml)
[](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs_deploy.yml)
[](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)

This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec_inf/launch_server.sh), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.

## Installation
If you are using the Vector cluster environment and don't need any customization to the inference server environment, run the following to install the package:

```bash
pip install vec-inf
```
Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package.
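
If you go the container route, a typical flow might look like the sketch below; the image tag and run flags are illustrative assumptions rather than values fixed by this repository:

```bash
# Build an image from the repository root (the tag name is an example)
docker build -t vec-inf-env .

# Start an interactive shell with GPU access; adjust the flags to your setup
docker run --gpus all -it --rm vec-inf-env bash
```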
## Usage

### `launch` command

The `launch` command allows users to deploy a model as a Slurm job. If the job launches successfully, a URL endpoint is exposed for the user to send inference requests.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct
```

You should see an output like the following:

<img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/ab658552-18b2-47e0-bf70-e539c3b898d5">
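
Once the server is ready, you can send requests to the exposed endpoint. Here is a minimal sketch using `curl`, assuming an OpenAI-compatible base URL of the form `http://<server_host>:<port>/v1`; substitute the endpoint reported for your job:

```bash
# <server_host> and <port> are placeholders; use the endpoint reported for your job
curl http://<server_host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of Canada?"}]
      }'
```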
#### Overrides

Models that are already supported by `vec-inf` are launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters; use `vec-inf launch --help` to see the full list of parameters that can be overridden. For example, to override `qos`:
```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
```

#### Custom models
You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
* Your model weights directory should contain HuggingFace format weights.
* You should create a custom configuration file for your model and specify its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file; all the parameters for the model should be specified there.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example of deploying a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model, which is not in the default list of models, using a custom user config. In this case, the model weights are assumed to be downloaded to a `model-weights` directory inside the user's home directory. The weights directory follows the naming convention, so it would be named `Qwen2.5-7B-Instruct-1M`. The following YAML file would need to be created; let's say it is named `/h/<username>/my-model-config.yaml`.
```yaml
models:
  Qwen2.5-7B-Instruct-1M:
    model_family: Qwen2.5
    model_variant: 7B-Instruct-1M
    model_type: LLM
    num_gpus: 2
    num_nodes: 1
    vocab_size: 152064
    max_model_len: 1010000
    max_num_seqs: 256
    pipeline_parallelism: true
    enforce_eager: false
    qos: m2
    time: 08:00:00
    partition: a40
    data_type: auto
    venv: singularity
    log_dir: default
    model_weights_parent_dir: /h/<username>/model-weights
```
You would then set the `VEC_INF_CONFIG` path using:

```bash
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```
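
With the config file in place, the custom model should then be launchable by the name used as its key in the config, in the same way as a built-in model:

```bash
vec-inf launch Qwen2.5-7B-Instruct-1M
```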
Alternatively, you can use launch parameters to set these values instead of a user-defined config.
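
As a rough sketch, a launch along these lines could stand in for the config file; the flag names below simply mirror the config keys above and are assumptions, so check `vec-inf launch --help` for the exact set your version supports:

```bash
# Flag names mirror the config keys above and are illustrative;
# verify them with `vec-inf launch --help`.
vec-inf launch Qwen2.5-7B-Instruct-1M \
  --model-weights-parent-dir /h/<username>/model-weights \
  --partition a40 \
  --qos m2 \
  --max-num-seqs 256
```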
### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash