
Commit 4db040c

Merge branch 'main' into dependabot/github_actions/docker/build-push-action-6.18.0
2 parents 292b219 + aa7800f commit 4db040c

26 files changed: +2935 -991 lines changed

README.md

Lines changed: 13 additions & 6 deletions
@@ -10,7 +10,7 @@
 [![vLLM](https://img.shields.io/badge/vllm-0.8.5.post1-blue)](https://docs.vllm.ai/en/v0.8.5.post1/index.html)
 ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, follow the instructions in [Installation](#installation).

 ## Installation
 If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install package:
@@ -20,6 +20,11 @@ pip install vec-inf
 ```
 Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.

+If you'd like to use `vec-inf` on your own Slurm cluster, you would need to update the configuration files; there are 3 ways to do it:
+* Clone the repository and update the `environment.yaml` and the `models.yaml` file in [`vec_inf/config`](vec_inf/config/), then install from source by running `pip install .`.
+* The package would try to look for cached configuration files in your environment before using the default configuration. The default cached configuration directory path points to `/model-weights/vec-inf-shared`; you would need to create an `environment.yaml` and a `models.yaml` following the format of these files in [`vec_inf/config`](vec_inf/config/).
+* The package would also look for an environment variable `VEC_INF_CONFIG_DIR`. You can put your `environment.yaml` and `models.yaml` in a directory of your choice and set the environment variable `VEC_INF_CONFIG_DIR` to point to that location.
+
 ## Usage

 Vector Inference provides 2 user interfaces, a CLI and an API
@@ -61,7 +66,8 @@ You can also launch your own custom model as long as the model architecture is [
 * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
 * You should specify your model configuration by:
-    * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+    * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_MODEL_CONFIG` (This one will supersede `VEC_INF_CONFIG_DIR` if that is also set). Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+    * Add your model configuration to the cached `models.yaml` in your cluster environment (if you have write access to the cached configuration directory).
     * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command ](#list-command).

@@ -89,10 +95,10 @@ models:
       --compilation-config: 3
 ```

-You would then set the `VEC_INF_CONFIG` path using:
+You would then set the `VEC_INF_MODEL_CONFIG` path using:

 ```bash
-export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+export VEC_INF_MODEL_CONFIG=/h/<username>/my-model-config.yaml
 ```

 **NOTE**
@@ -103,10 +109,11 @@ export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml

 #### Other commands

-* `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
+* `batch-launch`: Launch multiple model inference servers at once; currently ONLY single-node models are supported.
+* `status`: Check the model status by providing its Slurm job ID.
 * `metrics`: Streams performance metrics to the console.
 * `shutdown`: Shutdown a model by providing its Slurm job ID.
-* `list`: List all available model names, or view the default/cached configuration of a specific model, `--json-mode` supported.
+* `list`: List all available model names, or view the default/cached configuration of a specific model.
 * `cleanup`: Remove old log directories. You can filter by `--model-family`, `--model-name`, `--job-id`, and/or `--before-job-id`. Use `--dry-run` to preview what would be deleted.

 For more details on the usage of these commands, refer to the [User Guide](https://vectorinstitute.github.io/vector-inference/user_guide/)
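
As an illustrative sketch of the `cleanup` filters listed in the README diff above (the model family value is a placeholder, not a command from the docs):

```bash
# Preview which old log directories would be removed for one model family,
# without actually deleting anything.
vec-inf cleanup --model-family Meta-Llama-3.1 --dry-run
```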

docs/index.md

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Vector Inference: Easy inference on Slurm clusters

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, follow the instructions in [Installation](#installation).

 ## Installation

@@ -11,3 +11,8 @@ pip install vec-inf
 ```

 Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
+
+If you'd like to use `vec-inf` on your own Slurm cluster, you would need to update the configuration files; there are 3 ways to do it:
+* Clone the repository and update the `environment.yaml` and the `models.yaml` file in [`vec_inf/config`](vec_inf/config/), then install from source by running `pip install .`.
+* The package would try to look for cached configuration files in your environment before using the default configuration. The default cached configuration directory path points to `/model-weights/vec-inf-shared`; you would need to create an `environment.yaml` and a `models.yaml` following the format of these files in [`vec_inf/config`](vec_inf/config/).
+* The package would also look for an environment variable `VEC_INF_CONFIG_DIR`. You can put your `environment.yaml` and `models.yaml` in a directory of your choice and set the environment variable `VEC_INF_CONFIG_DIR` to point to that location.
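
A minimal sketch of the third option above; the directory path is a placeholder, and the `cp` line assumes you already have locally edited copies of the two config files. Only `environment.yaml`, `models.yaml`, and `VEC_INF_CONFIG_DIR` are names taken from the docs:

```bash
# Hypothetical location; any directory readable on the cluster works.
mkdir -p ~/vec-inf-config
# Copies of environment.yaml and models.yaml, edited for your cluster.
cp environment.yaml models.yaml ~/vec-inf-config/
export VEC_INF_CONFIG_DIR=~/vec-inf-config
```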

docs/user_guide.md

Lines changed: 55 additions & 5 deletions
@@ -4,7 +4,7 @@

 ### `launch` command

-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.
+The `launch` command allows users to launch an OpenAI-compatible model inference server as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

 We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
@@ -58,7 +58,8 @@ You can also launch your own custom model as long as the model architecture is [
 * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
 * You should specify your model configuration by:
-    * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+    * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_MODEL_CONFIG` (This one will supersede `VEC_INF_CONFIG_DIR` if that is also set). Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+    * Add your model configuration to the cached `models.yaml` in your cluster environment (if you have write access to the cached configuration directory).
     * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command ](#list-command).

@@ -85,10 +86,10 @@ models:
       --max-num-seqs: 256
 ```

-You would then set the `VEC_INF_CONFIG` path using:
+You would then set the `VEC_INF_MODEL_CONFIG` path using:

 ```bash
-export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+export VEC_INF_MODEL_CONFIG=/h/<username>/my-model-config.yaml
 ```

 **NOTE**
@@ -97,6 +98,53 @@ export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 * For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that have BF16 as the default type, when using a non-Ampere GPU, use FP16 instead, i.e. `--dtype: float16`.
 * Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set them for models that require multiple nodes of GPUs.

+### `batch-launch` command
+
+The `batch-launch` command allows users to launch multiple inference servers at once; here is an example of launching 2 models:
+
+```bash
+vec-inf batch-launch DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-PRM-7B
+```
+
+You should see an output like the following:
+
+```
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Job Config ┃ Value ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ Slurm Job ID │ 17480109 │
+│ Slurm Job Name │ BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5-Math-PRM-7B │
+│ Model Name │ DeepSeek-R1-Distill-Qwen-7B │
+│ Partition │ a40 │
+│ QoS │ m2 │
+│ Time Limit │ 08:00:00 │
+│ Num Nodes │ 1 │
+│ GPUs/Node │ 1 │
+│ CPUs/Task │ 16 │
+│ Memory/Node │ 64G │
+│ Log Directory │ /h/marshallw/.vec-inf-logs/BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5… │
+│ Model Name │ Qwen2.5-Math-PRM-7B │
+│ Partition │ a40 │
+│ QoS │ m2 │
+│ Time Limit │ 08:00:00 │
+│ Num Nodes │ 1 │
+│ GPUs/Node │ 1 │
+│ CPUs/Task │ 16 │
+│ Memory/Node │ 64G │
+│ Log Directory │ /h/marshallw/.vec-inf-logs/BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5… │
+└────────────────┴─────────────────────────────────────────────────────────────────────────┘
+```
+
+The inference servers will begin launching only after all requested resources have been allocated, preventing resource waste. Unlike the `launch` command, `batch-launch` does not accept additional launch parameters from the command line. Users must either:
+
+- Specify a batch launch configuration file using the `--batch-config` option, or
+- Ensure model launch configurations are available at the default location (cached config or user-defined `VEC_INF_CONFIG`)
+
+Since batch launches use heterogeneous jobs, users can request different partitions and resource amounts for each model. After launch, you can monitor individual servers using the standard commands (`status`, `metrics`, etc.) by providing the specific Slurm job ID for each server (e.g. 17480109+0, 17480109+1).
+
+**NOTE**
+* Currently only models that can fit on a single node (regardless of the node type) are supported; multi-node launches will be available in a future update.
+
 ### `status` command

 You can check the inference server status by providing the Slurm job ID to the `status` command:
@@ -138,7 +186,9 @@ There are 5 possible states:
 * **FAILED**: Inference server in an unhealthy state. Job failed reason will be shown.
 * **SHUTDOWN**: Inference server is shutdown/cancelled.

-Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
+**Note**
+* The base URL is only available when model is in `READY` state.
+* For servers launched with `batch-launch`, the job ID should follow the format of "MAIN_JOB_ID+OFFSET" (e.g. 17480109+0, 17480109+1).

 ### `metrics` command
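
As a usage sketch of the note above, checking one server from a batch launch would look like the following; the job ID is the one from the earlier `batch-launch` example output, and the positional-argument form simply mirrors the other CLI examples in the guide:

```bash
# Check the first server of the heterogeneous batch job (offset +0).
vec-inf status 17480109+0
```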

examples/slurm_dependency/run_downstream.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@

 if len(sys.argv) < 2:
     raise ValueError("Expected server job ID as the first argument.")
-job_id = int(sys.argv[1])
+job_id = sys.argv[1]

 vi_client = VecInfClient()
 print(f"Waiting for SLURM job {job_id} to be ready...")
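
The change above keeps the job ID as a string rather than casting it to `int`. A minimal standalone sketch (not part of the repository) of why the cast breaks heterogeneous batch-launch IDs of the form "MAIN_JOB_ID+OFFSET":

```python
# Heterogeneous Slurm job IDs such as "17480109+0" are not plain integers,
# so int() raises ValueError and would reject a valid batch-launch job ID.
for job_id in ["17480109", "17480109+0"]:
    try:
        print(f"{job_id!r} parses as int: {int(job_id)}")
    except ValueError:
        print(f"{job_id!r} must be passed through as a string")
```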

tests/test_imports.py

Lines changed: 3 additions & 2 deletions
@@ -22,11 +22,12 @@ def test_imports(self):
             import vec_inf.client._exceptions
             import vec_inf.client._helper
             import vec_inf.client._slurm_script_generator
+            import vec_inf.client._slurm_templates
+            import vec_inf.client._slurm_vars
             import vec_inf.client._utils
             import vec_inf.client.api
             import vec_inf.client.config
-            import vec_inf.client.models
-            import vec_inf.client.slurm_vars  # noqa: F401
+            import vec_inf.client.models  # noqa: F401

         except ImportError as e:
             pytest.fail(f"Import failed: {e}")
