
Commit 759aa41: Update docs
Parent: fb6e205

File tree: 3 files changed, +75 / -44 lines


README.md

Lines changed: 43 additions & 16 deletions
@@ -8,7 +8,7 @@

[![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

- This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec_inf/launch_server.sh), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.csv`](vec_inf/config/models.yaml) accordingly.
+ This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.

## Installation

If you are using the Vector cluster environment and you don't need any customization to the inference server environment, run the following to install the package:
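A minimal sketch of that install step, assuming the package is published on PyPI under the name `vec-inf`:

```bash
# Install the vec-inf CLI into the current environment
pip install vec-inf
```
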
@@ -22,8 +22,7 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up

### `launch` command

- The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
- the user to send requests for inference.
+ The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
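For reference, the basic invocation is just the model name passed to `launch` (the same command appears later with a `--qos` override):

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct
```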

@@ -36,7 +35,7 @@ You should see an output like the following:

#### Overrides

- Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+ Models that are already supported by `vec-inf` will be launched using the cached configuration or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
overridden. For example, if `qos` is to be overridden:

```bash
@@ -46,11 +45,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>

#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
- * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+ * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
* Your model weights directory should contain HuggingFace format weights.
- * You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
- Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
- should be specified in that config file.
+ * You should specify your model configuration by:
+   * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+   * Using launch command options to specify your model setup.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -64,7 +63,7 @@ models:
model_family: Qwen2.5
model_variant: 7B-Instruct-1M
model_type: LLM
- num_gpus: 2
+ gpus_per_node: 1
num_nodes: 1
vocab_size: 152064
max_model_len: 1010000
@@ -74,9 +73,6 @@ models:
qos: m2
time: 08:00:00
partition: a40
- data_type: auto
- venv: singularity
- log_dir: default
model_weights_parent_dir: /h/<username>/model-weights
```
@@ -86,7 +82,7 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

- Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+ Note that other parameters, such as `data_type` and `log_dir`, can also be added to the config but are not shown in this example.
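With the config in place, a sketch of the end-to-end custom launch, assuming the model is launched under its `$MODEL_FAMILY-$MODEL_VARIANT` name:

```bash
# Point vec-inf at the custom model configuration, then launch by name
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
vec-inf launch Qwen2.5-7B-Instruct-1M
```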

### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
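As a sketch, assuming the Slurm job ID printed by `launch` is passed straight to `status`:

```bash
# Replace <slurm_job_id> with the job ID reported by the launch command
vec-inf status <slurm_job_id>
```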
@@ -133,6 +129,7 @@ vec-inf list
```
<img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">

+ NOTE: The above screenshot does not represent the full list of models supported.

You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:

```bash
@@ -143,9 +140,39 @@ vec-inf list Meta-Llama-3.1-70B-Instruct
The `launch`, `list`, and `status` commands support `--json-mode`, where the command output is structured as a JSON string.
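For example, a minimal sketch of capturing the `list` output as JSON:

```bash
vec-inf list --json-mode
```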

## Send inference requests
- Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
- > {"id":"cmpl-c08d8946224747af9cce9f4d9f36ceb3","object":"text_completion","created":1725394970,"model":"Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" is a question that many people may wonder. The answer is, of course, Ottawa. But if","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
-
+ Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
+
+ ```json
+ {
+   "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
+   "choices": [
+     {
+       "finish_reason":"stop",
+       "index":0,
+       "logprobs":null,
+       "message": {
+         "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+         "role":"assistant",
+         "function_call":null,
+         "tool_calls":[],
+         "reasoning_content":null
+       },
+       "stop_reason":null
+     }
+   ],
+   "created":1742496683,
+   "model":"Meta-Llama-3.1-8B-Instruct",
+   "object":"chat.completion",
+   "system_fingerprint":null,
+   "usage": {
+     "completion_tokens":66,
+     "prompt_tokens":32,
+     "total_tokens":98,
+     "prompt_tokens_details":null
+   },
+   "prompt_logprobs":null
+ }
+ ```
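As a rough sketch of the request such a script sends, assuming the server address is stored in a placeholder `SERVER_URL` variable and the server exposes the standard OpenAI-compatible `/v1/chat/completions` route:

```bash
# Send a single chat completion request to the running server
curl "$SERVER_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```
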
**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
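A sketch of a single-image request under the same assumptions, using the OpenAI-style `image_url` content part (the model name and image URL are placeholders):

```bash
# One image per prompt; pair it with a text part in the same message
curl "$SERVER_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<multimodal_model_name>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
          ]
        }]
      }'
```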

## SSH tunnel from your local device

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ user_guide
```

- This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/launch_server.sh), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm) and [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+ This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_helper.py), [`cli/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm), and model configurations in [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.

## Installation

docs/source/user_guide.md

Lines changed: 31 additions & 27 deletions
@@ -4,8 +4,7 @@

### `launch` command

- The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
- the user to send requests for inference.
+ The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:

@@ -18,8 +17,7 @@ You should see an output like the following:

#### Overrides

- Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
- overriden. For example, if `qos` is to be overriden:
+ Models that are already supported by `vec-inf` will be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be overridden. For example, if `qos` is to be overridden:

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
@@ -28,11 +26,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>

#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
- * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+ * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
* Your model weights directory should contain HuggingFace format weights.
- * You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
- Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
- should be specified in that config file.
+ * You should specify your model configuration by:
+   * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+   * Using launch command options to specify your model setup.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -46,7 +44,7 @@ models:
model_family: Qwen2.5
model_variant: 7B-Instruct-1M
model_type: LLM
- num_gpus: 2
+ num_gpus: 1
num_nodes: 1
vocab_size: 152064
max_model_len: 1010000
@@ -56,9 +54,6 @@ models:
qos: m2
time: 08:00:00
partition: a40
- data_type: auto
- venv: singularity
- log_dir: default
model_weights_parent_dir: /h/<username>/model-weights
```
@@ -68,7 +63,7 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

- Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+ Note that other parameters, such as `data_type` and `log_dir`, can also be added to the config but are not shown in this example.

### `status` command

@@ -136,28 +131,37 @@ vec-inf list Meta-Llama-3.1-70B-Instruct

## Send inference requests

- Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
+ Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:

```json
{
-   "id": "cmpl-c08d8946224747af9cce9f4d9f36ceb3",
-   "object": "text_completion",
-   "created": 1725394970,
-   "model": "Meta-Llama-3.1-8B-Instruct",
+   "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
  "choices": [
    {
-       "index": 0,
-       "text": " is a question that many people may wonder. The answer is, of course, Ottawa. But if",
-       "logprobs": null,
-       "finish_reason": "length",
-       "stop_reason": null
+       "finish_reason":"stop",
+       "index":0,
+       "logprobs":null,
+       "message": {
+         "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+         "role":"assistant",
+         "function_call":null,
+         "tool_calls":[],
+         "reasoning_content":null
+       },
+       "stop_reason":null
    }
  ],
+   "created":1742496683,
+   "model":"Meta-Llama-3.1-8B-Instruct",
+   "object":"chat.completion",
+   "system_fingerprint":null,
  "usage": {
-     "prompt_tokens": 8,
-     "total_tokens": 28,
-     "completion_tokens": 20
-   }
+     "completion_tokens":66,
+     "prompt_tokens":32,
+     "total_tokens":98,
+     "prompt_tokens_details":null
+   },
+   "prompt_logprobs":null
}
```
