Commit 3b1e87b

Added some notes for gotchas
1 parent 75e0ae1 commit 3b1e87b

File tree

1 file changed: +7 -3 lines changed

1 file changed

+7
-3
lines changed

README.md

Lines changed: 7 additions & 3 deletions
````diff
@@ -85,7 +85,7 @@ models:
     vllm_args:
       --max-model-len: 1010000
       --max-num-seqs: 256
-      --compilation-confi: 3
+      --compilation-config: 3
 ```

 You would then set the `VEC_INF_CONFIG` path using:
````
````diff
@@ -94,7 +94,11 @@ You would then set the `VEC_INF_CONFIG` path using:
 export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 ```

-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](vec_inf/client/config.py) for details.
+**NOTE**
+* There are other parameters that can be added to the config but are not shown in this example; check the [`ModelConfig`](vec_inf/client/config.py) for details.
+* Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments; the parallel size for any form of parallelization defaults to 1, so no parallel sizes are set explicitly in this example.
+* For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that default to BF16, use FP16 instead when running on a non-Ampere GPU, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.

 #### Other commands

````
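To tie the notes in this hunk together, a minimal sketch of a custom config that combines the arguments mentioned above is shown below; the model name is a placeholder (not a real entry), any fields beyond `vllm_args` are omitted, and the actual schema should be taken from [`ModelConfig`](vec_inf/client/config.py):

```yaml
models:
  My-Model:                      # placeholder name; use the model you are launching
    vllm_args:
      --max-model-len: 1010000
      --max-num-seqs: 256
      --compilation-config: 3    # omit for models that require multiple nodes (see note above)
      # --dtype: float16         # uncomment on non-Ampere partitions, e.g. rtx6000, t4v2
```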
````diff
@@ -161,7 +165,7 @@ Once the inference server is ready, you can start sending in inference requests.
   "prompt_logprobs":null
 }
 ```
-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. the Mistral family. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.

 ## SSH tunnel from your local device
 If you want to run inference from your local device, you can open a SSH tunnel to your cluster environment like the following:
````
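As a rough sketch only, where the port, compute node name, and login host are all placeholders rather than values taken from this README, such a tunnel typically looks like:

```bash
# Forward local port 8080 to port 8080 on the compute node running the inference server;
# replace gpu001, 8080, and the login host with your actual values.
ssh -N -L 8080:gpu001:8080 <username>@<cluster-login-host>
```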

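Returning to the chat-template note in the last hunk: one way this could be wired up, assuming the path below is a placeholder for a Jinja chat template you supply yourself, is to add the argument under the model's `vllm_args`:

```yaml
    vllm_args:
      # Placeholder path; point it at the chat template file for your model
      --chat-template: /h/<username>/chat-templates/mistral.jinja
```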