Commit aa7800f

Merge pull request #123 from VectorInstitute/feature/env_config
Decouple cluster-related environment variables from the package and host them in a configuration file instead.
2 parents b3db973 + 6a3a24c commit aa7800f

File tree

12 files changed: +186 −116 lines changed


README.md

Lines changed: 10 additions & 4 deletions

@@ -10,7 +10,7 @@
 [![vLLM](https://img.shields.io/badge/vllm-0.8.5.post1-blue)](https://docs.vllm.ai/en/v0.8.5.post1/index.html)
 ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)
 
-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, follow the instructions in [Installation](#installation).
 
 ## Installation
 If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install package:
@@ -20,6 +20,11 @@ pip install vec-inf
 ```
 Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
 
+If you'd like to use `vec-inf` on your own Slurm cluster, you will need to update the configuration files. There are 3 ways to do it:
+* Clone the repository, update the `environment.yaml` and `models.yaml` files in [`vec_inf/config`](vec_inf/config/), then install from source by running `pip install .`.
+* The package looks for cached configuration files in your environment before falling back to the default configuration. The default cached configuration directory is `/model-weights/vec-inf-shared`; create an `environment.yaml` and a `models.yaml` there, following the format of the files in [`vec_inf/config`](vec_inf/config/).
+* The package also checks the environment variable `VEC_INF_CONFIG_DIR`. Put your `environment.yaml` and `models.yaml` in a directory of your choice and set `VEC_INF_CONFIG_DIR` to point to that location.
+
 ## Usage
 
 Vector Inference provides 2 user interfaces, a CLI and an API
@@ -61,7 +66,8 @@ You can also launch your own custom model as long as the model architecture is [
 * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
 * You should specify your model configuration by:
-  * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Creating a custom configuration file for your model and specifying its path via the environment variable `VEC_INF_MODEL_CONFIG` (this supersedes `VEC_INF_CONFIG_DIR` if both are set). Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Adding your model configuration to the cached `models.yaml` in your cluster environment (if you have write access to the cached configuration directory).
   * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command](#list-command).
 
@@ -89,10 +95,10 @@ models:
     --compilation-config: 3
 ```
 
-You would then set the `VEC_INF_CONFIG` path using:
+You would then set the `VEC_INF_MODEL_CONFIG` path using:
 
 ```bash
-export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+export VEC_INF_MODEL_CONFIG=/h/<username>/my-model-config.yaml
 ```
 
 **NOTE**
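For readers adapting the package to their own cluster, the key names consumed by `vec_inf/client/_slurm_vars.py` (shown later in this diff) imply an `environment.yaml` of roughly the following shape. Only the key names come from the code; every value below is a made-up placeholder for illustration:

```yaml
# Sketch of an environment.yaml -- key names mirror the lookups in
# vec_inf/client/_slurm_vars.py; all values are hypothetical placeholders.
paths:
  ld_library_path: /usr/local/cuda/lib64       # becomes LD_LIBRARY_PATH
  image_path: /opt/images/vllm.sif             # becomes SINGULARITY_IMAGE
  vllm_nccl_so_path: /usr/lib/libnccl.so.2     # becomes VLLM_NCCL_SO_PATH
containerization:
  module_load_cmd: module load apptainer       # becomes SINGULARITY_LOAD_CMD
  module_name: apptainer                       # becomes SINGULARITY_MODULE_NAME
limits:
  max_gpus_per_node: 8
  max_num_nodes: 4
  max_cpus_per_task: 64
allowed_values:
  qos: [normal, high]
  partition: [a40, a100]
default_args: {}  # per-field defaults consumed by ModelConfig; schema not shown in this diff
```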

docs/index.md

Lines changed: 6 additions & 1 deletion

@@ -1,6 +1,6 @@
 # Vector Inference: Easy inference on Slurm clusters
 
-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, follow the instructions in [Installation](#installation).
 
 ## Installation
 
@@ -11,3 +11,8 @@ pip install vec-inf
 ```
 
 Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
+
+If you'd like to use `vec-inf` on your own Slurm cluster, you will need to update the configuration files. There are 3 ways to do it:
+* Clone the repository, update the `environment.yaml` and `models.yaml` files in [`vec_inf/config`](vec_inf/config/), then install from source by running `pip install .`.
+* The package looks for cached configuration files in your environment before falling back to the default configuration. The default cached configuration directory is `/model-weights/vec-inf-shared`; create an `environment.yaml` and a `models.yaml` there, following the format of the files in [`vec_inf/config`](vec_inf/config/).
+* The package also checks the environment variable `VEC_INF_CONFIG_DIR`. Put your `environment.yaml` and `models.yaml` in a directory of your choice and set `VEC_INF_CONFIG_DIR` to point to that location.

docs/user_guide.md

Lines changed: 4 additions & 3 deletions

@@ -58,7 +58,8 @@ You can also launch your own custom model as long as the model architecture is [
 * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
 * You should specify your model configuration by:
-  * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Creating a custom configuration file for your model and specifying its path via the environment variable `VEC_INF_MODEL_CONFIG` (this supersedes `VEC_INF_CONFIG_DIR` if both are set). Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Adding your model configuration to the cached `models.yaml` in your cluster environment (if you have write access to the cached configuration directory).
   * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command](#list-command).
 
@@ -85,10 +86,10 @@ models:
     --max-num-seqs: 256
 ```
 
-You would then set the `VEC_INF_CONFIG` path using:
+You would then set the `VEC_INF_MODEL_CONFIG` path using:
 
 ```bash
-export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+export VEC_INF_MODEL_CONFIG=/h/<username>/my-model-config.yaml
 ```
 
 **NOTE**

tests/test_imports.py

Lines changed: 2 additions & 2 deletions

@@ -23,11 +23,11 @@ def test_imports(self):
             import vec_inf.client._helper
             import vec_inf.client._slurm_script_generator
             import vec_inf.client._slurm_templates
+            import vec_inf.client._slurm_vars
             import vec_inf.client._utils
             import vec_inf.client.api
             import vec_inf.client.config
-            import vec_inf.client.models
-            import vec_inf.client.slurm_vars  # noqa: F401
+            import vec_inf.client.models  # noqa: F401
 
         except ImportError as e:
             pytest.fail(f"Import failed: {e}")

tests/vec_inf/client/test_utils.py

Lines changed: 12 additions & 7 deletions

@@ -151,9 +151,11 @@ def test_load_config_default_only():
 
 def test_load_config_with_user_override(tmp_path, monkeypatch):
     """Test user config overriding default values."""
-    # Create user config with override and new model
-    user_config = tmp_path / "user_config.yaml"
-    user_config.write_text("""\
+    # Create user config directory and file
+    user_config_dir = tmp_path / "user_config_dir"
+    user_config_dir.mkdir()
+    user_config_file = user_config_dir / "models.yaml"
+    user_config_file.write_text("""\
 models:
   c4ai-command-r-plus-08-2024:
     gpus_per_node: 8
@@ -168,7 +170,7 @@ def test_load_config_with_user_override(tmp_path, monkeypatch):
 """)
 
     with monkeypatch.context() as m:
-        m.setenv("VEC_INF_CONFIG", str(user_config))
+        m.setenv("VEC_INF_CONFIG_DIR", str(user_config_dir))
         configs = load_config()
         config_map = {m.model_name: m for m in configs}
 
@@ -188,8 +190,11 @@ def test_load_config_with_user_override(tmp_path, monkeypatch):
 
 def test_load_config_invalid_user_model(tmp_path):
     """Test validation of user-provided model configurations."""
-    invalid_config = tmp_path / "bad_config.yaml"
-    invalid_config.write_text("""\
+    # Create user config directory and file
+    invalid_config_dir = tmp_path / "bad_config_dir"
+    invalid_config_dir.mkdir()
+    invalid_config_file = invalid_config_dir / "models.yaml"
+    invalid_config_file.write_text("""\
 models:
   invalid-model:
     model_family: ""
@@ -200,7 +205,7 @@ def test_load_config_invalid_user_model(tmp_path):
 
     with (
         pytest.raises(ValueError) as excinfo,
-        patch.dict(os.environ, {"VEC_INF_CONFIG": str(invalid_config)}),
+        patch.dict(os.environ, {"VEC_INF_CONFIG_DIR": str(invalid_config_dir)}),
     ):
         load_config()

vec_inf/client/_slurm_templates.py

Lines changed: 5 additions & 4 deletions

@@ -6,10 +6,11 @@
 
 from typing import TypedDict
 
-from vec_inf.client.slurm_vars import (
+from vec_inf.client._slurm_vars import (
     LD_LIBRARY_PATH,
     SINGULARITY_IMAGE,
     SINGULARITY_LOAD_CMD,
+    SINGULARITY_MODULE_NAME,
     VLLM_NCCL_SO_PATH,
 )
 
@@ -93,14 +94,14 @@ class SlurmScriptTemplate(TypedDict):
     },
     "singularity_setup": [
         SINGULARITY_LOAD_CMD,
-        f"singularity exec {SINGULARITY_IMAGE} ray stop",
+        f"{SINGULARITY_MODULE_NAME} exec {SINGULARITY_IMAGE} ray stop",
     ],
     "imports": "source {src_dir}/find_port.sh",
     "env_vars": [
         f"export LD_LIBRARY_PATH={LD_LIBRARY_PATH}",
         f"export VLLM_NCCL_SO_PATH={VLLM_NCCL_SO_PATH}",
     ],
-    "singularity_command": f"singularity exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
+    "singularity_command": f"{SINGULARITY_MODULE_NAME} exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
     "activate_venv": "source {venv}/bin/activate",
     "server_setup": {
         "single_node": [
@@ -240,7 +241,7 @@ class BatchModelLaunchScriptTemplate(TypedDict):
         ' "$json_path" > temp_{model_name}.json \\',
         ' && mv temp_{model_name}.json "$json_path"\n',
     ],
-    "singularity_command": f"singularity exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
+    "singularity_command": f"{SINGULARITY_MODULE_NAME} exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
     "launch_cmd": [
         "vllm serve {model_weights_path} \\",
         "    --served-model-name {model_name} \\",
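The templates above bake cluster constants into the command string at import time via f-strings, while per-launch values such as `{model_weights_path}` survive as `str.format` placeholders thanks to the doubled braces. A minimal, self-contained sketch of this two-stage substitution (the module name and image path are placeholder values, not the real cluster config):

```python
# Two-stage substitution sketch: stage 1 (f-string) happens at import time,
# stage 2 (str.format) happens when a launch script is generated.
# SINGULARITY_MODULE_NAME / SINGULARITY_IMAGE are hypothetical placeholders.
SINGULARITY_MODULE_NAME = "apptainer"
SINGULARITY_IMAGE = "/opt/images/vllm.sif"

# Stage 1: doubled braces {{...}} survive the f-string as literal {placeholders}.
singularity_command = (
    f"{SINGULARITY_MODULE_NAME} exec --nv "
    f"--bind {{model_weights_path}}{{additional_binds}} "
    f"--containall {SINGULARITY_IMAGE} \\"
)

# Stage 2: fill the per-launch placeholders when rendering the script.
rendered = singularity_command.format(
    model_weights_path="/model-weights/my-model",
    additional_binds="",
)
print(rendered)
# -> apptainer exec --nv --bind /model-weights/my-model --containall /opt/images/vllm.sif \
```

This is why the constants appear bare inside the f-strings while launch-time fields keep their braces: each formatting pass consumes one level of braces.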

vec_inf/client/_slurm_vars.py

Lines changed: 73 additions & 0 deletions

@@ -0,0 +1,73 @@
+"""Slurm cluster configuration variables."""
+
+import os
+import warnings
+from pathlib import Path
+from typing import Any, TypeAlias
+
+import yaml
+from typing_extensions import Literal
+
+
+CACHED_CONFIG_DIR = Path("/model-weights/vec-inf-shared")
+
+
+def load_env_config() -> dict[str, Any]:
+    """Load the environment configuration."""
+
+    def load_yaml_config(path: Path) -> dict[str, Any]:
+        """Load YAML config with error handling."""
+        try:
+            with path.open() as f:
+                return yaml.safe_load(f) or {}
+        except FileNotFoundError as err:
+            raise FileNotFoundError(f"Could not find config: {path}") from err
+        except yaml.YAMLError as err:
+            raise ValueError(f"Error parsing YAML config at {path}: {err}") from err
+
+    cached_config_path = CACHED_CONFIG_DIR / "environment.yaml"
+    default_path = (
+        cached_config_path
+        if cached_config_path.exists()
+        else Path(__file__).resolve().parent.parent / "config" / "environment.yaml"
+    )
+    config = load_yaml_config(default_path)
+
+    user_path = os.getenv("VEC_INF_CONFIG_DIR")
+    if user_path:
+        user_path_obj = Path(user_path, "environment.yaml")
+        if user_path_obj.exists():
+            user_config = load_yaml_config(user_path_obj)
+            config.update(user_config)
+        else:
+            warnings.warn(
+                f"WARNING: Could not find user config directory: {user_path}, revert to default config located at {default_path}",
+                UserWarning,
+                stacklevel=2,
+            )
+
+    return config
+
+
+_config = load_env_config()
+
+# Extract path values
+LD_LIBRARY_PATH = _config["paths"]["ld_library_path"]
+SINGULARITY_IMAGE = _config["paths"]["image_path"]
+VLLM_NCCL_SO_PATH = _config["paths"]["vllm_nccl_so_path"]
+
+# Extract containerization info
+SINGULARITY_LOAD_CMD = _config["containerization"]["module_load_cmd"]
+SINGULARITY_MODULE_NAME = _config["containerization"]["module_name"]
+
+# Extract limits
+MAX_GPUS_PER_NODE = _config["limits"]["max_gpus_per_node"]
+MAX_NUM_NODES = _config["limits"]["max_num_nodes"]
+MAX_CPUS_PER_TASK = _config["limits"]["max_cpus_per_task"]
+
+# Create dynamic Literal types
+QOS: TypeAlias = Literal[tuple(_config["allowed_values"]["qos"])]  # type: ignore[valid-type]
+PARTITION: TypeAlias = Literal[tuple(_config["allowed_values"]["partition"])]  # type: ignore[valid-type]
+
+# Extract default arguments
+DEFAULT_ARGS = _config["default_args"]
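The dynamic `Literal[tuple(...)]` trick at the end of this file works because subscripting `Literal` with a tuple expands to one alternative per element, and the alternatives stay recoverable at runtime through `typing.get_args`. A small standalone sketch (the qos values here are hypothetical, not the real cluster's):

```python
from typing import Literal, get_args

# Pretend these came from environment.yaml's allowed_values section.
_allowed_qos = ["normal", "m2", "deadline"]

# Subscripting Literal with a tuple expands to Literal["normal", "m2", "deadline"].
QOS = Literal[tuple(_allowed_qos)]  # type: ignore[valid-type]


def validate_qos(value: str) -> str:
    # get_args recovers the configured alternatives at runtime, so the same
    # type alias can drive both static checking and input validation.
    if value not in get_args(QOS):
        raise ValueError(f"qos must be one of {get_args(QOS)}, got {value!r}")
    return value


print(get_args(QOS))  # -> ('normal', 'm2', 'deadline')
```

Static type checkers cannot follow a `Literal` built from runtime values, which is what the `# type: ignore[valid-type]` comments in the file acknowledge; the runtime validation path still works.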

vec_inf/client/_utils.py

Lines changed: 44 additions & 27 deletions

@@ -15,9 +15,9 @@
 import yaml
 
 from vec_inf.client._client_vars import MODEL_READY_SIGNATURE
+from vec_inf.client._slurm_vars import CACHED_CONFIG_DIR
 from vec_inf.client.config import ModelConfig
 from vec_inf.client.models import ModelStatus
-from vec_inf.client.slurm_vars import CACHED_CONFIG
 
 
 def run_bash_command(command: str) -> tuple[str, str]:
@@ -217,44 +217,61 @@ def load_yaml_config(path: Path) -> dict[str, Any]:
         except yaml.YAMLError as err:
             raise ValueError(f"Error parsing YAML config at {path}: {err}") from err
 
-    # 1. If config_path is given, use only that
-    if config_path:
-        config = load_yaml_config(Path(config_path))
+    def process_config(config: dict[str, Any]) -> list[ModelConfig]:
+        """Process the config based on the config type."""
         return [
             ModelConfig(model_name=name, **model_data)
             for name, model_data in config.get("models", {}).items()
         ]
 
+    def resolve_config_path_from_env_var() -> Path | None:
+        """Resolve the config path from the environment variable."""
+        config_dir = os.getenv("VEC_INF_CONFIG_DIR")
+        config_path = os.getenv("VEC_INF_MODEL_CONFIG")
+        if config_path:
+            return Path(config_path)
+        if config_dir:
+            return Path(config_dir, "models.yaml")
+        return None
+
+    def update_config(
+        config: dict[str, Any], user_config: dict[str, Any]
+    ) -> dict[str, Any]:
+        """Update the config with the user config."""
+        for name, data in user_config.get("models", {}).items():
+            if name in config.get("models", {}):
+                config["models"][name].update(data)
+            else:
+                config.setdefault("models", {})[name] = data
+
+        return config
+
+    # 1. If config_path is given, use only that
+    if config_path:
+        config = load_yaml_config(Path(config_path))
+        return process_config(config)
+
     # 2. Otherwise, load default config
     default_path = (
-        CACHED_CONFIG
-        if CACHED_CONFIG.exists()
+        CACHED_CONFIG_DIR / "models_latest.yaml"
+        if CACHED_CONFIG_DIR.exists()
        else Path(__file__).resolve().parent.parent / "config" / "models.yaml"
     )
     config = load_yaml_config(default_path)
 
     # 3. If user config exists, merge it
-    user_path = os.getenv("VEC_INF_CONFIG")
-    if user_path:
-        user_path_obj = Path(user_path)
-        if user_path_obj.exists():
-            user_config = load_yaml_config(user_path_obj)
-            for name, data in user_config.get("models", {}).items():
-                if name in config.get("models", {}):
-                    config["models"][name].update(data)
-                else:
-                    config.setdefault("models", {})[name] = data
-        else:
-            warnings.warn(
-                f"WARNING: Could not find user config: {user_path}, revert to default config located at {default_path}",
-                UserWarning,
-                stacklevel=2,
-            )
-
-    return [
-        ModelConfig(model_name=name, **model_data)
-        for name, model_data in config.get("models", {}).items()
-    ]
+    user_path = resolve_config_path_from_env_var()
+    if user_path and user_path.exists():
+        user_config = load_yaml_config(user_path)
+        config = update_config(config, user_config)
+    elif user_path:
+        warnings.warn(
+            f"WARNING: Could not find user config: {str(user_path)}, revert to default config located at {default_path}",
+            UserWarning,
+            stacklevel=2,
+        )
+
+    return process_config(config)
 
 
 def parse_launch_output(output: str) -> tuple[str, dict[str, str]]:
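The merge semantics of the new `update_config` helper are worth spelling out: each model's dict is shallow-merged, so a user `models.yaml` can override individual fields of a known model without restating the rest, and unknown models are appended. A standalone sketch with made-up model entries:

```python
from typing import Any


def update_config(config: dict[str, Any], user_config: dict[str, Any]) -> dict[str, Any]:
    """Shallow-merge user model entries into the base config.

    Mirrors the helper added to vec_inf.client._utils in this commit.
    """
    for name, data in user_config.get("models", {}).items():
        if name in config.get("models", {}):
            config["models"][name].update(data)  # override only the given fields
        else:
            config.setdefault("models", {})[name] = data  # append a new model
    return config


# Hypothetical base and user configs:
base = {"models": {"model-a": {"gpus_per_node": 4, "qos": "normal"}}}
user = {"models": {"model-a": {"gpus_per_node": 8}, "model-b": {"gpus_per_node": 1}}}

merged = update_config(base, user)
print(merged["models"]["model-a"])  # -> {'gpus_per_node': 8, 'qos': 'normal'}
```

Note the merge is shallow: a nested dict field such as `vllm_args` would be replaced wholesale, not merged key-by-key.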

vec_inf/client/config.py

Lines changed: 1 addition & 2 deletions

@@ -10,7 +10,7 @@
 from pydantic import BaseModel, ConfigDict, Field
 from typing_extensions import Literal
 
-from vec_inf.client.slurm_vars import (
+from vec_inf.client._slurm_vars import (
     DEFAULT_ARGS,
     MAX_CPUS_PER_TASK,
     MAX_GPUS_PER_NODE,
@@ -132,7 +132,6 @@ class ModelConfig(BaseModel):
     vllm_args: Optional[dict[str, Any]] = Field(
         default={}, description="vLLM engine arguments"
     )
-
     model_config = ConfigDict(
         extra="forbid", str_strip_whitespace=True, validate_default=True, frozen=True
     )
