Commit aa7800f

Merge pull request #123 from VectorInstitute/feature/env_config
Decouple cluster-related environment variables from the package and host them in a configuration file instead.
2 parents b3db973 + 6a3a24c commit aa7800f

File tree

12 files changed: +186 −116 lines changed


README.md

Lines changed: 10 additions & 4 deletions

@@ -10,7 +10,7 @@
 [![vLLM](https://img.shields.io/badge/vllm-0.8.5.post1-blue)](https://docs.vllm.ai/en/v0.8.5.post1/index.html)
 ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)
 
-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, follow the instructions in [Installation](#installation).
 
 ## Installation
 If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install package:
@@ -20,6 +20,11 @@ pip install vec-inf
 ```
 Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
 
+If you'd like to use `vec-inf` on your own Slurm cluster, you will need to update the configuration files. There are 3 ways to do it:
+* Clone the repository, update the `environment.yaml` and `models.yaml` files in [`vec_inf/config`](vec_inf/config/), then install from source by running `pip install .`.
+* The package looks for cached configuration files in your environment before falling back to the default configuration. The default cached configuration directory is `/model-weights/vec-inf-shared`; create an `environment.yaml` and a `models.yaml` there, following the format of the files in [`vec_inf/config`](vec_inf/config/).
+* The package also checks the environment variable `VEC_INF_CONFIG_DIR`. Put your `environment.yaml` and `models.yaml` in a directory of your choice and set `VEC_INF_CONFIG_DIR` to point to that location.
+
 ## Usage
 
 Vector Inference provides 2 user interfaces, a CLI and an API
@@ -61,7 +66,8 @@ You can also launch your own custom model as long as the model architecture is [
 * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
 * You should specify your model configuration by:
-  * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Creating a custom configuration file for your model and specifying its path via the environment variable `VEC_INF_MODEL_CONFIG` (this supersedes `VEC_INF_CONFIG_DIR` if both are set). Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Adding your model configuration to the cached `models.yaml` in your cluster environment (if you have write access to the cached configuration directory).
   * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command](#list-command).
 
@@ -89,10 +95,10 @@ models:
     --compilation-config: 3
 ```
 
-You would then set the `VEC_INF_CONFIG` path using:
+You would then set the `VEC_INF_MODEL_CONFIG` path using:
 
 ```bash
-export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+export VEC_INF_MODEL_CONFIG=/h/<username>/my-model-config.yaml
 ```
 
 **NOTE**
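For readers adapting the package to their own cluster, the key names consumed by `vec_inf/client/_slurm_vars.py` (shown later in this diff) imply an `environment.yaml` of roughly the following shape. Only the key names come from the code; every value below is a made-up placeholder for illustration:

```yaml
# Sketch of an environment.yaml -- key names mirror the lookups in
# vec_inf/client/_slurm_vars.py; all values are hypothetical placeholders.
paths:
  ld_library_path: /usr/local/cuda/lib64       # becomes LD_LIBRARY_PATH
  image_path: /opt/images/vllm.sif             # becomes SINGULARITY_IMAGE
  vllm_nccl_so_path: /usr/lib/libnccl.so.2     # becomes VLLM_NCCL_SO_PATH
containerization:
  module_load_cmd: module load apptainer       # becomes SINGULARITY_LOAD_CMD
  module_name: apptainer                       # becomes SINGULARITY_MODULE_NAME
limits:
  max_gpus_per_node: 8
  max_num_nodes: 4
  max_cpus_per_task: 64
allowed_values:
  qos: [normal, high]
  partition: [a40, a100]
default_args: {}  # per-field defaults consumed by ModelConfig; schema not shown in this diff
```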

docs/index.md

Lines changed: 6 additions & 1 deletion

@@ -1,6 +1,6 @@
 # Vector Inference: Easy inference on Slurm clusters
 
-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/client/slurm_vars.py), and the model config for cached model weights in [`vec_inf/config/models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, follow the instructions in [Installation](#installation).
 
 ## Installation
 
@@ -11,3 +11,8 @@ pip install vec-inf
 ```
 
 Otherwise, we recommend using the provided [`Dockerfile`](https://github.com/VectorInstitute/vector-inference/blob/main/Dockerfile) to set up your own environment with the package. The latest image has `vLLM` version `0.8.5.post1`.
+
+If you'd like to use `vec-inf` on your own Slurm cluster, you will need to update the configuration files. There are 3 ways to do it:
+* Clone the repository, update the `environment.yaml` and `models.yaml` files in [`vec_inf/config`](vec_inf/config/), then install from source by running `pip install .`.
+* The package looks for cached configuration files in your environment before falling back to the default configuration. The default cached configuration directory is `/model-weights/vec-inf-shared`; create an `environment.yaml` and a `models.yaml` there, following the format of the files in [`vec_inf/config`](vec_inf/config/).
+* The package also checks the environment variable `VEC_INF_CONFIG_DIR`. Put your `environment.yaml` and `models.yaml` in a directory of your choice and set `VEC_INF_CONFIG_DIR` to point to that location.

docs/user_guide.md

Lines changed: 4 additions & 3 deletions

@@ -58,7 +58,8 @@ You can also launch your own custom model as long as the model architecture is [
 * Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` ($MODEL_VARIANT is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
 * You should specify your model configuration by:
-  * Creating a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Creating a custom configuration file for your model and specifying its path via the environment variable `VEC_INF_MODEL_CONFIG` (this supersedes `VEC_INF_CONFIG_DIR` if both are set). Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Adding your model configuration to the cached `models.yaml` in your cluster environment (if you have write access to the cached configuration directory).
   * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command](#list-command).
 
@@ -85,10 +86,10 @@ models:
     --max-num-seqs: 256
 ```
 
-You would then set the `VEC_INF_CONFIG` path using:
+You would then set the `VEC_INF_MODEL_CONFIG` path using:
 
 ```bash
-export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+export VEC_INF_MODEL_CONFIG=/h/<username>/my-model-config.yaml
 ```
 
 **NOTE**

tests/test_imports.py

Lines changed: 2 additions & 2 deletions

@@ -23,11 +23,11 @@ def test_imports(self):
             import vec_inf.client._helper
             import vec_inf.client._slurm_script_generator
             import vec_inf.client._slurm_templates
+            import vec_inf.client._slurm_vars
             import vec_inf.client._utils
             import vec_inf.client.api
             import vec_inf.client.config
-            import vec_inf.client.models
-            import vec_inf.client.slurm_vars  # noqa: F401
+            import vec_inf.client.models  # noqa: F401
 
         except ImportError as e:
             pytest.fail(f"Import failed: {e}")

tests/vec_inf/client/test_utils.py

Lines changed: 12 additions & 7 deletions

@@ -151,9 +151,11 @@ def test_load_config_default_only():
 
 def test_load_config_with_user_override(tmp_path, monkeypatch):
     """Test user config overriding default values."""
-    # Create user config with override and new model
-    user_config = tmp_path / "user_config.yaml"
-    user_config.write_text("""\
+    # Create user config directory and file
+    user_config_dir = tmp_path / "user_config_dir"
+    user_config_dir.mkdir()
+    user_config_file = user_config_dir / "models.yaml"
+    user_config_file.write_text("""\
 models:
   c4ai-command-r-plus-08-2024:
     gpus_per_node: 8
@@ -168,7 +170,7 @@ def test_load_config_with_user_override(tmp_path, monkeypatch):
 """)
 
     with monkeypatch.context() as m:
-        m.setenv("VEC_INF_CONFIG", str(user_config))
+        m.setenv("VEC_INF_CONFIG_DIR", str(user_config_dir))
         configs = load_config()
         config_map = {m.model_name: m for m in configs}
 
@@ -188,8 +190,11 @@ def test_load_config_with_user_override(tmp_path, monkeypatch):
 
 def test_load_config_invalid_user_model(tmp_path):
     """Test validation of user-provided model configurations."""
-    invalid_config = tmp_path / "bad_config.yaml"
-    invalid_config.write_text("""\
+    # Create user config directory and file
+    invalid_config_dir = tmp_path / "bad_config_dir"
+    invalid_config_dir.mkdir()
+    invalid_config_file = invalid_config_dir / "models.yaml"
+    invalid_config_file.write_text("""\
 models:
   invalid-model:
     model_family: ""
@@ -200,7 +205,7 @@ def test_load_config_invalid_user_model(tmp_path):
 
     with (
         pytest.raises(ValueError) as excinfo,
-        patch.dict(os.environ, {"VEC_INF_CONFIG": str(invalid_config)}),
+        patch.dict(os.environ, {"VEC_INF_CONFIG_DIR": str(invalid_config_dir)}),
     ):
         load_config()

vec_inf/client/_slurm_templates.py

Lines changed: 5 additions & 4 deletions

@@ -6,10 +6,11 @@
 
 from typing import TypedDict
 
-from vec_inf.client.slurm_vars import (
+from vec_inf.client._slurm_vars import (
     LD_LIBRARY_PATH,
     SINGULARITY_IMAGE,
     SINGULARITY_LOAD_CMD,
+    SINGULARITY_MODULE_NAME,
     VLLM_NCCL_SO_PATH,
 )
 
@@ -93,14 +94,14 @@ class SlurmScriptTemplate(TypedDict):
     },
     "singularity_setup": [
         SINGULARITY_LOAD_CMD,
-        f"singularity exec {SINGULARITY_IMAGE} ray stop",
+        f"{SINGULARITY_MODULE_NAME} exec {SINGULARITY_IMAGE} ray stop",
     ],
     "imports": "source {src_dir}/find_port.sh",
     "env_vars": [
         f"export LD_LIBRARY_PATH={LD_LIBRARY_PATH}",
         f"export VLLM_NCCL_SO_PATH={VLLM_NCCL_SO_PATH}",
     ],
-    "singularity_command": f"singularity exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
+    "singularity_command": f"{SINGULARITY_MODULE_NAME} exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
     "activate_venv": "source {venv}/bin/activate",
     "server_setup": {
         "single_node": [
@@ -240,7 +241,7 @@ class BatchModelLaunchScriptTemplate(TypedDict):
         ' "$json_path" > temp_{model_name}.json \\',
         ' && mv temp_{model_name}.json "$json_path"\n',
     ],
-    "singularity_command": f"singularity exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
+    "singularity_command": f"{SINGULARITY_MODULE_NAME} exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE} \\",
     "launch_cmd": [
         "vllm serve {model_weights_path} \\",
         "    --served-model-name {model_name} \\",
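The templates above bake cluster constants into the command string at import time via f-strings, while per-launch values such as `{model_weights_path}` survive as `str.format` placeholders thanks to the doubled braces. A minimal, self-contained sketch of this two-stage substitution (the module name and image path are placeholder values, not the real cluster config):

```python
# Two-stage substitution sketch: stage 1 (f-string) happens at import time,
# stage 2 (str.format) happens when a launch script is generated.
# SINGULARITY_MODULE_NAME / SINGULARITY_IMAGE are hypothetical placeholders.
SINGULARITY_MODULE_NAME = "apptainer"
SINGULARITY_IMAGE = "/opt/images/vllm.sif"

# Stage 1: doubled braces {{...}} survive the f-string as literal {placeholders}.
singularity_command = (
    f"{SINGULARITY_MODULE_NAME} exec --nv "
    f"--bind {{model_weights_path}}{{additional_binds}} "
    f"--containall {SINGULARITY_IMAGE} \\"
)

# Stage 2: fill the per-launch placeholders when rendering the script.
rendered = singularity_command.format(
    model_weights_path="/model-weights/my-model",
    additional_binds="",
)
print(rendered)
# -> apptainer exec --nv --bind /model-weights/my-model --containall /opt/images/vllm.sif \
```

This is why the constants appear bare inside the f-strings while launch-time fields keep their braces: each formatting pass consumes one level of braces.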

vec_inf/client/_slurm_vars.py

Lines changed: 73 additions & 0 deletions

@@ -0,0 +1,73 @@
+"""Slurm cluster configuration variables."""
+
+import os
+import warnings
+from pathlib import Path
+from typing import Any, TypeAlias
+
+import yaml
+from typing_extensions import Literal
+
+
+CACHED_CONFIG_DIR = Path("/model-weights/vec-inf-shared")
+
+
+def load_env_config() -> dict[str, Any]:
+    """Load the environment configuration."""
+
+    def load_yaml_config(path: Path) -> dict[str, Any]:
+        """Load YAML config with error handling."""
+        try:
+            with path.open() as f:
+                return yaml.safe_load(f) or {}
+        except FileNotFoundError as err:
+            raise FileNotFoundError(f"Could not find config: {path}") from err
+        except yaml.YAMLError as err:
+            raise ValueError(f"Error parsing YAML config at {path}: {err}") from err
+
+    cached_config_path = CACHED_CONFIG_DIR / "environment.yaml"
+    default_path = (
+        cached_config_path
+        if cached_config_path.exists()
+        else Path(__file__).resolve().parent.parent / "config" / "environment.yaml"
+    )
+    config = load_yaml_config(default_path)
+
+    user_path = os.getenv("VEC_INF_CONFIG_DIR")
+    if user_path:
+        user_path_obj = Path(user_path, "environment.yaml")
+        if user_path_obj.exists():
+            user_config = load_yaml_config(user_path_obj)
+            config.update(user_config)
+        else:
+            warnings.warn(
+                f"WARNING: Could not find user config directory: {user_path}, revert to default config located at {default_path}",
+                UserWarning,
+                stacklevel=2,
+            )
+
+    return config
+
+
+_config = load_env_config()
+
+# Extract path values
+LD_LIBRARY_PATH = _config["paths"]["ld_library_path"]
+SINGULARITY_IMAGE = _config["paths"]["image_path"]
+VLLM_NCCL_SO_PATH = _config["paths"]["vllm_nccl_so_path"]
+
+# Extract containerization info
+SINGULARITY_LOAD_CMD = _config["containerization"]["module_load_cmd"]
+SINGULARITY_MODULE_NAME = _config["containerization"]["module_name"]
+
+# Extract limits
+MAX_GPUS_PER_NODE = _config["limits"]["max_gpus_per_node"]
+MAX_NUM_NODES = _config["limits"]["max_num_nodes"]
+MAX_CPUS_PER_TASK = _config["limits"]["max_cpus_per_task"]
+
+# Create dynamic Literal types
+QOS: TypeAlias = Literal[tuple(_config["allowed_values"]["qos"])]  # type: ignore[valid-type]
+PARTITION: TypeAlias = Literal[tuple(_config["allowed_values"]["partition"])]  # type: ignore[valid-type]
+
+# Extract default arguments
+DEFAULT_ARGS = _config["default_args"]
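The dynamic `Literal[tuple(...)]` trick at the end of this file works because subscripting `Literal` with a tuple expands to one alternative per element, and the alternatives stay recoverable at runtime through `typing.get_args`. A small standalone sketch (the qos values here are hypothetical, not the real cluster's):

```python
from typing import Literal, get_args

# Pretend these came from environment.yaml's allowed_values section.
_allowed_qos = ["normal", "m2", "deadline"]

# Subscripting Literal with a tuple expands to Literal["normal", "m2", "deadline"].
QOS = Literal[tuple(_allowed_qos)]  # type: ignore[valid-type]


def validate_qos(value: str) -> str:
    # get_args recovers the configured alternatives at runtime, so the same
    # type alias can drive both static checking and input validation.
    if value not in get_args(QOS):
        raise ValueError(f"qos must be one of {get_args(QOS)}, got {value!r}")
    return value


print(get_args(QOS))  # -> ('normal', 'm2', 'deadline')
```

Static type checkers cannot follow a `Literal` built from runtime values, which is what the `# type: ignore[valid-type]` comments in the file acknowledge; the runtime validation path still works.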

vec_inf/client/_utils.py

Lines changed: 44 additions & 27 deletions

@@ -15,9 +15,9 @@
 import yaml
 
 from vec_inf.client._client_vars import MODEL_READY_SIGNATURE
+from vec_inf.client._slurm_vars import CACHED_CONFIG_DIR
 from vec_inf.client.config import ModelConfig
 from vec_inf.client.models import ModelStatus
-from vec_inf.client.slurm_vars import CACHED_CONFIG
 
 
 def run_bash_command(command: str) -> tuple[str, str]:
@@ -217,44 +217,61 @@ def load_yaml_config(path: Path) -> dict[str, Any]:
         except yaml.YAMLError as err:
             raise ValueError(f"Error parsing YAML config at {path}: {err}") from err
 
-    # 1. If config_path is given, use only that
-    if config_path:
-        config = load_yaml_config(Path(config_path))
+    def process_config(config: dict[str, Any]) -> list[ModelConfig]:
+        """Process the config based on the config type."""
         return [
             ModelConfig(model_name=name, **model_data)
             for name, model_data in config.get("models", {}).items()
         ]
 
+    def resolve_config_path_from_env_var() -> Path | None:
+        """Resolve the config path from the environment variable."""
+        config_dir = os.getenv("VEC_INF_CONFIG_DIR")
+        config_path = os.getenv("VEC_INF_MODEL_CONFIG")
+        if config_path:
+            return Path(config_path)
+        if config_dir:
+            return Path(config_dir, "models.yaml")
+        return None
+
+    def update_config(
+        config: dict[str, Any], user_config: dict[str, Any]
+    ) -> dict[str, Any]:
+        """Update the config with the user config."""
+        for name, data in user_config.get("models", {}).items():
+            if name in config.get("models", {}):
+                config["models"][name].update(data)
+            else:
+                config.setdefault("models", {})[name] = data
+
+        return config
+
+    # 1. If config_path is given, use only that
+    if config_path:
+        config = load_yaml_config(Path(config_path))
+        return process_config(config)
+
     # 2. Otherwise, load default config
     default_path = (
-        CACHED_CONFIG
-        if CACHED_CONFIG.exists()
+        CACHED_CONFIG_DIR / "models_latest.yaml"
+        if CACHED_CONFIG_DIR.exists()
        else Path(__file__).resolve().parent.parent / "config" / "models.yaml"
     )
     config = load_yaml_config(default_path)
 
     # 3. If user config exists, merge it
-    user_path = os.getenv("VEC_INF_CONFIG")
-    if user_path:
-        user_path_obj = Path(user_path)
-        if user_path_obj.exists():
-            user_config = load_yaml_config(user_path_obj)
-            for name, data in user_config.get("models", {}).items():
-                if name in config.get("models", {}):
-                    config["models"][name].update(data)
-                else:
-                    config.setdefault("models", {})[name] = data
-        else:
-            warnings.warn(
-                f"WARNING: Could not find user config: {user_path}, revert to default config located at {default_path}",
-                UserWarning,
-                stacklevel=2,
-            )
-
-    return [
-        ModelConfig(model_name=name, **model_data)
-        for name, model_data in config.get("models", {}).items()
-    ]
+    user_path = resolve_config_path_from_env_var()
+    if user_path and user_path.exists():
+        user_config = load_yaml_config(user_path)
+        config = update_config(config, user_config)
+    elif user_path:
+        warnings.warn(
+            f"WARNING: Could not find user config: {str(user_path)}, revert to default config located at {default_path}",
+            UserWarning,
+            stacklevel=2,
+        )
+
+    return process_config(config)
 
 
 def parse_launch_output(output: str) -> tuple[str, dict[str, str]]:
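The merge semantics of the new `update_config` helper are worth spelling out: each model's dict is shallow-merged, so a user `models.yaml` can override individual fields of a known model without restating the rest, and unknown models are appended. A standalone sketch with made-up model entries:

```python
from typing import Any


def update_config(config: dict[str, Any], user_config: dict[str, Any]) -> dict[str, Any]:
    """Shallow-merge user model entries into the base config.

    Mirrors the helper added to vec_inf.client._utils in this commit.
    """
    for name, data in user_config.get("models", {}).items():
        if name in config.get("models", {}):
            config["models"][name].update(data)  # override only the given fields
        else:
            config.setdefault("models", {})[name] = data  # append a new model
    return config


# Hypothetical base and user configs:
base = {"models": {"model-a": {"gpus_per_node": 4, "qos": "normal"}}}
user = {"models": {"model-a": {"gpus_per_node": 8}, "model-b": {"gpus_per_node": 1}}}

merged = update_config(base, user)
print(merged["models"]["model-a"])  # -> {'gpus_per_node': 8, 'qos': 'normal'}
```

Note the merge is shallow: a nested dict field such as `vllm_args` would be replaced wholesale, not merged key-by-key.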

vec_inf/client/config.py

Lines changed: 1 addition & 2 deletions

@@ -10,7 +10,7 @@
 from pydantic import BaseModel, ConfigDict, Field
 from typing_extensions import Literal
 
-from vec_inf.client.slurm_vars import (
+from vec_inf.client._slurm_vars import (
     DEFAULT_ARGS,
     MAX_CPUS_PER_TASK,
     MAX_GPUS_PER_NODE,
@@ -132,7 +132,6 @@ class ModelConfig(BaseModel):
     vllm_args: Optional[dict[str, Any]] = Field(
         default={}, description="vLLM engine arguments"
     )
-
     model_config = ConfigDict(
         extra="forbid", str_strip_whitespace=True, validate_default=True, frozen=True
     )
