triton-inference-server
diff --git a/docs/config_search.md b/docs/config_search.md
Lines changed: 91 additions & 14 deletions
diff --git a/model_analyzer/config/generate/model_profile_spec.py b/model_analyzer/config/generate/model_profile_spec.py
Lines changed: 55 additions & 2 deletions
diff --git a/model_analyzer/config/generate/quick_run_config_generator.py b/model_analyzer/config/generate/quick_run_config_generator.py
Lines changed: 70 additions & 7 deletions
@@ -1,17 +1,6 @@
 <!--
-Copyright (c) 2020-2024, NVIDIA CORPORATION. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+SPDX-FileCopyrightText: Copyright (c) 2020-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
 -->
 
 # Table of Contents
@@ -240,7 +229,8 @@ manual sweep:
 
 _This mode has the following limitations:_
 
-- If model config parameters are specified, they can contain only one possible combination of parameters
+- Top-level models can contain only one possible combination of model config parameters
+- Composing models (ensemble/BLS sub-models) can specify parameter ranges if they follow specific patterns (see [Ensemble Composing Model Parameter Ranges](#ensemble-composing-model-parameter-ranges))
 
 This mode uses a hill climbing algorithm to search the configuration space, looking for
 the maximal objective value within the specified constraints. In the majority of cases
@@ -262,6 +252,85 @@ profile_models:
 
 ---
 
+### **Ensemble Composing Model Parameter Ranges**
+
+When profiling ensemble or BLS models in Quick search mode, composing models (sub-models) can specify parameter ranges for `instance_group` count. This enables optimization of composing models with different resource requirements, such as:
+
+- CPU-bound models (tokenizers, preprocessors) that may benefit from higher instance counts
+- GPU-bound models (inference models, embeddings) with limited GPU memory
+
+**Supported Instance Count Patterns:**
+
+Model Analyzer supports two types of instance count sequences that map to Quick search's coordinate system:
+
+1. **Powers of 2**: `[1, 2, 4, 8, 16, 32]` or subsets like `[2, 4, 8]`
+   - Maps to exponential search dimensions
+   - Recommended for most use cases
+
+2. **Contiguous sequences**: `[1, 2, 3, 4, 5]` or ranges like `[5, 6, 7, 8]`
+   - Maps to linear search dimensions
+   - Useful for fine-grained control
+
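One way to picture the two dimension types is as a mapping from a search coordinate (0, 1, 2, ...) to an instance count. The sketch below is illustrative only: `instance_count_at`, `first_count`, and `exponential` are hypothetical names, and how Model Analyzer encodes these dimensions internally is not shown in this change.

```python
def instance_count_at(index: int, first_count: int, exponential: bool) -> int:
    """Hypothetical mapping from a search coordinate to an instance count."""
    if index < 0 or first_count < 1:
        raise ValueError("index must be >= 0 and first_count must be >= 1")
    # Exponential dimension: each step doubles the count.
    # Linear dimension: each step adds one instance.
    return first_count * 2**index if exponential else first_count + index


# Reproduce the two example sequences from the list above.
powers = [instance_count_at(i, first_count=2, exponential=True) for i in range(3)]
contiguous = [instance_count_at(i, first_count=5, exponential=False) for i in range(4)]
print(powers)      # [2, 4, 8]
print(contiguous)  # [5, 6, 7, 8]
```

Under this reading, a powers-of-2 list needs far fewer coordinates to span a wide range, which may be why it is the recommended pattern.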
+**Important Notes:**
+
+- Only composing models can specify instance count ranges in Quick mode
+- Top-level models (non-composing) must still have a single parameter combination
+- Only powers of 2 or contiguous sequences are supported; arbitrary value lists (e.g., `[1, 3, 7, 15]`) are not supported
+- Composing models are identified using the `ensemble_composing_models`, `bls_composing_models`, or `cpu_only_composing_models` configuration options
+
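The supported patterns can be checked before launching a profile run. The following is a minimal sketch, not Model Analyzer's own validation: `classify_instance_counts` is a hypothetical helper, and it assumes that a powers-of-2 list must be ascending with no skipped powers (the documentation only shows examples of that shape). Lists that fit both patterns, such as `[1, 2]`, are arbitrarily reported as powers of 2.

```python
from typing import List, Optional


def classify_instance_counts(counts: List[int]) -> Optional[str]:
    """Hypothetical helper: return "powers_of_2", "contiguous", or None."""
    # Require a non-empty, strictly ascending list of positive integers.
    if not counts or any(c < 1 for c in counts) or counts != sorted(set(counts)):
        return None

    pairs = list(zip(counts, counts[1:]))

    # Assumption: every value is a power of 2 and each value doubles the previous one.
    all_powers = all(c & (c - 1) == 0 for c in counts)
    if all_powers and all(b == 2 * a for a, b in pairs):
        return "powers_of_2"

    # Contiguous: each value is exactly one more than the previous one.
    if all(b == a + 1 for a, b in pairs):
        return "contiguous"

    return None


for candidate in ([1, 2, 4, 8, 16, 32], [5, 6, 7, 8], [1, 3, 7, 15]):
    print(candidate, "->", classify_instance_counts(candidate))
```

The last candidate is the unsupported example from the notes above and is classified as `None`.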
+---
+
+_An example with an ensemble containing a CPU tokenizer and a GPU inference model:_
+
+```yaml
+model_repository: /path/to/model/repository/
+
+run_config_search_mode: quick
+export_path: /tmp/results
+override_output_model_repository: true
+
+profile_models:
+  - ensemble_model
+
+ensemble_composing_models:
+  tokenizer:
+    model_config_parameters:
+      instance_group:
+        - kind: KIND_CPU
+          count: [1, 2, 4, 8, 16, 32]  # Powers of 2 sequence
+      dynamic_batching:
+        max_queue_delay_microseconds: [0]
+  inference_model:
+    model_config_parameters:
+      instance_group:
+        - kind: KIND_GPU
+          count: [1, 2, 4, 8]  # Subset of powers of 2
+      dynamic_batching:
+        max_queue_delay_microseconds: [0]
+```
+
+In this example:
+- Only the ensemble is listed in `profile_models`; composing models are auto-discovered from the ensemble's `ensemble_scheduling` configuration
+- The `ensemble_composing_models` section provides configurations for auto-discovered models
+- The tokenizer (CPU model) searches instance counts from 1 to 32
+- The inference model (GPU model) searches instance counts from 1 to 8
+- Quick search explores both dimensions in parallel to find the optimal combination
+- The ensemble model itself uses default parameters
+- Any models specified in `ensemble_composing_models` that don't exist in the ensemble will be ignored with a warning
+
+**Instance Group Kind:**
+
+The `kind` field (`KIND_CPU` or `KIND_GPU`) in `instance_group` is respected when explicitly specified. This allows you to control whether a model runs on CPU or GPU directly in the config without needing the separate `cpu_only_composing_models` option.
+
+Priority order for determining instance kind:
+1. **Explicit `kind` in `instance_group`** (highest priority) - if you specify `kind: KIND_CPU` or `kind: KIND_GPU`, that value is used
+2. **`cpu_only_composing_models` config** - models listed here will use `KIND_CPU`
+3. **Default to `KIND_GPU`** (lowest priority) - if neither is specified, models default to GPU instances
+
+This means you can override `cpu_only_composing_models` for a given model by explicitly specifying `kind: KIND_GPU` in its `instance_group`.
+
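The priority order can be expressed as a small decision function. This is a sketch under simplifying assumptions: `resolve_instance_kind` is a hypothetical name, the explicit kind is assumed to be already extracted from `instance_group` (or `None` when absent), and the model names are made up for illustration.

```python
from typing import Iterable, Optional


def resolve_instance_kind(
    explicit_kind: Optional[str],
    model_name: str,
    cpu_only_composing_models: Iterable[str],
) -> str:
    """Hypothetical helper applying the three-level priority order."""
    # 1) An explicit kind in instance_group always wins.
    if explicit_kind in ("KIND_CPU", "KIND_GPU"):
        return explicit_kind
    # 2) Otherwise, fall back to the cpu_only_composing_models list.
    if model_name in cpu_only_composing_models:
        return "KIND_CPU"
    # 3) Otherwise, default to GPU instances.
    return "KIND_GPU"


cpu_only = ["tokenizer"]
print(resolve_instance_kind("KIND_GPU", "tokenizer", cpu_only))  # KIND_GPU
print(resolve_instance_kind(None, "tokenizer", cpu_only))        # KIND_CPU
print(resolve_instance_kind(None, "inference_model", cpu_only))  # KIND_GPU
```

The first call shows the override case: the model is listed as CPU-only, but the explicit `KIND_GPU` takes precedence.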
+---
+
 ### **Limiting Batch Size, Instance Group, and Client Concurrency**
 
 Using the `--run-config-search-<min/max>...` config options you have the ability to clamp the algorithm's upper or lower bounds for the model's batch size and instance group count, as well as the client's request concurrency.
@@ -398,6 +467,10 @@ _This mode has the following limitations:_
 
 Ensemble models can be optimized using the Quick Search mode's hill climbing algorithm to search the composing models' configuration spaces in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes towards finding the maximum objective value; with runtimes under one hour (compared to the days it would take a brute force run to complete) for ensembles that contain up to four composing models.
 
+**Composing Model Parameter Ranges:**
+
+Composing models within ensembles can specify instance count ranges to optimize models with different resource requirements (e.g., CPU tokenizers vs GPU inference models). See [Ensemble Composing Model Parameter Ranges](#ensemble-composing-model-parameter-ranges) for details on supported patterns and configuration examples.
+
 After Model Analyzer has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the concurrency range before generation of the summary reports.
 
 ---
@@ -412,6 +485,10 @@ _This mode has the following limitations:_
 
 BLS models can be optimized using the Quick Search mode's hill climbing algorithm to search the BLS composing models' configuration spaces, as well as the BLS model's instance count, in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes towards finding the maximum objective value; with runtimes under one hour (compared to the days it would take a brute force run to complete) for BLS models that contain up to four composing models.
 
+**Composing Model Parameter Ranges:**
+
+BLS composing models can specify instance count ranges to optimize models with different resource requirements. Models are identified using the `bls_composing_models` configuration parameter. See [Ensemble Composing Model Parameter Ranges](#ensemble-composing-model-parameter-ranges) for details on supported patterns and configuration examples.
+
 After Model Analyzer has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the concurrency range before generation of the summary reports.
 
 ---
 
@@ -3,7 +3,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 from copy import deepcopy
-from typing import List
+from typing import List, Optional
 
 from model_analyzer.config.input.config_command_profile import ConfigCommandProfile
 from model_analyzer.config.input.objects.config_model_profile_spec import (
@@ -34,8 +34,61 @@ def __init__(
             config, client, gpus, config.model_repository, spec.model_name()
         )
 
-        if spec.model_name() in config.cpu_only_composing_models:
+        # Determine if model should be CPU-only
+        # Priority: 1) User-specified kind in instance_group (highest)
+        #           2) cpu_only_composing_models config
+        #           3) Default to GPU (lowest)
+        explicit_kind = self._get_explicit_instance_kind(spec)
+        if explicit_kind == "KIND_CPU":
             self._cpu_only = True
+        elif explicit_kind == "KIND_GPU":
+            # Explicit GPU overrides cpu_only_composing_models
+            self._cpu_only = False
+        elif spec.model_name() in config.cpu_only_composing_models:
+            self._cpu_only = True
+        # Otherwise _cpu_only remains as inherited from spec (default False)
+
+    @staticmethod
+    def _get_explicit_instance_kind(spec: ConfigModelProfileSpec) -> Optional[str]:
+        """
+        Check if the spec has an explicit kind specified in instance_group.
+
+        Returns the kind if explicitly specified, None otherwise.
+        This allows users to specify KIND_CPU or KIND_GPU directly in
+        model_config_parameters.instance_group instead of using the
+        separate cpu_only_composing_models config option.
+
+        The config parser may wrap values in lists for sweep support, so we need
+        to handle structures like:
+        - [[{'count': [1, 2, 4], 'kind': ['KIND_CPU']}]]  (double-wrapped, kind is list)
+        - [{'count': [1, 2, 4], 'kind': 'KIND_CPU'}]      (single-wrapped, kind is string)
+        """
+        model_config_params = spec.model_config_parameters()
+        if model_config_params is None:
+            return None
+
+        instance_group = model_config_params.get("instance_group")
+        if not isinstance(instance_group, list):
+            return None
+
+        # instance_group structure can be doubly wrapped due to config parsing:
+        # [[ {'kind': ['KIND_GPU'], 'count': [1, 2, 4]} ]]
+        # Unwrap the nested structure if needed
+        if len(instance_group) > 0 and isinstance(instance_group[0], list):
+            instance_group = instance_group[0]
+
+        # instance_group is now a list of dicts, each potentially containing 'kind'
+        for ig in instance_group:
+            if isinstance(ig, dict):
+                kind = ig.get("kind")
+                # Handle case where kind is wrapped in a list by config parser
+                # e.g., ['KIND_CPU'] instead of 'KIND_CPU'
+                if isinstance(kind, list) and len(kind) > 0:
+                    kind = kind[0]
+                if kind in ("KIND_CPU", "KIND_GPU"):
+                    return kind
+
+        return None
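
The two parser shapes listed in the docstring can be exercised in isolation. The sketch below mirrors the unwrapping logic on plain Python values; `explicit_kind` is a hypothetical standalone function that takes the `instance_group` value directly rather than a `ConfigModelProfileSpec`.

```python
from typing import Any, Optional

VALID_KINDS = ("KIND_CPU", "KIND_GPU")


def explicit_kind(instance_group: Any) -> Optional[str]:
    """Hypothetical standalone version of the kind lookup."""
    if not isinstance(instance_group, list):
        return None
    # Peel off one extra list layer, e.g. [[{...}]] -> [{...}].
    if instance_group and isinstance(instance_group[0], list):
        instance_group = instance_group[0]
    for group in instance_group:
        if not isinstance(group, dict):
            continue
        kind = group.get("kind")
        # The parser may wrap the value in a list: ['KIND_CPU'] -> 'KIND_CPU'.
        if isinstance(kind, list) and kind:
            kind = kind[0]
        if kind in VALID_KINDS:
            return kind
    return None


double_wrapped = [[{"count": [1, 2, 4], "kind": ["KIND_CPU"]}]]
single_wrapped = [{"count": [1, 2, 4], "kind": "KIND_CPU"}]
print(explicit_kind(double_wrapped))  # KIND_CPU
print(explicit_kind(single_wrapped))  # KIND_CPU
print(explicit_kind([{"count": [1, 2]}]))  # None
```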
 
     def get_default_config(self) -> dict:
         """Returns the default configuration for this model"""
 
@@ -25,6 +25,7 @@
 from model_analyzer.config.run.model_run_config import ModelRunConfig
 from model_analyzer.config.run.run_config import RunConfig
 from model_analyzer.constants import LOGGER_NAME
+from model_analyzer.model_analyzer_exceptions import TritonModelAnalyzerException
 from model_analyzer.perf_analyzer.perf_config import PerfAnalyzerConfig
 from model_analyzer.result.run_config_measurement import RunConfigMeasurement
 from model_analyzer.triton.model.model_config import ModelConfig
@@ -425,33 +426,53 @@ def _get_next_model_config_variant(
         )
 
         model_config_params = deepcopy(model.model_config_parameters())
+
+        # Extract user-specified instance_group kind before removing it
+        instance_kind = self._extract_instance_group_kind(model_config_params)
+        if not instance_kind:
+            # Fallback to cpu_only flag
+            instance_kind = "KIND_CPU" if model.cpu_only() else "KIND_GPU"
+
         if model_config_params:
+            # Remove parameters that are controlled by search dimensions
             model_config_params.pop("max_batch_size", None)
+            model_config_params.pop("instance_group", None)
 
-            # This is guaranteed to only generate one combination (check is in config_command)
+            # Generate combinations from remaining parameters
+            # For composing models, this may include dynamic_batching settings, etc.
             param_combos = GeneratorUtils.generate_combinations(model_config_params)
-            assert len(param_combos) == 1
 
-            param_combo = param_combos[0]
+            # Top-level models must have exactly 1 combination (validated earlier).
+            # Composing models must too, once the searchable params are removed.
+            if len(param_combos) > 1:
+                raise TritonModelAnalyzerException(
+                    f"Model {model.model_name()} has multiple parameter combinations "
+                    "after removing searchable parameters. This should have been caught "
+                    "during config validation."
+                )
+
+            param_combo = param_combos[0] if param_combos else {}
         else:
             param_combo = {}
 
-        kind = "KIND_CPU" if model.cpu_only() else "KIND_GPU"
+        # Add instance_group with count from dimension and kind from config
         instance_count = self._calculate_instance_count(dimension_values)
-
         param_combo["instance_group"] = [
             {
                 "count": instance_count,
-                "kind": kind,
+                "kind": instance_kind,
             }
         ]
 
+        # Add max_batch_size from dimension if applicable
         if "max_batch_size" in dimension_values:
             param_combo["max_batch_size"] = self._calculate_model_batch_size(
                 dimension_values
             )
 
-        if model.supports_dynamic_batching():
+        # Add default dynamic_batching if model supports it and not already specified
+        # Preserves user-specified dynamic_batching settings (single combinations only)
+        if model.supports_dynamic_batching() and "dynamic_batching" not in param_combo:
             param_combo["dynamic_batching"] = {}
 
         model_config_variant = BaseModelConfigGenerator.make_model_config_variant(
@@ -463,6 +484,48 @@ def _get_next_model_config_variant(
 
         return model_config_variant
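
The net effect of this hunk on a single model's parameters can be sketched with plain dictionaries. This is an illustration, not the generator itself: `build_param_combo` is a hypothetical function, it assumes the user parameters have already been reduced to one combination (the real code does this through `GeneratorUtils.generate_combinations`), and it takes the instance count and kind as inputs instead of deriving them from dimension values.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def build_param_combo(
    user_params: Optional[Dict[str, Any]],
    instance_count: int,
    instance_kind: str,
    supports_dynamic_batching: bool,
) -> Dict[str, Any]:
    """Hypothetical sketch of how the final parameter combination is assembled."""
    combo = deepcopy(user_params) if user_params else {}
    # Searchable parameters are dropped; the search dimensions control them.
    combo.pop("max_batch_size", None)
    combo.pop("instance_group", None)
    # The count comes from the search, the kind from the user config or fallback.
    combo["instance_group"] = [{"count": instance_count, "kind": instance_kind}]
    # Only add a default dynamic_batching block when the user gave none.
    if supports_dynamic_batching and "dynamic_batching" not in combo:
        combo["dynamic_batching"] = {}
    return combo


user = {
    "instance_group": [{"kind": "KIND_CPU", "count": [1, 2, 4]}],
    "dynamic_batching": {"max_queue_delay_microseconds": 0},
}
print(build_param_combo(user, 4, "KIND_CPU", True))
```

In this example the user's `dynamic_batching` settings survive, while the `count` range is replaced by the single value chosen for the current search step.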
 
+    def _extract_instance_group_kind(self, model_config_params: dict) -> str:
+        """
+        Extract the 'kind' field from instance_group in model_config_parameters.
+
+        The config parser may wrap values in lists for sweep support, so we need
+        to handle structures like:
+        - [[{'count': [1, 2, 4], 'kind': ['KIND_CPU']}]]  (double-wrapped, kind is list)
+        - [{'count': [1, 2, 4], 'kind': 'KIND_CPU'}]      (single-wrapped, kind is string)
+
+        Returns empty string if not found or if instance_group is not specified.
+        """
+        if not model_config_params or "instance_group" not in model_config_params:
+            return ""
+
+        instance_group = model_config_params["instance_group"]
+
+        # Handle various nested list structures from config parsing
+        if isinstance(instance_group, list) and len(instance_group) > 0:
+            # Handle nested structure: [[ {...} ]]
+            while (
+                isinstance(instance_group, list)
+                and len(instance_group) > 0
+                and isinstance(instance_group[0], list)
+            ):
+                instance_group = instance_group[0]
+
+            # Now should have [{...}] structure
+            if (
+                isinstance(instance_group, list)
+                and len(instance_group) > 0
+                and isinstance(instance_group[0], dict)
+            ):
+                kind = instance_group[0].get("kind", "")
+                # Handle case where kind is wrapped in a list by config parser
+                # e.g., ['KIND_CPU'] instead of 'KIND_CPU'
+                if isinstance(kind, list) and len(kind) > 0:
+                    kind = kind[0]
+                if isinstance(kind, str) and kind in ("KIND_CPU", "KIND_GPU"):
+                    return kind
+
+        return ""
+
     def _create_next_model_run_config(
         self,
         model: ModelProfileSpec,