148 changes: 140 additions & 8 deletions doc/source/locale/zh_CN/LC_MESSAGES/models/model_abilities/audio.po
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: Xinference \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-09-22 11:25+0800\n"
"POT-Creation-Date: 2025-11-05 17:38+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -701,9 +701,9 @@ msgid ""
"``False`` , and setting it to ``True`` enables randomness:"
msgstr ""
"可以省略情绪参考音频,转而提供一个包含8个浮点数的列表,按以下顺序指定每种"
"情绪的强度: ``[快乐, 愤怒, 悲伤, 恐惧, 厌恶, 忧郁, 惊讶, 平静]`` 。您还可以"
"使用 ``use_random`` 参数在推理过程中引入随机性情绪;默认值为 ``False`` ,设置为 ``"
"True`` 即可启用随机性情绪。"
"情绪的强度: ``[快乐, 愤怒, 悲伤, 恐惧, 厌恶, 忧郁, 惊讶, 平静]`` 。您还"
"可以使用 ``use_random`` 参数在推理过程中引入随机性情绪;默认值为 ``False`"
"` ,设置为 ``True`` 即可启用随机性情绪。"

#: ../../source/models/model_abilities/audio.rst:712
msgid ""
@@ -714,10 +714,10 @@ msgid ""
"for more natural sounding speech. You can introduce randomness with "
"``use_random`` (default: ``False``; ``True`` enables randomness):"
msgstr ""
"或者,您可以启用 ``use_emo_text`` 功能,根据您提供的 ``text`` 脚本引导情感"
"表达。您的文本脚本将自动转换为情感向量。使用文本情感模式时,建议将 ``emo_"
"alpha`` 设置为 0.6 左右(或更低),以获得更自然的语音效果。您可通过 ``use_"
"random`` 引入随机性(默认值:``False`` ;``True`` 启用随机性):"
"或者,您可以启用 ``use_emo_text`` 功能,根据您提供的 ``text`` 脚本引导"
"情感表达。您的文本脚本将自动转换为情感向量。使用文本情感模式时,建议将 ``"
"emo_alpha`` 设置为 0.6 左右(或更低),以获得更自然的语音效果。您可通过 `"
"`use_random`` 引入随机性(默认值:``False`` ;``True`` 启用随机性):"

#: ../../source/models/model_abilities/audio.rst:737
msgid ""
@@ -729,6 +729,110 @@ msgstr ""
"您也可以通过 ``emo_text`` 参数直接提供特定的文本情绪描述。您的情绪文本将"
"自动转换为情绪向量。这使您能够分别控制文本脚本和文本情绪描述:"

#: ../../source/models/model_abilities/audio.rst:761
msgid "IndexTTS2 Offline Usage"
msgstr "IndexTTS2 离线使用"

#: ../../source/models/model_abilities/audio.rst:763
msgid ""
"IndexTTS2 requires several small models that are downloaded automatically"
" during initialization. For offline environments, you can download these "
"models to a single directory and specify the directory path."
msgstr ""
"IndexTTS2需要多个小型模型,这些模型会在初始化过程中自动下载。在离线环境中"
",您可以将这些模型下载到单一目录,并指定该目录路径。"

#: ../../source/models/model_abilities/audio.rst:766
msgid "**Easy Setup Method**"
msgstr "**简易设置方法**"

#: ../../source/models/model_abilities/audio.rst:768
msgid ""
"The simplest way to set up offline usage is to copy the already "
"downloaded models from your Hugging Face cache:"
msgstr ""
"设置离线使用的最简单方法是将已下载的模型从Hugging Face缓存中复制出来:"

#: ../../source/models/model_abilities/audio.rst:770
msgid ""
"**Find your Hugging Face cache directory** (usually "
"``~/.cache/huggingface/hub/``)"
msgstr ""
"**查找您的Hugging Face缓存目录** (通常位于 ``~/.cache/huggingface/hub/`` )"

#: ../../source/models/model_abilities/audio.rst:771
msgid "**Copy the required models** to your target directory:"
msgstr "**将所需模型** 复制到目标目录:"

#: ../../source/models/model_abilities/audio.rst:784
msgid "The final directory structure should look like this:"
msgstr "最终的目录结构应如下所示:"

#: ../../source/models/model_abilities/audio.rst:810
msgid "**Required Models**"
msgstr "**所需模型**"

#: ../../source/models/model_abilities/audio.rst:812
msgid "The small models are automatically mapped as follows:"
msgstr "小型模型将按以下方式自动映射:"

#: ../../source/models/model_abilities/audio.rst:814
msgid ""
"**w2v-bert-2.0** (``models--facebook--w2v-bert-2.0``) - Feature "
"extraction model"
msgstr "**w2v-bert-2.0** (``models--facebook--w2v-bert-2.0``) - 特征提取模型"

#: ../../source/models/model_abilities/audio.rst:815
msgid "**campplus** (``models--funasr--campplus``) - Speaker recognition model"
msgstr "**campplus** (``models--funasr--campplus``) - 说话人识别模型"

#: ../../source/models/model_abilities/audio.rst:816
msgid ""
"**bigvgan** (``models--nvidia--bigvgan_v2_22khz_80band_256x``) - Vocoder "
"model"
msgstr "**bigvgan** (``models--nvidia--bigvgan_v2_22khz_80band_256x``) - 声码器模型"

#: ../../source/models/model_abilities/audio.rst:817
msgid ""
"**semantic_codec** (``models--amphion--MaskGCT``) - Semantic "
"encoding/decoding model"
msgstr "**semantic_codec** (``models--amphion--MaskGCT``) - 语义编码/解码模型"

#: ../../source/models/model_abilities/audio.rst:819
msgid "**Note about Directory Structure**"
msgstr "**关于目录结构的说明**"

#: ../../source/models/model_abilities/audio.rst:821
msgid ""
"The ``snapshots/`` directories contain version-specific model files with "
"hash names. Xinference automatically detects and uses the correct "
"snapshot directory, so you don't need to worry about the exact hash "
"values."
msgstr ""
"``snapshots/`` 目录包含具有哈希名称的特定版本模型文件。"
"Xinference会自动检测并使用正确的快照目录,因此您无需担心精确的哈希值。"

#: ../../source/models/model_abilities/audio.rst:823
msgid "**Launching IndexTTS2 with Offline Models**"
msgstr "**使用离线模型启动 IndexTTS2**"

#: ../../source/models/model_abilities/audio.rst:825
msgid ""
"When launching IndexTTS2 with Web UI, you can add an additional "
"parameter: - ``small_models_dir`` - Path to directory containing all "
"small models"
msgstr ""
"在通过Web UI启动IndexTTS2时,可添加额外参数:- ``small_models_dir`` - "
"包含所有小型模型的目录路径"

#: ../../source/models/model_abilities/audio.rst:828
msgid "When launching with command line, you can add the option:"
msgstr "在通过命令行启动时,您可以添加以下选项:"

#: ../../source/models/model_abilities/audio.rst:835
msgid "When launching with Python client:"
msgstr "使用 Python 客户端启动时:"

#~ msgid "**random sampling**"
#~ msgstr ""

@@ -755,3 +859,31 @@ msgstr ""
#~ "`False`; `True` enables randomness):"
#~ msgstr ""

#~ msgid ""
#~ "The required small models are: 1. "
#~ "**w2v-bert-2.0** - Feature extraction model"
#~ " (place in ``w2v-bert-2.0/`` subdirectory)"
#~ " 2. **semantic_codec** - Semantic "
#~ "encoding/decoding model (place in "
#~ "``semantic_codec/`` subdirectory) 3. **campplus**"
#~ " - Speaker recognition model (place "
#~ "in ``campplus/`` subdirectory) 4. **bigvgan**"
#~ " - Vocoder model (place in "
#~ "``bigvgan/`` subdirectory)"
#~ msgstr ""
#~ "所需的小型模型包括:1. **w2v-"
#~ "bert-2.0** - 特征提取模型(放置于"
#~ "``w2v-bert-2.0/``子目录)2. "
#~ "**semantic_codec** - 语义编码/解码"
#~ "模型(放置于``semantic_codec/``"
#~ "子目录)3. **campplus** - 说话"
#~ "人识别模型(放置于``campplus/``"
#~ "子目录) 4. **bigvgan** - 声"
#~ "码器模型(放置于``bigvgan/``子目录"
#~ ")"

#~ msgid ""
#~ "Assume downloaded to ``/path/to/small_models`` "
#~ "with the following structure:"
#~ msgstr "假设下载到``/path/to/small_models``目录,其结构如下:"

85 changes: 85 additions & 0 deletions doc/source/models/model_abilities/audio.rst
@@ -757,5 +757,90 @@ Here are several examples of how to use IndexTTS2:
use_random=False
)

IndexTTS2 Offline Usage
~~~~~~~~~~~~~~~~~~~~~~~~

IndexTTS2 requires several small models that are downloaded automatically during initialization.
For offline environments, you can download these models to a single directory and specify the directory path.

**Easy Setup Method**

The simplest way to set up offline usage is to copy the already downloaded models from your Hugging Face cache:

1. **Find your Hugging Face cache directory** (usually ``~/.cache/huggingface/hub/``)
2. **Copy the required models** to your target directory:

.. code-block:: bash

# Create your local models directory
mkdir -p /path/to/small_models

# Copy the downloaded models from Hugging Face cache
cp -r ~/.cache/huggingface/hub/models--facebook--w2v-bert-2.0 /path/to/small_models/
cp -r ~/.cache/huggingface/hub/models--funasr--campplus /path/to/small_models/
cp -r ~/.cache/huggingface/hub/models--nvidia--bigvgan_v2_22khz_80band_256x /path/to/small_models/
cp -r ~/.cache/huggingface/hub/models--amphion--MaskGCT /path/to/small_models/

The final directory structure should look like this:

.. code-block:: text

/path/to/small_models/
├── models--facebook--w2v-bert-2.0/ # Feature extraction model
│ └── snapshots/
│ └── [hash]/
│ ├── config.json
│ ├── model.safetensors
│ └── preprocessor_config.json
├── models--funasr--campplus/ # Speaker recognition model
│ └── snapshots/
│ └── [hash]/
│ └── campplus_cn_common.bin
├── models--nvidia--bigvgan_v2_22khz_80band_256x/ # Vocoder model
│ └── snapshots/
│ └── [hash]/
│ ├── config.json
│ └── bigvgan_generator.pt
└── models--amphion--MaskGCT/ # Semantic codec model
└── snapshots/
└── [hash]/
└── semantic_codec/
└── model.safetensors

**Required Models**

The small models are automatically mapped as follows:

1. **w2v-bert-2.0** (``models--facebook--w2v-bert-2.0``) - Feature extraction model
2. **campplus** (``models--funasr--campplus``) - Speaker recognition model
3. **bigvgan** (``models--nvidia--bigvgan_v2_22khz_80band_256x``) - Vocoder model
4. **semantic_codec** (``models--amphion--MaskGCT``) - Semantic encoding/decoding model

**Note about Directory Structure**

The ``snapshots/`` directories contain version-specific model files with hash names. Xinference automatically detects and uses the correct snapshot directory, so you don't need to worry about the exact hash values.
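As a rough illustration of that detection, here is a minimal sketch of how a loader might resolve the hash-named snapshot directory under a ``models--org--name`` folder. The helper name ``resolve_snapshot`` is hypothetical and does not correspond to Xinference's actual implementation; it only assumes the standard Hugging Face cache layout ``models--org--name/snapshots/<hash>/``:

```python
import os


def resolve_snapshot(model_root: str) -> str:
    """Return the snapshot directory under ``model_root``.

    Hypothetical helper: assumes the Hugging Face cache layout
    ``models--org--name/snapshots/<hash>/`` and picks the most
    recently modified snapshot if several exist.
    """
    snapshots = os.path.join(model_root, "snapshots")
    candidates = [
        d
        for d in os.listdir(snapshots)
        if os.path.isdir(os.path.join(snapshots, d))
    ]
    if not candidates:
        raise FileNotFoundError(f"no snapshot found under {snapshots}")
    # Prefer the newest snapshot when more than one hash directory exists.
    candidates.sort(key=lambda d: os.path.getmtime(os.path.join(snapshots, d)))
    return os.path.join(snapshots, candidates[-1])
```

Because the hash directory is discovered at load time, users only ever pass the ``models--org--name`` root and never need to know the hash value.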

**Launching IndexTTS2 with Offline Models**

When launching IndexTTS2 with Web UI, you can add an additional parameter:

- ``small_models_dir`` - Path to directory containing all small models

When launching with command line, you can add the option:

.. code-block:: bash

xinference launch --model-name IndexTTS2 --model-type audio \
--small_models_dir /path/to/small_models

When launching with Python client:

.. code-block:: python

model_uid = client.launch_model(
model_name="IndexTTS2",
model_type="audio",
small_models_dir="/path/to/small_models"
)



14 changes: 13 additions & 1 deletion xinference/model/audio/indextts2.py
@@ -56,13 +56,25 @@ def load(self):
use_fp16 = self._kwargs.get("use_fp16", False)
use_deepspeed = self._kwargs.get("use_deepspeed", False)

logger.info("Loading IndexTTS2 model...")
# Handle small model directory for offline deployment
small_models_config = (
self._model_spec.default_model_config
if getattr(self._model_spec, "default_model_config", None)
else {}
)
small_models_config.update(self._kwargs)

small_models_dir = small_models_config.get("small_models_dir")
logger.info(
f"Loading IndexTTS2 model... (small_models_dir: {small_models_dir})"
)
self._model = IndexTTS2(
cfg_path=config_path,
model_dir=self._model_path,
use_fp16=use_fp16,
device=self._device,
use_deepspeed=use_deepspeed,
small_models_dir=small_models_dir,
)

def speech(
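The ``load`` diff above merges the spec's ``default_model_config`` with the kwargs passed at launch time, so a launch-time ``small_models_dir`` overrides the ``null`` default from the model spec. A minimal sketch of that precedence rule (the function name is illustrative, not Xinference's API; note it copies the defaults rather than mutating the spec's dict in place):

```python
def merge_model_config(default_config, launch_kwargs):
    """Launch-time kwargs take precedence over spec defaults."""
    merged = dict(default_config or {})  # copy so the spec dict stays untouched
    merged.update(launch_kwargs)
    return merged


spec_defaults = {"small_models_dir": None}
launch_kwargs = {"small_models_dir": "/path/to/small_models", "use_fp16": True}
config = merge_model_config(spec_defaults, launch_kwargs)
# config["small_models_dir"] is now "/path/to/small_models";
# spec_defaults is unchanged.
```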
3 changes: 3 additions & 0 deletions xinference/model/audio/model_spec.json
@@ -943,6 +943,9 @@
"text2audio_emotion_control"
],
"multilingual": true,
"default_model_config": {
"small_models_dir": null
},
"virtualenv": {
"packages": [
"transformers==4.52.1",