Commit 0174d94
x574chen and Xiaotong Chen authored

multimodal: update doc and model path in launcher (#48)

* update multimodal doc and requirement
* update model path

Co-authored-by: Xiaotong Chen <“cxt459847@alibaba-inc.com”>

1 parent 97108ec · commit 0174d94

3 files changed: +15, -9 lines changed


docs/sphinx/vlm/vlm_offline_inference_en.rst

Lines changed: 11 additions & 7 deletions
@@ -97,26 +97,30 @@ You can also use OpenAI's Python client library:
                 },
             ],
         }],
-        stream=False,
+        stream=True,
         max_completion_tokens=1024,
         temperature=0.1,
     )
 
+    full_response = ""
+    for chunk in response:
+        full_response += chunk.choices[0].delta.content
+        print(".", end="")
+
+    print(f"\nFull Response: \n{full_response}")
+
 Launching with CLI
 -------------------------
 You can also opt to install dashinfer-vlm locally and use command line to launch server.
 
 1. Pull dashinfer docker image (see :ref:`docker-label`)
 2. Install TensorRT Python package, and download TensorRT GA build from NVIDIA Developer Zone.
 
-Example: TensorRT 10.6.0.26 for CUDA 12.6, Linux x86_64
-
 .. code-block:: bash
 
-    pip install tensorrt
-    wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
-    tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
-    export LD_LIBRARY_PATH=`pwd`/TensorRT-10.6.0.26/lib
+    wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/tars/TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-12.6.tar.gz
+    tar -xvzf TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-12.6.tar.gz
+    export LD_LIBRARY_PATH=`pwd`/TensorRT-10.5.0.18/lib
 
 3. Install dashinfer Python Package from `release <https://github.com/modelscope/dash-infer/releases>`_
 4. Install dashinfer-vlm: ``pip install dashinfer-vlm``.
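
With stream=True the client returns an iterator of chunks rather than a single completion, which is what the new loop consumes. For context, a minimal self-contained sketch of the full streamed call; the base_url, api_key, prompt, and image URL are assumptions for illustration, and only the stream handling mirrors the documented change:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed server address

    response = client.chat.completions.create(
        model="model",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }],
        stream=True,
        max_completion_tokens=1024,
        temperature=0.1,
    )

    # With stream=True the response is an iterator of chunks; delta.content may
    # be None on some chunks (e.g. the initial role-only delta), so guard it.
    full_response = ""
    for chunk in response:
        full_response += chunk.choices[0].delta.content or ""
        print(".", end="")

    print(f"\nFull Response: \n{full_response}")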

multimodal/dashinfer_vlm/api_server/server.py

Lines changed: 2 additions & 1 deletion
@@ -76,7 +76,8 @@ def init():
     context.set("chat_format", chat_format)
 
     # -----------------------Convert Model------------------------
-    output_dir = "/root/.cache/as_model/" + model.split("/")[-1]
+    home_dir = os.environ.get("HOME") or "/root"
+    output_dir = os.path.join(home_dir, ".cache/as_model/", model.split("/")[-1])
     model_name = "model"
     data_type = "bfloat16"
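
This makes the converted-model cache respect the caller's HOME instead of hard-coding /root, falling back to /root only when HOME is unset. A minimal sketch of the resolved path (the model id below is a hypothetical stand-in; the path logic mirrors the two added lines):

    import os

    model = "qwen/Qwen2-VL-7B-Instruct"  # hypothetical model id for illustration
    home_dir = os.environ.get("HOME") or "/root"
    output_dir = os.path.join(home_dir, ".cache/as_model/", model.split("/")[-1])
    print(output_dir)  # e.g. /home/user/.cache/as_model/Qwen2-VL-7B-Instruct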

multimodal/requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
+tensorrt==10.5.0
 av
 numpy==1.24.3
 requests==2.32.3
@@ -6,7 +7,7 @@ transformers>=4.45.0
 cachetools>=5.4.0
 six
 tiktoken
-openai==1.52.2
+openai>=1.56.2
 shortuuid
 fastapi
 pydantic_settings
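
The new tensorrt==10.5.0 pin matches the TensorRT 10.5.0.18 tarball referenced in the doc above, and the openai pin is relaxed from ==1.52.2 to >=1.56.2. A quick sanity-check sketch, assuming both packages are installed in the current environment:

    # Verify the environment matches the updated pins.
    import openai
    import tensorrt

    print("tensorrt:", tensorrt.__version__)  # expect 10.5.0
    print("openai:", openai.__version__)      # expect >= 1.56.2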
