|
51 | 51 | }, |
52 | 52 | { |
53 | 53 | "cell_type": "code", |
54 | | - "execution_count": 4, |
| 54 | + "execution_count": null, |
55 | 55 | "id": "8dd6a36f-ee95-453d-a127-c8a7de6a026d", |
56 | 56 | "metadata": { |
57 | 57 | "tags": [] |
58 | 58 | }, |
59 | | - "outputs": [ |
60 | | - { |
61 | | - "name": "stdout", |
62 | | - "output_type": "stream", |
63 | | - "text": [ |
64 | | - "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", |
65 | | - "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", |
66 | | - "\u001b[0m" |
67 | | - ] |
68 | | - } |
69 | | - ], |
| 59 | + "outputs": [], |
70 | 60 | "source": [ |
71 | 61 | "!pip install -Uq pip\n", |
72 | 62 | "!pip install -Uq sagemaker boto3 huggingface_hub " |
73 | 63 | ] |
74 | 64 | }, |
75 | 65 | { |
76 | 66 | "cell_type": "code", |
77 | | - "execution_count": 5, |
| 67 | + "execution_count": null, |
78 | 68 | "id": "cf0c89f4-679c-4557-b95d-1d954c15a020", |
79 | 69 | "metadata": { |
80 | 70 | "tags": [] |
|
93 | 83 | }, |
94 | 84 | { |
95 | 85 | "cell_type": "code", |
96 | | - "execution_count": 6, |
| 86 | + "execution_count": null, |
97 | 87 | "id": "83d5a162-e9be-469b-910e-18cca8c359f8", |
98 | 88 | "metadata": { |
99 | 89 | "tags": [] |
100 | 90 | }, |
101 | | - "outputs": [ |
102 | | - { |
103 | | - "name": "stdout", |
104 | | - "output_type": "stream", |
105 | | - "text": [ |
106 | | - "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n", |
107 | | - "sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml\n", |
108 | | - "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n", |
109 | | - "sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml\n" |
110 | | - ] |
111 | | - } |
112 | | - ], |
| 91 | + "outputs": [], |
113 | 92 | "source": [ |
114 | 93 | "role = sagemaker.get_execution_role() # execution role for the endpoint\n", |
115 | 94 | "sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs\n", |
|
118 | 97 | }, |
119 | 98 | { |
120 | 99 | "cell_type": "code", |
121 | | - "execution_count": 7, |
| 100 | + "execution_count": null, |
122 | 101 | "id": "f68b5181-d018-4564-9762-fa8770a9672f", |
123 | 102 | "metadata": { |
124 | 103 | "tags": [] |
|
159 | 138 | }, |
160 | 139 | { |
161 | 140 | "cell_type": "code", |
162 | | - "execution_count": 8, |
| 141 | + "execution_count": null, |
163 | 142 | "id": "94af859c-4c3a-4fda-ae27-890be565a906", |
164 | 143 | "metadata": { |
165 | 144 | "tags": [] |
166 | 145 | }, |
167 | | - "outputs": [ |
168 | | - { |
169 | | - "data": { |
170 | | - "application/vnd.jupyter.widget-view+json": { |
171 | | - "model_id": "45357719d0564c0b85a87e104ef9c0e1", |
172 | | - "version_major": 2, |
173 | | - "version_minor": 0 |
174 | | - }, |
175 | | - "text/plain": [ |
176 | | - "Fetching 39 files: 0%| | 0/39 [00:00<?, ?it/s]" |
177 | | - ] |
178 | | - }, |
179 | | - "metadata": {}, |
180 | | - "output_type": "display_data" |
181 | | - }, |
182 | | - { |
183 | | - "name": "stdout", |
184 | | - "output_type": "stream", |
185 | | - "text": [ |
186 | | - "CPU times: user 151 ms, sys: 10.9 ms, total: 162 ms\n", |
187 | | - "Wall time: 546 ms\n" |
188 | | - ] |
189 | | - } |
190 | | - ], |
| 146 | + "outputs": [], |
191 | 147 | "source": [ |
192 | 148 | "%%time\n", |
193 | 149 | "from huggingface_hub import snapshot_download\n", |
|
278 | 234 | "Here is a list of settings that we use in this configuration file -\n", |
279 | 235 | "\n", |
280 | 236 | "- `engine`: The runtime engine for DJL to use. The possible values for engine include *Python*, *DeepSpeed*, *FasterTransformer*, and *MPI*. In this case, we set it to MPI. MPI, Model Parallelization and Inference facilitates partitioning the model across all the available GPUs and thus accelerate the inference.\n", |
281 | | - "- `option.tensor_parallel_degree` - This option specifies number of tensor parallel partitions performed on the model.\n", |
| 237 | + "- `option.tensor_parallel_degree` - This option specifies number of tensor parallel partitions performed on the model. Set to the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. For example if we have a 4 GPU machine and we are creating 4 partitions then we will have 1 worker per model to serve the requests.\n", |
| 238 | + "- `option.low_cpu_mem_usage` - Reduces CPU memory usage when loading models. We recommend that you set this to TRUE.\n", |
282 | 239 | "- `option.rolling_batch` – Enables iteration-level batching using one of the supported strategies. Values include `auto`, `scheduler`, and `lmi-dist`. We use `lmi-dist` for turning on continuous batching for Llama 2.\n", |
283 | 240 | "- `option.max_rolling_batch_size` – Limits the number of concurrent requests in the continuous batch. Defaults to 32.\n", |
284 | 241 | "- `option.model_id`: The model id of a pretrained model hosted inside a [model repository on huggingface](https://huggingface.co/models) or S3 path to the model artefact. \n", |
285 | | - "- `option.tensor_parallel_degree`: Set to the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. For example if we have a 4 GPU machine and we are creating 4 partitions then we will have 1 worker per model to serve the requests.\n", |
286 | | - "- `option.enable_streaming`: As we need a response streaming for inferencing have reduced perceived latency, we will set it to *true*\n", |
| 242 | + "- `option.paged_attention` - Use PagedAttention or not. Default is always use. Disable this if you plan to run on G4 or older GPU architecture\n", |
| 243 | + "\n", |
| 244 | + "For more details on the configuration options and an exhaustive list, you can refer the documentation -\n", |
287 | 245 | "\n", |
288 | | - "For more details on the configuration options and an exhaustive list, you can refer the documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.\n", |
| 246 | + "[Model parallelism and large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html)\n", |
289 | 247 | "\n", |
290 | | - "Since we are serving the model using deepspeed container, and Llama 2 being a large model used for inference, we are following the approach of [Large model inference with DeepSpeed and DJL Serving](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-deepspeed-djl.html)" |
| 248 | + "Since we are serving the model using deepspeed container, and Llama 2 being a large model used for inference, we are following the approach of [Large model inference with DeepSpeed and DJL Serving](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-deepspeed-djl.html)\n" |
291 | 249 | ] |
292 | 250 | }, |
293 | 251 | { |
|
304 | 262 | "model_id = base_model_s3_uri" |
305 | 263 | ] |
306 | 264 | }, |
307 | | - { |
308 | | - "cell_type": "markdown", |
309 | | - "id": "3bc7098f-9739-4504-a98f-553936b4f5ab", |
310 | | - "metadata": {}, |
311 | | - "source": [ |
312 | | - "We will also set *enable_streaming* to *true* for obtaining response stream when we inference Llama 2. Since we are deploying llama 2 70B chat, we are setting the **tensor_parallel_degree** to **8** and making use of all the 8 NVIDIA A10G Tensor core GPUs available on the [`ml.g5.48xlarge`](https://aws.amazon.com/sagemaker/pricing/#:~:text=Amazon%20SageMaker%20G5%20instance%20product%20details) instance." |
313 | | - ] |
314 | | - }, |
315 | 265 | { |
316 | 266 | "cell_type": "code", |
317 | 267 | "execution_count": null, |
|
325 | 275 | "engine = MPI\n", |
326 | 276 | "option.entryPoint=djl_python.huggingface\n", |
327 | 277 | "option.tensor_parallel_degree=8\n", |
328 | | - "option.rolling_batch_type=LmiDistRollingBatch\n", |
329 | 278 | "option.rolling_batch=lmi-dist\n", |
330 | 279 | "option.max_rolling_batch_size=64\n", |
| 280 | + "option.model_loading_timeout=900\n", |
331 | 281 | "option.max_rolling_batch_prefill_tokens=16384\n", |
332 | | - "option.model_loading_timeout=120\n", |
333 | 282 | "option.model_id={{model_id}}\n", |
334 | | - "option.paged_attention=true\n", |
335 | | - "option.enable_streaming=true" |
| 283 | + "option.paged_attention=true" |
| 284 | + ] |
| 285 | + }, |
| 286 | + { |
| 287 | + "cell_type": "markdown", |
| 288 | + "id": "3bc7098f-9739-4504-a98f-553936b4f5ab", |
| 289 | + "metadata": {}, |
| 290 | + "source": [ |
| 291 | + "If you are using djl-deepspeed container version 0.25.0, you don't need to set the serving parameter `option.enable_streaming` to true, as the streaming is enabled by default. If you are using version 0.24.0, you will need to set `option.enable_streaming` to `true` for obtaining response stream while inferencing. Since we are deploying llama 2 70B chat, we are setting the **tensor_parallel_degree** to **8** and making use of all the 8 NVIDIA A10G Tensor core GPUs available on the [`ml.g5.48xlarge`](https://aws.amazon.com/sagemaker/pricing/#:~:text=Amazon%20SageMaker%20G5%20instance%20product%20details) instance." |
336 | 292 | ] |
337 | 293 | }, |
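| | + {
| | + "cell_type": "markdown",
| | + "id": "streaming-invoke-sketch-md",
| | + "metadata": {},
| | + "source": [
| | + "After the endpoint is deployed and in service later in this notebook, the response stream can be consumed with the SageMaker runtime API. The next cell is a minimal sketch only: it assumes a variable `endpoint_name` holding the name of the deployed endpoint and a container that streams raw UTF-8 chunks, so adjust the request payload and the parsing to your container version."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "id": "streaming-invoke-sketch-code",
| | + "metadata": {
| | + "tags": []
| | + },
| | + "outputs": [],
| | + "source": [
| | + "# Minimal sketch: consume the token stream from the deployed endpoint.\n",
| | + "# Assumption: `endpoint_name` holds the name of the endpoint created later in\n",
| | + "# this notebook and the container streams raw UTF-8 chunks; adjust the parsing\n",
| | + "# to the payload format of your container version.\n",
| | + "import json\n",
| | + "\n",
| | + "import boto3\n",
| | + "\n",
| | + "smr = boto3.client(\"sagemaker-runtime\")\n",
| | + "\n",
| | + "body = {\n",
| | + "    \"inputs\": \"What is Amazon SageMaker?\",\n",
| | + "    \"parameters\": {\"max_new_tokens\": 256, \"temperature\": 0.6},\n",
| | + "}\n",
| | + "\n",
| | + "response = smr.invoke_endpoint_with_response_stream(\n",
| | + "    EndpointName=endpoint_name,  # assumed to be defined once the endpoint is deployed\n",
| | + "    Body=json.dumps(body),\n",
| | + "    ContentType=\"application/json\",\n",
| | + ")\n",
| | + "\n",
| | + "# Each event in the stream carries a PayloadPart with a chunk of bytes.\n",
| | + "for event in response[\"Body\"]:\n",
| | + "    chunk = event.get(\"PayloadPart\", {}).get(\"Bytes\")\n",
| | + "    if chunk:\n",
| | + "        print(chunk.decode(\"utf-8\"), end=\"\")"
| | + ]
| | + },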
338 | 294 | { |
|
370 | 326 | "outputs": [], |
371 | 327 | "source": [ |
372 | 328 | "inference_image_uri = image_uris.retrieve(\n", |
373 | | - " framework=\"djl-deepspeed\", region=region, version=\"0.24.0\"\n", |
| 329 | + " framework=\"djl-deepspeed\", region=region, version=\"0.25.0\"\n", |
374 | 330 | ")\n", |
375 | 331 | "inference_image_uri" |
376 | 332 | ] |
|
634 | 590 | "\n", |
635 | 591 | "- [Improve throughput performance of Llama 2 models using Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/improve-throughput-performance-of-llama-2-models-using-amazon-sagemaker/)\n", |
636 | 592 | "- [Improve performance of Falcon models with Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/improve-performance-of-falcon-models-with-amazon-sagemaker/)\n", |
637 | | - "- [serving.properties - Configurations and settings](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html)" |
| 593 | + "- [serving.properties - Configurations and settings](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html)\n", |
| 594 | + "- [Amazon SageMaker launches a new version of Large Model Inference DLC with TensorRT-LLM support](https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-sagemaker-large-model-inference-dlc-tensorrt-llm-support/)" |
638 | 595 | ] |
639 | 596 | }, |
640 | 597 | { |
|