2 changes: 1 addition & 1 deletion .github/workflows/autodocs.yaml
@@ -41,5 +41,5 @@ jobs:

- name: Check that documentation is up-to-date
run: |
-npm install -g @redocly/cli
+npm install -g @redocly/cli@1.34.2
python update_doc.py --check
16 changes: 8 additions & 8 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -21,7 +21,7 @@ default-members = [
resolver = "2"

[workspace.package]
-version = "3.3.4-dev0"
+version = "3.3.5-dev0"
edition = "2021"
authors = ["Olivier Dehaene"]
homepage = "https://github.com/huggingface/text-generation-inference"
6 changes: 3 additions & 3 deletions README.md
@@ -84,7 +84,7 @@ model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
-ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model
+ghcr.io/huggingface/text-generation-inference:3.3.5 --model-id $model
```

And then you can make requests like
@@ -121,7 +121,7 @@ curl localhost:8080/v1/chat/completions \

**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`, please note CPU is not the intended platform for this project, so performance might be subpar.

-**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.4-rocm --model-id $model` instead of the command above.
+**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.5-rocm --model-id $model` instead of the command above.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):
```
@@ -152,7 +152,7 @@ volume=$PWD/data # share a volume with the Docker container to avoid downloading
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
-ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model
+ghcr.io/huggingface/text-generation-inference:3.3.5 --model-id $model
```

### A note on Shared Memory (shm)
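Stepping outside the diff: the README note above describes running on a machine without GPUs by dropping `--gpus all` and adding `--disable-custom-kernels`. A minimal sketch of that CPU-only variant with the updated image tag (performance will be subpar, as the note warns):

```shell
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data

# CPU-only run: no --gpus all, and custom CUDA kernels disabled
docker run --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 \
    --model-id $model --disable-custom-kernels
```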
10 changes: 5 additions & 5 deletions backends/gaudi/examples/docker_commands/docker_commands.md
@@ -19,7 +19,7 @@ docker run -p 8080:80 \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
+ghcr.io/huggingface/text-generation-inference:3.3.5-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-size 32 \
@@ -39,7 +39,7 @@ docker run -p 8080:80 \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
+ghcr.io/huggingface/text-generation-inference:3.3.5-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
@@ -58,7 +58,7 @@ docker run -p 8080:80 \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
+ghcr.io/huggingface/text-generation-inference:3.3.5-gaudi \
--model-id $model \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-size 4
@@ -81,7 +81,7 @@ docker run -p 8080:80 \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
+ghcr.io/huggingface/text-generation-inference:3.3.5-gaudi \
--model-id $model \
--kv-cache-dtype fp8_e4m3fn \
--max-input-tokens 1024 --max-total-tokens 2048 \
@@ -102,7 +102,7 @@ docker run -p 8080:80 \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
+ghcr.io/huggingface/text-generation-inference:3.3.5-gaudi \
--model-id $model \
--kv-cache-dtype fp8_e4m3fn \
--sharded true --num-shard 8 \
1 change: 1 addition & 0 deletions backends/neuron/tests/server/test_prefill.py
@@ -56,6 +56,7 @@ def _test_prefill(config_name, generator, batch_size, do_sample):
assert tokens.ids[0] == expectations[0]
assert tokens.texts[0] == expectations[1]

+
def test_prefill_truncate(neuron_model_config):
config_name = neuron_model_config["name"]
neuron_model_path = neuron_model_config["neuron_model_path"]
62 changes: 41 additions & 21 deletions docs/openapi.json
@@ -10,7 +10,7 @@
"name": "Apache 2.0",
"url": "https://www.apache.org/licenses/LICENSE-2.0"
},
"version": "3.3.4-dev0"
"version": "3.3.5-dev0"
},
"paths": {
"/": {
@@ -57,7 +57,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Input validation error"
"error": "Input validation error",
"error_type": "validation"
}
}
}
@@ -70,7 +71,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Request failed during generation"
"error": "Request failed during generation",
"error_type": "generation"
}
}
}
@@ -83,7 +85,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Model is overloaded"
"error": "Model is overloaded",
"error_type": "overloaded"
}
}
}
@@ -96,7 +99,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Incomplete generation"
"error": "Incomplete generation",
"error_type": "incomplete_generation"
}
}
}
@@ -181,7 +185,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Input validation error"
"error": "Input validation error",
"error_type": "validation"
}
}
}
@@ -194,7 +199,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Request failed during generation"
"error": "Request failed during generation",
"error_type": "generation"
}
}
}
@@ -207,7 +213,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Model is overloaded"
"error": "Model is overloaded",
"error_type": "overloaded"
}
}
}
@@ -220,7 +227,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Incomplete generation"
"error": "Incomplete generation",
"error_type": "incomplete_generation"
}
}
}
@@ -264,7 +272,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Input validation error"
"error": "Input validation error",
"error_type": "validation"
}
}
}
@@ -277,7 +286,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Request failed during generation"
"error": "Request failed during generation",
"error_type": "generation"
}
}
}
@@ -290,7 +300,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Model is overloaded"
"error": "Model is overloaded",
"error_type": "overloaded"
}
}
}
@@ -303,7 +314,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Incomplete generation"
"error": "Incomplete generation",
"error_type": "incomplete_generation"
}
}
}
@@ -558,7 +570,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Input validation error"
"error": "Input validation error",
"error_type": "validation"
}
}
}
@@ -571,7 +584,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Request failed during generation"
"error": "Request failed during generation",
"error_type": "generation"
}
}
}
@@ -584,7 +598,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Model is overloaded"
"error": "Model is overloaded",
"error_type": "overloaded"
}
}
}
@@ -597,7 +612,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Incomplete generation"
"error": "Incomplete generation",
"error_type": "incomplete_generation"
}
}
}
@@ -646,7 +662,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Input validation error"
"error": "Input validation error",
"error_type": "validation"
}
}
}
@@ -659,7 +676,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Request failed during generation"
"error": "Request failed during generation",
"error_type": "generation"
}
}
}
@@ -672,7 +690,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Model is overloaded"
"error": "Model is overloaded",
"error_type": "overloaded"
}
}
}
@@ -685,7 +704,8 @@
"$ref": "#/components/schemas/ErrorResponse"
},
"example": {
"error": "Incomplete generation"
"error": "Incomplete generation",
"error_type": "incomplete_generation"
}
}
}
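Every openapi.json hunk above makes the same change: each `ErrorResponse` example gains an `error_type` field alongside `error`. A sketch of what a client would now see for a validation failure (the endpoint and payload are illustrative assumptions, not part of this diff):

```shell
# Deliberately request more tokens than the server allows (the limit is deployment-specific)
curl localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 10000000}}'

# Per the updated examples, the error body now carries both fields:
# {"error": "Input validation error", "error_type": "validation"}
```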