Description
Hello. Needless to say, amazing library. Please let me know if you'd like me to try something or if you need more info.
I've been going through various local model providers trying to find one that works well, when I came across a rather shocking bug when running against Hugging Face's TGI model host.
The problem appears whether using the OpenAI-"compatible" endpoints or `HuggingFaceModel` with a custom `AsyncInferenceClient` and `HuggingFaceProvider`. Since the latter is probably the official approach, the code included here uses it.
System Info
curl 127.0.0.1:8080/info | jq:
{
"model_id": "/models/meta-llama/Meta-Llama-3-8B-Instruct",
"model_sha": null,
"model_pipeline_tag": null,
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_tokens": 8191,
"max_total_tokens": 8192,
"validation_workers": 2,
"max_client_batch_size": 4,
"router": "text-generation-router",
"version": "3.3.4-dev0",
"sha": "9f38d9305168f4b47c8c46b573f5b2c07881281d",
"docker_label": "sha-9f38d93"
}
nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05 Driver Version: 575.64.05 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 On | Off |
| 40% 54C P2 61W / 450W | 21499MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:48:00.0 Off | Off |
| 30% 43C P2 52W / 450W | 21394MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Setup
Here's the docker-compose.yaml I'm using to start TGI:
services:
text-generation-inference:
image: ghcr.io/huggingface/text-generation-inference:latest
container_name: tgi
ports:
- "8081:80"
volumes:
- ../../../models:/models:ro
- tgi-data:/data
environment:
- RUST_LOG=info
# I have also tested with 3.1-8B and 3.2-3B with the same end results
command: >
--model-id /models/meta-llama/Meta-Llama-3-8B-Instruct
--hostname 0.0.0.0
--port 80
--trust-remote-code
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0", "1"]
capabilities: [gpu]
shm_size: "64g"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:80/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
tgi-data:
driver: local
Code
All code is running in a Jupyter notebook.
Here's the common setup cell:
from huggingface_hub import AsyncInferenceClient
from pydantic_ai.models.huggingface import HuggingFaceModel
from pydantic_ai.profiles import ModelProfile
from pydantic_ai.profiles._json_schema import InlineDefsJsonSchemaTransformer
from pydantic_ai.providers.huggingface import HuggingFaceProvider
from pydantic_ai.providers.openai import OpenAIProvider
provider = OpenAIProvider(base_url="http://localhost:8081/v1")  # Just used to get the model slug
models = await provider.client.models.list()
client = AsyncInferenceClient(base_url="http://localhost:8081/")
print(f"Connected to TGI. Available models: {len(models.data)}")
for model in models.data:
print(f" - {model.id}")
# Create the model instance
agent_model = HuggingFaceModel(
models.data[0].id,
provider=HuggingFaceProvider(hf_client=client, api_key="None"),
# Annoyingly, despite this being basically the default profile, Llama 3's tool calls often fall through to the response without this
profile=ModelProfile(
supports_tools=True,
json_schema_transformer=InlineDefsJsonSchemaTransformer
)
)
Working: Basic requests and history
- Create the basic agent
from pydantic_ai import Agent
simple_agent = Agent(model=agent_model)
- Make a simple request
simple_result = await simple_agent.run("Tell me a joke.")
simple_result.output # "Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!"
- Test including previous messages in another simple request
simple_result_2 = await simple_agent.run(message_history=simple_result.all_messages())
simple_result_2.output # 'Why did the scarecrow win an award?\n\nBecause he was outstanding in his field! (get it?)'
Not working (or sometimes "working" with like 20 tool calls)
- Create the agent and a basic function
from pydantic_ai import Tool
from datetime import datetime
# Create a simple tool
@Tool
async def get_current_date() -> str:
"""Get the current date.
Returns:
str: The current date in YYYY-MM-DD format.
"""
return datetime.now().strftime("%Y-%m-%d")
# Create an agent with the simple tool
tool_agent = Agent(model=agent_model, tools=[get_current_date])
- Make a simple request that should use the tool call
tool_result = await tool_agent.run("What is the current date?")
tool_result.output # 'I apologize for the repetition! According to my system clock, the current date is indeed August 31st, 2025.'
- Hmm. 8 seconds for that request? Let's inspect the messages
for message in tool_result.all_messages():
print(message)
Which yields something like:
ModelRequest(parts=[UserPromptPart(content='What is the current date?', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, 70324, tzinfo=datetime.timezone.utc))])
ModelResponse(parts=[ToolCallPart(tool_name='get_current_date', args='{}', tool_call_id='0')], usage=RequestUsage(input_tokens=175, output_tokens=16), model_name='/models/meta-llama/Meta-Llama-3-8B-Instruct', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, tzinfo=datetime.timezone.utc), provider_name='huggingface', provider_request_id='')
ModelRequest(parts=[ToolReturnPart(tool_name='get_current_date', content='2025-08-31', tool_call_id='0', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, 288467, tzinfo=datetime.timezone.utc))])
ModelResponse(parts=[ToolCallPart(tool_name='get_current_date', args='{}', tool_call_id='0')], usage=RequestUsage(input_tokens=219, output_tokens=16), model_name='/models/meta-llama/Meta-Llama-3-8B-Instruct', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, tzinfo=datetime.timezone.utc), provider_name='huggingface', provider_request_id='')
ModelRequest(parts=[ToolReturnPart(tool_name='get_current_date', content='2025-08-31', tool_call_id='0', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, 505643, tzinfo=datetime.timezone.utc))])
ModelResponse(parts=[ToolCallPart(tool_name='get_current_date', args='{}', tool_call_id='0')], usage=RequestUsage(input_tokens=262, output_tokens=12), model_name='/models/meta-llama/Meta-Llama-3-8B-Instruct', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, tzinfo=datetime.timezone.utc), provider_name='huggingface', provider_request_id='')
ModelRequest(parts=[ToolReturnPart(tool_name='get_current_date', content='2025-08-31', tool_call_id='0', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 16, 674762, tzinfo=datetime.timezone.utc))])
[... the same ToolCallPart/ToolReturnPart pair repeats for roughly 30 more rounds, with input_tokens climbing steadily from 305 to 1638; trimmed here for readability ...]
ModelRequest(parts=[ToolReturnPart(tool_name='get_current_date', content='2025-08-31', tool_call_id='0', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 23, 587324, tzinfo=datetime.timezone.utc))])
ModelResponse(parts=[TextPart(content='I apologize for the repetition! According to my system clock, the current date is indeed August 31st, 2025.')], usage=RequestUsage(input_tokens=1521, output_tokens=12), model_name='/models/meta-llama/Meta-Llama-3-8B-Instruct', timestamp=datetime.datetime(2025, 8, 31, 8, 14, 23, tzinfo=datetime.timezone.utc), provider_name='huggingface', provider_request_id='')
35 tool calls!
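Until the underlying issue is fixed, a client-side guard can at least detect this runaway pattern before it burns a context window. This is a plain-Python sketch, not a pydantic_ai API: the `(tool_name, args)` tuple shape is an assumption for illustration, standing in for the `ToolCallPart` fields shown in the messages above.

```python
def count_trailing_repeats(tool_calls):
    """Count how many consecutive identical (tool_name, args) calls
    sit at the end of the call history."""
    if not tool_calls:
        return 0
    last = tool_calls[-1]
    n = 0
    for call in reversed(tool_calls):
        if call != last:
            break
        n += 1
    return n

# Simulated history mirroring the log above: the model keeps issuing
# the same zero-argument call instead of consuming the tool result.
history = [("get_current_date", "{}")] * 35
assert count_trailing_repeats(history) >= 3  # clearly looping
```

A caller could check this after each model turn and abort (or inject a "stop calling tools" message) once the count crosses a small threshold.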
Here's a TGI log from one of the calls:
INFO chat_completions{parameters="GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: None, return_full_text: None, stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: Some(Json(Object {\"$functions\": Object {\"get_current_date\": Object {\"description\": String(\"<summary>Get the current date.</summary>\\n<returns>\\n<type>str</type>\\n<description>The current date in YYYY-MM-DD format.</description>\\n</returns>\"), \"additionalProperties\": Bool(false), \"properties\": Object {\"_name\": Object {\"type\": String(\"string\"), \"const\": String(\"get_current_date\")}}, \"required\": Array [String(\"_name\")]}, \"no_tool\": Object {\"description\": String(\"Open ended response with no specific tool selected\"), \"additionalProperties\": Bool(false), \"properties\": Object {\"_name\": Object {\"type\": String(\"string\"), \"const\": String(\"no_tool\")}}, \"required\": Array [String(\"_name\")]}}, \"properties\": Object {\"function\": Object {\"anyOf\": Array [Object {\"$ref\": String(\"#/$functions/get_current_date\")}, Object {\"$ref\": String(\"#/$functions/no_tool\")}]}}})), adapter_id: Some(\"/models/meta-llama/Meta-Llama-3-8B-Instruct\") }" total_time="180.268942ms" validation_time="1.161794ms" queue_time="46.08µs" inference_time="179.061248ms" time_per_token="14.92177ms" seed="Some(6476155871046790452)" total_time="349.189932ms" validation_time="948.707µs" queue_time="38.419µs" inference_time="348.202936ms" time_per_token="12.896405ms" seed="Some(5246360728990037330)"}: text_generation_router::server: router/src/server.rs:432: Success
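Reading that grammar: TGI compiles the tool list into a JSON schema that forces every response into an object whose `function` field carries a `_name` discriminator, either the real tool or the synthetic `no_tool` escape hatch. A minimal stdlib sketch of the two payload shapes the grammar admits (reconstructed from this log line, so treat the field names as observed here rather than as documented TGI API):

```python
import json

# The only two payload shapes the logged grammar permits:
tool_call = {"function": {"_name": "get_current_date"}}
no_tool = {"function": {"_name": "no_tool"}}

def picked_tool(payload: dict) -> str:
    """Return the _name discriminator TGI's grammar forces into the output."""
    return payload["function"]["_name"]

# Round-trip through JSON, as the server would emit it:
print(picked_tool(json.loads(json.dumps(tool_call))))  # get_current_date
```

Notably, `get_current_date` takes no arguments, so every constrained generation that picks the tool branch produces an identical payload, which matches the identical `args='{}'` calls repeating in the messages above.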
Expected behavior
I'd understand if it failed to call the tool, but getting the current date 35 times is a bit much! Ideally, `HuggingFaceModel` would work with TGI and tool calls.