Description
As the title describes: latency histogram metrics are not exposed even when explicitly enabled with --metrics-config=histogram_latencies=true (details below).
Triton Information
I'm using the NGC tritonserver 25.10 container.
To Reproduce
Model: ResNet-50 with the TensorRT backend. config.pbtxt:

name: "resnet_50"
backend: "tensorrt"
max_batch_size: 8
model_warmup: {
  name: "sample"
  batch_size: 1
  inputs: {
    key: "input"
    value: {
      data_type: TYPE_FP32
      dims: [3, 224, 224]
      zero_data: true
    }
  }
}
optimization {
  graph: {
    level: 1
  }
  eager_batching: 1
  cuda: {
    graphs: 1
    graph_spec: [
      { batch_size: 1 },
      { batch_size: 2 },
      { batch_size: 3 },
      { batch_size: 4 },
      { batch_size: 5 },
      { batch_size: 6 },
      { batch_size: 7 },
      { batch_size: 8 }
    ]
    busy_wait_events: 1
    output_copy_stream: 0
  }
}
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 2000
}
instance_group [ { count: 2 kind: KIND_GPU gpus: [0] } ]

Start command, following the docs:
# try to enable histogram metrics
tritonserver --strict-model-config=0 --model-repository=./model_zoo --log-verbose=1 --metrics-config=histogram_latencies=true

curl 127.0.0.1:8002/metrics

But the histogram metrics are not contained in the response:
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet_50",version="1"} 2
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="resnet_50",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="REJECTED",version="1"} 0
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="resnet_50",version="1"} 2
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="resnet_50",version="1"} 2
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="resnet_50",version="1"} 2002
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="resnet_50",version="1"} 57
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{model="resnet_50",version="1"} 346
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{model="resnet_50",version="1"} 1511
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{model="resnet_50",version="1"} 65
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 119.912
nv_energy_consumption{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 117.864
nv_energy_consumption{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 123.874
nv_energy_consumption{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 123.32
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="resnet_50",version="1"} 0
# HELP nv_model_load_duration_secs Model load time in seconds
# TYPE nv_model_load_duration_secs gauge
nv_model_load_duration_secs{model="resnet_50",version="1"} 0.474947127
# HELP nv_pinned_memory_pool_total_bytes Pinned memory pool total memory size, in bytes
# TYPE nv_pinned_memory_pool_total_bytes gauge
nv_pinned_memory_pool_total_bytes 268435456
# HELP nv_pinned_memory_pool_used_bytes Pinned memory pool used memory size, in bytes
# TYPE nv_pinned_memory_pool_used_bytes gauge
nv_pinned_memory_pool_used_bytes 0
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 0
nv_gpu_utilization{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 0
nv_gpu_utilization{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 0
nv_gpu_utilization{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 0
# HELP nv_gpu_memory_total_bytes GPU total memory, in bytes
# TYPE nv_gpu_memory_total_bytes gauge
nv_gpu_memory_total_bytes{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 24146608128
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 1505755136
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 60.167
nv_gpu_power_usage{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 59.195
nv_gpu_power_usage{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 62.086
nv_gpu_power_usage{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 61.822
# HELP nv_gpu_power_limit GPU power management limit in watts
# TYPE nv_gpu_power_limit gauge
nv_gpu_power_limit{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 150
nv_gpu_power_limit{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 150
nv_gpu_power_limit{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 150
nv_gpu_power_limit{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 150
# HELP nv_cpu_utilization CPU utilization rate [0.0 - 1.0]
# TYPE nv_cpu_utilization gauge
nv_cpu_utilization 0.00444896402694801
# HELP nv_cpu_memory_total_bytes CPU total memory (RAM), in bytes
# TYPE nv_cpu_memory_total_bytes gauge
nv_cpu_memory_total_bytes 1028544114688
# HELP nv_cpu_memory_used_bytes CPU used memory (RAM), in bytes
# TYPE nv_cpu_memory_used_bytes gauge
nv_cpu_memory_used_bytes 64218046464

While we can't provide the model (it is private) to reproduce the performance regression, here is what we can report: the issue only occurs under high QPS (on the order of 0.x million queries per second). The key symptom is that both CPU and GPU utilization plateau below 100%, which points to an unknown bottleneck. Under normal conditions we can easily drive GPU utilization up to 100% with a perf_analyzer benchmark.
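For context, a load-generation command along these lines is what we mean by a perf_analyzer benchmark (the endpoint, protocol, and concurrency values here are illustrative, not the exact ones from our private setup):

# sweep client concurrency against the gRPC endpoint to push GPU utilization up
perf_analyzer -m resnet_50 -u 127.0.0.1:8001 -i grpc -b 1 --concurrency-range 8:64:8 --percentile=95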
Expected behavior
Histogram metrics can be enabled when --metrics-config=histogram_latencies=true is set.
Enabling summary metrics should not affect throughput.
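A quick sanity check we would expect to pass once this works (a sketch; it only looks for histogram-typed families and does not assume specific metric names):

# list any histogram-typed metric families in the scrape; currently this prints nothing
curl -s 127.0.0.1:8002/metrics | grep -E '^# TYPE .* histogram$' || echo "no histogram metrics exposed"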