
[metrics] can't enable histogram_latencies metrics and enabling summary metrics result in 33% performance regression #8561

@WingEdge777

Description

As the title says: starting the server with --metrics-config histogram_latencies=true does not produce any histogram metrics on the /metrics endpoint, and enabling summary metrics causes roughly a 33% throughput regression under high load.

Triton Information
I'm using the NGC tritonserver 25.10 container.

To Reproduce
Model: ResNet-50 on the tensorrt backend, with the following config.pbtxt:

name: "resnet_50"
backend: "tensorrt"
max_batch_size: 8
model_warmup: {
    name: "sample"
    batch_size: 1
    inputs: {
        key: "input"
        value: {
            data_type: TYPE_FP32,
            dims: [3, 224, 224],
	        zero_data: true
        }
    }
}

optimization{
   graph: {
       level : 1
   },
   eager_batching : 1,
   cuda: {
       graphs: 1,
       graph_spec: [
            { batch_size: 1 },
            { batch_size: 2 },
            { batch_size: 3 },
            { batch_size: 4 },
            { batch_size: 5 },
            { batch_size: 6 },
            { batch_size: 7 },
            { batch_size: 8 }
        ]
       busy_wait_events:1,
       output_copy_stream: 0
   }
}


dynamic_batching {
  preferred_batch_size: [4,8]
  max_queue_delay_microseconds: 2000
}
instance_group [ { count: 2 kind: KIND_GPU gpus:[0]}]

Start command, following the docs:

# try to enable histogram
tritonserver --strict-model-config=0  --model-repository=./model_zoo --log-verbose=1 --metrics-config=histogram_latencies=true
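
For the summary-metrics side of the title, the analogous documented setting is summary_latencies; the start command for that run looks roughly like this (the other flags just mirror the histogram attempt above, exact values illustrative):

# try to enable summary metrics (this is the run that regresses under load)
tritonserver --strict-model-config=0  --model-repository=./model_zoo --log-verbose=1 --metrics-config=summary_latencies=true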

curl 127.0.0.1:8002/metrics

But no histogram metrics are present in the response:

# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet_50",version="1"} 2
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="resnet_50",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="REJECTED",version="1"} 0
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="resnet_50",version="1"} 2
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="resnet_50",version="1"} 2
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="resnet_50",version="1"} 2002
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="resnet_50",version="1"} 57
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{model="resnet_50",version="1"} 346
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{model="resnet_50",version="1"} 1511
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{model="resnet_50",version="1"} 65
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 119.912
nv_energy_consumption{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 117.864
nv_energy_consumption{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 123.874
nv_energy_consumption{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 123.32
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="resnet_50",version="1"} 0
# HELP nv_model_load_duration_secs Model load time in seconds
# TYPE nv_model_load_duration_secs gauge
nv_model_load_duration_secs{model="resnet_50",version="1"} 0.474947127
# HELP nv_pinned_memory_pool_total_bytes Pinned memory pool total memory size, in bytes
# TYPE nv_pinned_memory_pool_total_bytes gauge
nv_pinned_memory_pool_total_bytes 268435456
# HELP nv_pinned_memory_pool_used_bytes Pinned memory pool used memory size, in bytes
# TYPE nv_pinned_memory_pool_used_bytes gauge
nv_pinned_memory_pool_used_bytes 0
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 0
nv_gpu_utilization{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 0
nv_gpu_utilization{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 0
nv_gpu_utilization{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 0
# HELP nv_gpu_memory_total_bytes GPU total memory, in bytes
# TYPE nv_gpu_memory_total_bytes gauge
nv_gpu_memory_total_bytes{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 24146608128
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 1505755136
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 60.167
nv_gpu_power_usage{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 59.195
nv_gpu_power_usage{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 62.086
nv_gpu_power_usage{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 61.822
# HELP nv_gpu_power_limit GPU power management limit in watts
# TYPE nv_gpu_power_limit gauge
nv_gpu_power_limit{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 150
nv_gpu_power_limit{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 150
nv_gpu_power_limit{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 150
nv_gpu_power_limit{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 150
# HELP nv_cpu_utilization CPU utilization rate [0.0 - 1.0]
# TYPE nv_cpu_utilization gauge
nv_cpu_utilization 0.00444896402694801
# HELP nv_cpu_memory_total_bytes CPU total memory (RAM), in bytes
# TYPE nv_cpu_memory_total_bytes gauge
nv_cpu_memory_total_bytes 1028544114688
# HELP nv_cpu_memory_used_bytes CPU used memory (RAM), in bytes
# TYPE nv_cpu_memory_used_bytes gauge
nv_cpu_memory_used_bytes 64218046464
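
Filtering the scrape by metric type is a quick way to confirm the absence (plain curl + grep, nothing Triton-specific):

# only counter and gauge families are reported; this returns no lines
curl -s 127.0.0.1:8002/metrics | grep -E '^# TYPE .* histogram$'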

While we can't share the model used for the performance-regression part (it's private), here is what we can describe: the regression only occurs under a high-QPS scenario (0.x million per L20). The key symptom is that both CPU and GPU utilization plateau below 100%, which points to an unknown bottleneck. Normally we can easily drive GPU utilization up to 100% with a perf_analyzer benchmark like the one sketched below.
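
For reference, a perf_analyzer invocation along these lines (model name, port, batch size, and concurrency range are illustrative) is how we normally saturate the GPU:

perf_analyzer -m resnet_50 -u 127.0.0.1:8001 -i grpc -b 8 --concurrency-range 8:64:8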


Expected behavior
Histogram metrics should appear on the /metrics endpoint when histogram_latencies=true is set.

Enabling summary metrics should not affect throughput.
