
[metrics] can't enable histogram_latencies metrics and enabling summary metrics result in 33% performance regression #8561

@WingEdge777

Description

As the title says: starting the server with --metrics-config histogram_latencies=true does not produce any histogram metrics on the /metrics endpoint, and enabling summary metrics causes roughly a 33% throughput regression under high load.

Triton Information
I'm using the NGC tritonserver 25.10 container.

To Reproduce
Model: ResNet-50 on the tensorrt backend, with the following config.pbtxt:

name: "resnet_50"
backend: "tensorrt"
max_batch_size: 8
model_warmup: {
    name: "sample"
    batch_size: 1
    inputs: {
        key: "input"
        value: {
            data_type: TYPE_FP32,
            dims: [3, 224, 224],
	        zero_data: true
        }
    }
}

optimization{
   graph: {
       level : 1
   },
   eager_batching : 1,
   cuda: {
       graphs: 1,
       graph_spec: [
            { batch_size: 1 },
            { batch_size: 2 },
            { batch_size: 3 },
            { batch_size: 4 },
            { batch_size: 5 },
            { batch_size: 6 },
            { batch_size: 7 },
            { batch_size: 8 }
        ]
       busy_wait_events:1,
       output_copy_stream: 0
   }
}


dynamic_batching {
  preferred_batch_size: [4,8]
  max_queue_delay_microseconds: 2000
}
instance_group [ { count: 2 kind: KIND_GPU gpus:[0]}]

Start command, following the docs:

# try to enable histogram
tritonserver --strict-model-config=0  --model-repository=./model_zoo --log-verbose=1 --metrics-config=histogram_latencies=true
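
For the summary-metrics side of the title, the analogous documented setting is summary_latencies; the start command for that run looks roughly like this (the other flags just mirror the histogram attempt above, exact values illustrative):

# try to enable summary metrics (this is the run that regresses under load)
tritonserver --strict-model-config=0  --model-repository=./model_zoo --log-verbose=1 --metrics-config=summary_latencies=true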

curl 127.0.0.1:8002/metrics

But no histogram metrics are present in the response:

# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet_50",version="1"} 2
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="resnet_50",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="resnet_50",reason="REJECTED",version="1"} 0
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="resnet_50",version="1"} 2
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="resnet_50",version="1"} 2
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="resnet_50",version="1"} 2002
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="resnet_50",version="1"} 57
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{model="resnet_50",version="1"} 346
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{model="resnet_50",version="1"} 1511
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{model="resnet_50",version="1"} 65
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 119.912
nv_energy_consumption{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 117.864
nv_energy_consumption{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 123.874
nv_energy_consumption{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 123.32
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="resnet_50",version="1"} 0
# HELP nv_model_load_duration_secs Model load time in seconds
# TYPE nv_model_load_duration_secs gauge
nv_model_load_duration_secs{model="resnet_50",version="1"} 0.474947127
# HELP nv_pinned_memory_pool_total_bytes Pinned memory pool total memory size, in bytes
# TYPE nv_pinned_memory_pool_total_bytes gauge
nv_pinned_memory_pool_total_bytes 268435456
# HELP nv_pinned_memory_pool_used_bytes Pinned memory pool used memory size, in bytes
# TYPE nv_pinned_memory_pool_used_bytes gauge
nv_pinned_memory_pool_used_bytes 0
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 0
nv_gpu_utilization{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 0
nv_gpu_utilization{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 0
nv_gpu_utilization{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 0
# HELP nv_gpu_memory_total_bytes GPU total memory, in bytes
# TYPE nv_gpu_memory_total_bytes gauge
nv_gpu_memory_total_bytes{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 24146608128
nv_gpu_memory_total_bytes{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 24146608128
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 1231028224
nv_gpu_memory_used_bytes{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 1505755136
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 60.167
nv_gpu_power_usage{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 59.195
nv_gpu_power_usage{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 62.086
nv_gpu_power_usage{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 61.822
# HELP nv_gpu_power_limit GPU power management limit in watts
# TYPE nv_gpu_power_limit gauge
nv_gpu_power_limit{gpu_uuid="GPU-1db888ee-53ba-1696-12a4-119f64795f65"} 150
nv_gpu_power_limit{gpu_uuid="GPU-81348efe-52ea-e254-0b9c-5949c10c6c33"} 150
nv_gpu_power_limit{gpu_uuid="GPU-84e52c62-b59a-d66e-e7d9-971438b98c4e"} 150
nv_gpu_power_limit{gpu_uuid="GPU-b596c515-4806-a834-d144-17c6c39b6adf"} 150
# HELP nv_cpu_utilization CPU utilization rate [0.0 - 1.0]
# TYPE nv_cpu_utilization gauge
nv_cpu_utilization 0.00444896402694801
# HELP nv_cpu_memory_total_bytes CPU total memory (RAM), in bytes
# TYPE nv_cpu_memory_total_bytes gauge
nv_cpu_memory_total_bytes 1028544114688
# HELP nv_cpu_memory_used_bytes CPU used memory (RAM), in bytes
# TYPE nv_cpu_memory_used_bytes gauge
nv_cpu_memory_used_bytes 64218046464
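
Filtering the scrape by metric type is a quick way to confirm the absence (plain curl + grep, nothing Triton-specific):

# only counter and gauge families are reported; this returns no lines
curl -s 127.0.0.1:8002/metrics | grep -E '^# TYPE .* histogram$'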

While we can't share the model used for the performance-regression part (it's private), here is what we can describe: the regression only occurs under a high-QPS scenario (0.x million per L20). The key symptom is that both CPU and GPU utilization plateau below 100%, which points to an unknown bottleneck. Normally we can easily drive GPU utilization up to 100% with a perf_analyzer benchmark like the one sketched below.
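
For reference, a perf_analyzer invocation along these lines (model name, port, batch size, and concurrency range are illustrative) is how we normally saturate the GPU:

perf_analyzer -m resnet_50 -u 127.0.0.1:8001 -i grpc -b 8 --concurrency-range 8:64:8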


Expected behavior
Histogram metrics should appear on the /metrics endpoint when histogram_latencies=true is set.

Enabling summary metrics should not affect throughput.
