Commit cfdce59

Use the correct vllm metric gpu_cache_usage_perc --> kv_cache_usage_perc (#1905)
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
1 parent a51e074 commit cfdce59

3 files changed: 3 additions & 3 deletions

docs/proposals/003-model-server-protocol/README.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ effort.
 | ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | TotalRunningRequests | Gauge | The current total number of requests actively being served on the model server.| `vllm:num_requests_running`| `nv_trt_llm_request_metrics{request_type=scheduled}`| `sglang:num_running_reqs`
-| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
+| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:kv_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
 | [Optional] BlockSize | Labeled | The block size in tokens to allocate memory, used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
 | [Optional] NumGPUBlocks| Labeled | The total number of blocks in the HBM KV cache, used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |

pkg/epp/datalayer/metrics/extractor_test.go

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ const (
 	// use hardcoded values - importing causes cycle
 	defaultTotalQueuedRequestsMetric = "vllm:num_requests_waiting"
 	defaultTotalRunningRequestsMetric = "vllm:num_requests_running"
-	defaultKvCacheUsagePercentageMetric = "vllm:gpu_cache_usage_perc"
+	defaultKvCacheUsagePercentageMetric = "vllm:kv_cache_usage_perc"
 	defaultLoraInfoMetric = "vllm:lora_requests_info"
 	defaultCacheInfoMetric = "vllm:cache_config_info"
 )

pkg/epp/server/runserver.go

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ const (
 	DefaultEnablePprof = true // default for --enable-pprof
 	DefaultTotalQueuedRequestsMetric = "vllm:num_requests_waiting" // default for --total-queued-requests-metric
 	DefaultTotalRunningRequestsMetric = "vllm:num_requests_running" // default for --total-running-requests-metric
-	DefaultKvCacheUsagePercentageMetric = "vllm:gpu_cache_usage_perc" // default for --kv-cache-usage-percentage-metric
+	DefaultKvCacheUsagePercentageMetric = "vllm:kv_cache_usage_perc" // default for --kv-cache-usage-percentage-metric
 	DefaultLoraInfoMetric = "vllm:lora_requests_info" // default for --lora-info-metric
 	DefaultCacheInfoMetric = "vllm:cache_config_info" // default for --cache-info-metric
 	DefaultCertPath = "" // default for --cert-path
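
Review note: this commit only changes the default metric name. A deployment that still scrapes a model server exporting the old name would presumably need to pass it through `--kv-cache-usage-percentage-metric`. As an illustration only (this is not code from the commit or the repository), a lookup that prefers the new name and falls back to the old one could be sketched as below. `resolveKVCacheUsage` and the `samples` map are hypothetical; only the two metric-name strings come from the diff.

```go
package main

import "fmt"

// Metric names copied from the diff; everything else here is hypothetical.
const (
	kvCacheUsageMetric       = "vllm:kv_cache_usage_perc"  // new default
	legacyKVCacheUsageMetric = "vllm:gpu_cache_usage_perc" // previous default
)

// resolveKVCacheUsage is a hypothetical helper. It looks up the KV cache
// usage in a map of already-scraped gauge samples (metric name -> value),
// trying the new metric name first and the legacy name second.
// It returns the name that matched, its value, and whether any matched.
func resolveKVCacheUsage(samples map[string]float64) (string, float64, bool) {
	for _, candidate := range []string{kvCacheUsageMetric, legacyKVCacheUsageMetric} {
		if v, found := samples[candidate]; found {
			return candidate, v, true
		}
	}
	return "", 0, false
}

func main() {
	// Assumed input: a server that exports only the legacy metric name.
	samples := map[string]float64{legacyKVCacheUsageMetric: 0.37}
	name, value, ok := resolveKVCacheUsage(samples)
	fmt.Println(name, value, ok)
}
```

Trying the new name first means the legacy name is consulted only when the server does not export `vllm:kv_cache_usage_perc` at all, so a server exporting both is read through the current name.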
