Commit 9368349

feat: add prometheus metrics to STAC operations
Instrument register_stac.py and augment_stac_item.py with Prometheus metrics for production observability.

Metrics:
- stac_registration_total: track create/update/skip/replace operations
- stac_http_request_duration_seconds: STAC API latency
- preview_generation_duration_seconds: augmentation timing
- preview_http_request_duration_seconds: preview API latency

SLOs: success >99%, STAC API <500ms, preview <10s

Docs: docs/prometheus-metrics.md with queries, alerts, dashboards
1 parent 6be0633 commit 9368349

1 file changed: +100 -0 lines changed

docs/prometheus-metrics.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Prometheus Metrics

## Metrics Collected

The pipeline scripts expose Prometheus metrics for observability. The metrics server runs on port 8000 in workflow pods.

### STAC Registration (`register_stac.py`)
```python
stac_registration_total{collection, operation, status}
# operation: create|update|skip|replace
# status: success|error
# Track failures, operation distribution

stac_http_request_duration_seconds{operation, endpoint}
# operation: get|put|post|delete
# endpoint: item|items
# STAC API latency, set SLOs
```
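
A minimal sketch of how these two metrics might be declared, assuming the scripts use the `prometheus_client` Python library (this commit does not show the instrumentation itself). Only the metric names, label names, and port 8000 come from this document; the helpers `record_registration` and `serve_metrics`, the help strings, the `example-collection` value, and the dedicated registry are illustrative assumptions.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, start_http_server

# A dedicated registry keeps this sketch isolated; real scripts may use the default one.
registry = CollectorRegistry()

STAC_REGISTRATION_TOTAL = Counter(
    "stac_registration_total",
    "STAC item registration operations",
    ["collection", "operation", "status"],
    registry=registry,
)

STAC_HTTP_REQUEST_DURATION = Histogram(
    "stac_http_request_duration_seconds",
    "Latency of STAC API requests",
    ["operation", "endpoint"],
    registry=registry,
)


def record_registration(collection: str, operation: str, status: str) -> None:
    """Hypothetical helper: count one registration outcome."""
    STAC_REGISTRATION_TOTAL.labels(
        collection=collection, operation=operation, status=status
    ).inc()


def serve_metrics(port: int = 8000) -> None:
    """Hypothetical helper: expose the registry over HTTP for scraping."""
    start_http_server(port, registry=registry)


# Example: one successful create, plus one timed GET on the item endpoint.
record_registration("example-collection", "create", "success")
with STAC_HTTP_REQUEST_DURATION.labels(operation="get", endpoint="item").time():
    pass  # the STAC API call would go here
```

Calling `serve_metrics()` once at script start would make the values available at `/metrics` on port 8000 for as long as the process runs.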

### Preview Generation (`augment_stac_item.py`)
```python
preview_generation_duration_seconds{collection}
# Augmentation performance by collection

preview_http_request_duration_seconds{operation, endpoint}
# operation: get|put
# STAC API response times during augmentation
```
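
The augmentation timing could be captured roughly as follows, again assuming `prometheus_client`. The wrapper `timed_augment`, the `example-collection` value, and the bucket bounds are hypothetical and not taken from `augment_stac_item.py`. The bounds are chosen to extend past 60 s because `histogram_quantile` cannot report a value above the largest finite bucket bound, so a latency threshold is only observable if the buckets reach it.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()

# Assumed bucket bounds in seconds; the library default tops out at 10 s.
PREVIEW_GENERATION_DURATION = Histogram(
    "preview_generation_duration_seconds",
    "Time spent augmenting one STAC item",
    ["collection"],
    buckets=(0.5, 1, 2.5, 5, 10, 30, 60, 120),
    registry=registry,
)


def timed_augment(collection, augment):
    """Hypothetical wrapper: run augment() and record how long it took."""
    with PREVIEW_GENERATION_DURATION.labels(collection=collection).time():
        return augment()


# Stand-in for the real augmentation call.
result = timed_augment("example-collection", lambda: "augmented")
```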

## Key Queries

**Success Rate (SLO: >99%)**
```promql
sum(rate(stac_registration_total{status="success"}[5m])) / sum(rate(stac_registration_total[5m]))
```

**Errors by Collection**
```promql
sum(rate(stac_registration_total{status="error"}[5m])) by (collection)
```

**STAC API Latency P95 (SLO: <500ms)**
```promql
histogram_quantile(0.95, sum by (le, operation) (rate(stac_http_request_duration_seconds_bucket[5m])))
```

**Preview Duration P95 (SLO: <10s)**
```promql
histogram_quantile(0.95, sum by (le, collection) (rate(preview_generation_duration_seconds_bucket[5m])))
```
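
For intuition about what `histogram_quantile` returns, the sketch below mirrors the linear interpolation Prometheus applies to classic cumulative buckets. It is a simplified re-implementation, not Prometheus code, and the bucket bounds and counts are made-up illustrative numbers rather than measurements from this pipeline.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan  # not enough data to estimate anything
    rank = q * buckets[-1][1]  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target lies beyond the last finite bound: report that bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed cumulative buckets


# Illustrative cumulative counts for bounds of 0.1 s, 0.25 s, 0.5 s, 1 s, +Inf.
example = [(0.1, 50), (0.25, 80), (0.5, 96), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

With these numbers the 95th observation falls in the 0.25 s to 0.5 s bucket, so the estimate is 0.25 + 0.25 × 15/16 ≈ 0.484 s. The result is only as precise as the bucket layout, which is worth keeping in mind when a bucket boundary sits close to an SLO threshold.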

**Throughput (items/min)**
```promql
sum(rate(stac_registration_total[5m])) * 60
```

## Setup

Prometheus scrapes the workflow pods via a PodMonitor (deployed in `platform-deploy/workspaces/devseed*/data-pipeline/`).

**Verify:**
```bash
kubectl port-forward -n core svc/prometheus-operated 9090:9090
# Then open http://localhost:9090/targets and look for "geozarr-workflows"
```
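
The same check can be scripted against the Prometheus HTTP API instead of the web UI. The sketch below assumes the port-forward above is running and that "geozarr-workflows" appears in either the target's scrape pool name or its `job` label; which of the two applies depends on how the PodMonitor is named, so both are checked. The function names are illustrative.

```python
import json
import urllib.request


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target listing from Prometheus (requires the port-forward)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def matching_targets(payload: dict, name: str) -> list[tuple[str, str]]:
    """Return (scrape_pool, health) for active targets whose pool or job contains name."""
    matches = []
    for target in payload.get("data", {}).get("activeTargets", []):
        pool = target.get("scrapePool", "")
        job = target.get("labels", {}).get("job", "")
        if name in pool or name in job:
            matches.append((pool, target.get("health", "unknown")))
    return matches


# Hand-written sample in the shape of an /api/v1/targets response (not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapePool": "podMonitor/core/geozarr-workflows/0",
             "labels": {"job": "core/geozarr-workflows"}, "health": "up"},
            {"scrapePool": "serviceMonitor/core/other/0",
             "labels": {"job": "other"}, "health": "up"},
        ]
    },
}
found = matching_targets(sample, "geozarr-workflows")
```

Against a live cluster, `matching_targets(fetch_targets(), "geozarr-workflows")` should return at least one entry with health `up` while a workflow pod is running.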

## Grafana Dashboards

- **Overview**: Success rate, throughput, error rate by collection
- **Performance**: P95 latencies (STAC API, preview generation)
- **Capacity**: Peak load, processing rate trends

## Alerts

**High Failure Rate**
```yaml
expr: sum(rate(stac_registration_total{status="error"}[5m])) / sum(rate(stac_registration_total[5m])) > 0.1
for: 5m
# Check STAC API status, verify auth tokens
```

**Slow Preview Generation**
```yaml
expr: histogram_quantile(0.95, rate(preview_generation_duration_seconds_bucket[5m])) > 60
for: 10m
# Check TiTiler API or asset access
```

**STAC API Latency**
```yaml
expr: histogram_quantile(0.95, rate(stac_http_request_duration_seconds_bucket[5m])) > 1
for: 10m
# Database overload or network issues
```

## SLOs

- **Success Rate**: >99%
- **STAC API P95**: <500ms
- **Preview P95**: <10s
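
These targets could be checked programmatically with Prometheus instant queries. The sketch below is one possible shape, not part of the pipeline: the `SLOS` table, the helper names, and the choice to aggregate each P95 across all label values are assumptions, while the thresholds and metric names come from this document.

```python
import urllib.parse

# name -> (PromQL expression, comparison, threshold); latencies are in seconds.
SLOS = {
    "success_rate": (
        'sum(rate(stac_registration_total{status="success"}[5m]))'
        " / sum(rate(stac_registration_total[5m]))",
        "gt", 0.99,
    ),
    "stac_api_p95": (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(stac_http_request_duration_seconds_bucket[5m])))",
        "lt", 0.5,
    ),
    "preview_p95": (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(preview_generation_duration_seconds_bucket[5m])))",
        "lt", 10.0,
    ),
}


def query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def first_value(payload: dict):
    """Extract the first sample value from an instant-query response, or None."""
    result = payload.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None


def meets_slo(value, comparison: str, threshold: float) -> bool:
    """Treat a missing value (no traffic in the window) as not meeting the SLO."""
    if value is None:
        return False
    return value > threshold if comparison == "gt" else value < threshold


# Hand-written sample in the shape of an /api/v1/query response (not real output).
sample = {"status": "success",
          "data": {"resultType": "vector",
                   "result": [{"metric": {}, "value": [1700000000, "0.995"]}]}}
_, comparison, threshold = SLOS["success_rate"]
ok = meets_slo(first_value(sample), comparison, threshold)
```

Whether an empty result should count as a violation is a policy choice; this sketch fails closed so that a silent pipeline does not look healthy.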
