| 1 | +# Prometheus Metrics |
| 2 | + |
| 3 | +## Metrics Collected |
| 4 | + |
| 5 | +Pipeline scripts expose Prometheus metrics for observability. Each workflow pod runs a metrics server on port 8000. |
| 6 | + |
| 7 | +### STAC Registration (`register_stac.py`) |
| 8 | +```text |
| 9 | +stac_registration_total{collection, operation, status} |
| 10 | +# operation: create|update|skip|replace |
| 11 | +# status: success|error |
| 12 | +# Use: track failures and the distribution of operations |
| 13 | + |
| 14 | +stac_http_request_duration_seconds{operation, endpoint} |
| 15 | +# operation: get|put|post|delete |
| 16 | +# endpoint: item|items |
| 17 | +# Use: measure STAC API latency and set SLOs against it |
| 18 | +``` |
| 19 | + |
| 20 | +### Preview Generation (`augment_stac_item.py`) |
| 21 | +```text |
| 22 | +preview_generation_duration_seconds{collection} |
| 23 | +# Augmentation performance by collection |
| 24 | + |
| 25 | +preview_http_request_duration_seconds{operation, endpoint} |
| 26 | +# operation: get|put |
| 27 | +# STAC API response times during augmentation |
| 28 | +``` |
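
The STAC registration metrics listed above could be declared roughly as follows. This is a minimal sketch that assumes the scripts use the `prometheus_client` library; the metric and label names come from the list above, while `register_item`, `put_item`, `serve_metrics`, and the help strings are hypothetical and not taken from `register_stac.py`.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric and label names follow the list above; help strings are illustrative.
STAC_REGISTRATION = Counter(
    "stac_registration_total",
    "STAC item registration attempts",
    ["collection", "operation", "status"],
)
STAC_HTTP_DURATION = Histogram(
    "stac_http_request_duration_seconds",
    "Duration of STAC API requests",
    ["operation", "endpoint"],
)


def register_item(collection, put_item):
    """Hypothetical wrapper: time one PUT request and count its outcome."""
    status = "success"
    try:
        # .time() observes the elapsed seconds when the block exits.
        with STAC_HTTP_DURATION.labels(operation="put", endpoint="item").time():
            put_item()
    except Exception:
        status = "error"
        raise
    finally:
        # Exactly one increment per attempt, labelled with the final status.
        STAC_REGISTRATION.labels(
            collection=collection, operation="update", status=status
        ).inc()


def serve_metrics(port=8000):
    """Expose /metrics over HTTP; 8000 matches the port named above."""
    start_http_server(port)
```

A script would call `serve_metrics()` once at startup and then wrap each STAC API call in the same way as `register_item`.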
| 29 | + |
| 30 | +## Key Queries |
| 31 | + |
| 32 | +**Success Rate (SLO: >99%)** |
| 33 | +```promql |
| 34 | +sum(rate(stac_registration_total{status="success"}[5m])) / sum(rate(stac_registration_total[5m])) |
| 35 | +``` |
| 36 | + |
| 37 | +**Errors by Collection** |
| 38 | +```promql |
| 39 | +sum(rate(stac_registration_total{status="error"}[5m])) by (collection) |
| 40 | +``` |
| 41 | + |
| 42 | +**STAC API Latency P95 (SLO: <500ms)** |
| 43 | +```promql |
| 44 | +histogram_quantile(0.95, sum by (le, operation) (rate(stac_http_request_duration_seconds_bucket[5m]))) |
| 45 | +``` |
| 46 | + |
| 47 | +**Preview Duration P95 (SLO: <10s)** |
| 48 | +```promql |
| 49 | +histogram_quantile(0.95, sum by (le, collection) (rate(preview_generation_duration_seconds_bucket[5m]))) |
| 50 | +``` |
| 51 | + |
| 52 | +**Throughput (items/min)** |
| 53 | +```promql |
| 54 | +sum(rate(stac_registration_total[5m])) * 60 |
| 55 | +``` |
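
To make the P95 queries above easier to interpret, here is a simplified Python sketch of how `histogram_quantile` estimates a quantile from cumulative classic-histogram buckets. It follows the documented behaviour (linear interpolation inside the matching bucket, with a lower bound of 0 for the first bucket) but is not Prometheus's actual implementation, and the bucket bounds and counts in the usage example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` label on `*_bucket` series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative latency buckets in seconds: 100 requests in total.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

With these sample buckets the estimate lands between 0.5 s and 1.0 s, at roughly 0.81 s. Because the estimate depends on the bucket boundaries, the `le` label has to survive any aggregation applied before `histogram_quantile`, and the result is only as precise as the bucket layout.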
| 56 | + |
| 57 | +## Setup |
| 58 | + |
| 59 | +Prometheus scrapes the workflow pods through a PodMonitor, which is deployed from `platform-deploy/workspaces/devseed*/data-pipeline/`. |
| 60 | + |
| 61 | +**Verify:** |
| 62 | +```bash |
| 63 | +kubectl port-forward -n core svc/prometheus-operated 9090:9090 |
| 64 | +# Then open http://localhost:9090/targets and look for the "geozarr-workflows" target |
| 65 | +``` |
| 66 | + |
| 67 | +## Grafana Dashboards |
| 68 | + |
| 69 | +- **Overview**: Success rate, throughput, error rate by collection |
| 70 | +- **Performance**: P95 latencies (STAC API, preview generation) |
| 71 | +- **Capacity**: Peak load, processing rate trends |
| 72 | + |
| 73 | +## Alerts |
| 74 | + |
| 75 | +**High Failure Rate** |
| 76 | +```yaml |
| 77 | +expr: sum(rate(stac_registration_total{status="error"}[5m])) / sum(rate(stac_registration_total[5m])) > 0.1 |
| 78 | +for: 5m |
| 79 | +# Check STAC API status, verify auth tokens |
| 80 | +``` |
| 81 | + |
| 82 | +**Slow Preview Generation** |
| 83 | +```yaml |
| 84 | +expr: histogram_quantile(0.95, rate(preview_generation_duration_seconds_bucket[5m])) > 60 |
| 85 | +for: 10m |
| 86 | +# Check TiTiler API or asset access |
| 87 | +``` |
| 88 | + |
| 89 | +**STAC API Latency** |
| 90 | +```yaml |
| 91 | +expr: histogram_quantile(0.95, rate(stac_http_request_duration_seconds_bucket[5m])) > 1 |
| 92 | +for: 10m |
| 93 | +# Likely causes: database overload or network issues |
| 94 | +``` |
| 95 | + |
| 96 | +## SLOs |
| 97 | + |
| 98 | +- **Success Rate**: >99% |
| 99 | +- **STAC API P95**: <500ms |
| 100 | +- **Preview P95**: <10s |