Commit cec4270

improve readme

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

Author: Arvind Thirumurugan
1 parent 871a120 commit cec4270

1 file changed: +93 -9 lines changed

approval-controller-metric-collector/README.md

Lines changed: 93 additions & 9 deletions
@@ -10,6 +10,93 @@ This directory contains two controllers:
 
 ![Approval Controller and Metric Collector Architecture](./approval-controller-metric-collector.drawio.png)
 
+## How It Works
+
+### Custom Resource Definitions (CRDs)
+
+This solution introduces three new CRDs that work together with KubeFleet's native resources:
+
+#### Hub Cluster CRDs
+
+1. **MetricCollector** (cluster-scoped)
+   - Defines Prometheus connection details and where to report metrics
+   - Gets propagated to member clusters via ClusterResourcePlacement (CRP)
+   - Each member cluster receives a customized version with its specific `reportNamespace`
+
+2. **MetricCollectorReport** (namespaced)
+   - Created by metric-collector on member clusters, reported back to the hub
+   - Lives in `fleet-member-<cluster-name>` namespaces on the hub
+   - Contains collected `workload_health` metrics for all workloads in a cluster
+   - Updated every 30 seconds by the metric collector
+
+3. **WorkloadTracker** (cluster-scoped)
+   - Defines which workloads to monitor and their health thresholds
+   - Specifies namespace, workload name, and expected health status
+   - Used by approval-request-controller to determine whether a stage is ready for approval
+
+### Automated Approval Flow
+
+1. **Stage Initialization**
+   - User creates an UpdateRun (`ClusterStagedUpdateRun` or `StagedUpdateRun`) on the hub
+   - KubeFleet creates an ApprovalRequest (`ClusterApprovalRequest` or `ApprovalRequest`) for the first stage
+   - The ApprovalRequest enters the "Pending" state, waiting for approval
+
+2. **Metric Collector Deployment**
+   - Approval-request-controller watches the ApprovalRequest
+   - Creates a `MetricCollector` resource on the hub (cluster-scoped)
+   - Creates a `ClusterResourceOverride` with per-cluster customization rules
+     - Each cluster gets a unique `reportNamespace`: `fleet-member-<cluster-name>`
+   - Creates a `ClusterResourcePlacement` (CRP) with `PickFixed` policy
+     - Targets all clusters in the current stage
+   - KubeFleet propagates the customized `MetricCollector` to each member cluster
+
+3. **Metric Collection on Member Clusters**
+   - Metric-collector controller runs on each member cluster
+   - Every 30 seconds, it:
+     - Queries local Prometheus with PromQL: `workload_health`
+     - Prometheus returns metrics for all pods with `prometheus.io/scrape: "true"` annotation
+     - Extracts workload health (1.0 = healthy, 0.0 = unhealthy)
+     - Creates/updates `MetricCollectorReport` on hub in `fleet-member-<cluster-name>` namespace
+
+4. **Health Evaluation**
+   - Approval-request-controller monitors `MetricCollectorReports` from all stage clusters
+   - Every 15 seconds, it:
+     - Fetches the `WorkloadTracker` to know which workloads to check
+     - For each cluster in the stage:
+       - Reads its `MetricCollectorReport` from `fleet-member-<cluster-name>` namespace
+       - Verifies all tracked workloads are present and healthy
+     - If any workload is missing or unhealthy, waits for next cycle
+     - If ALL workloads across ALL clusters are healthy:
+       - Sets ApprovalRequest condition `Approved: True`
+       - KubeFleet proceeds to roll out the stage
+
+5. **Stage Progression**
+   - KubeFleet applies the update to the approved stage clusters
+   - Creates a new ApprovalRequest for the next stage (if any)
+   - The cycle repeats for each stage
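
Steps 3 and 4 of the flow above can be sketched in Python. This is an illustration under stated assumptions, not the controllers' actual code: the function names are hypothetical, the `namespace` and `app` label names are assumed, and reports are modeled as plain dictionaries rather than `MetricCollectorReport` objects. Only the response shape (an instant-query vector whose sample values arrive as strings) follows the Prometheus HTTP API.

```python
def extract_workload_health(response: dict) -> dict:
    """Turn a Prometheus instant-query response into {(namespace, workload): healthy}."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    health = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, raw_value = sample["value"]  # the sample value arrives as a string
        # Label names are assumptions; adjust them to the labels the metric exposes.
        key = (labels.get("namespace", ""), labels.get("app", ""))
        health[key] = float(raw_value) == 1.0  # 1.0 = healthy, 0.0 = unhealthy
    return health


def stage_is_healthy(tracked, reports, stage_clusters) -> bool:
    """True only if every tracked workload is present and healthy on every stage cluster."""
    for cluster in stage_clusters:
        report = reports.get(f"fleet-member-{cluster}")
        if report is None:
            return False  # no report yet, so wait for the next cycle
        for workload in tracked:
            if not report.get(workload, False):  # a missing workload counts as unhealthy
                return False
    return True


def make_sample(namespace: str, app: str, value: str) -> dict:
    """Build one vector sample the way Prometheus encodes it."""
    return {
        "metric": {"__name__": "workload_health", "namespace": namespace, "app": app},
        "value": [1700000000.0, value],
    }


def make_response(*samples: dict) -> dict:
    """Wrap samples in a successful instant-query response envelope."""
    return {"status": "success", "data": {"resultType": "vector", "result": list(samples)}}


tracked = [("test-ns", "sample-app")]
reports = {
    "fleet-member-cluster-1": extract_workload_health(
        make_response(make_sample("test-ns", "sample-app", "1"))
    ),
    "fleet-member-cluster-2": extract_workload_health(
        make_response(make_sample("test-ns", "sample-app", "0"))
    ),
}
print(stage_is_healthy(tracked, reports, ["cluster-1"]))               # True
print(stage_is_healthy(tracked, reports, ["cluster-1", "cluster-2"]))  # False
```

Treating a missing report or a missing workload as "not healthy yet" mirrors the "waits for next cycle" behavior described in step 4.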
+
+### Key Design Decisions
+
+**Why ClusterResourceOverride?**
+- Each member cluster needs to report to a different namespace on the hub
+- The override injects the cluster-specific `reportNamespace` before deployment
+- This allows a single MetricCollector definition to work across all clusters
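
The per-cluster customization can be illustrated with a short Python sketch. `report_namespace` and `build_override_rules` are hypothetical helpers, and the rule dictionaries are a simplified stand-in rather than the real `ClusterResourceOverride` schema. Only the `fleet-member-<cluster-name>` naming comes from the README.

```python
def report_namespace(cluster_name: str) -> str:
    """Hub namespace a member cluster reports into (naming taken from the README)."""
    return f"fleet-member-{cluster_name}"


def build_override_rules(stage_clusters: list[str]) -> list[dict]:
    """One simplified rule per cluster in the current stage.

    The dict layout is hypothetical; it only shows that every cluster
    gets its own reportNamespace value from a single shared definition.
    """
    return [
        {"cluster": name, "reportNamespace": report_namespace(name)}
        for name in stage_clusters
    ]


rules = build_override_rules(["cluster-1", "cluster-2"])
for rule in rules:
    print(rule["cluster"], "->", rule["reportNamespace"])
```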
+
+**Why PickFixed Placement Policy?**
+- Stages may target different subsets of clusters
+- PickFixed ensures MetricCollector only deploys to clusters in the current stage
+- Avoids collecting metrics from clusters not involved in the stage
+
+**Why 15-second polling for approval?**
+- Balances responsiveness with control plane load
+- Gives clusters time to stabilize after rollout
+- Allows detection of workload health degradation
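
The polling behavior can be sketched as a simple loop in Python. This is an assumption-laden illustration: `poll_until_approved` is a hypothetical name, and a Kubernetes controller would more typically requeue its reconcile call after the interval than run a blocking loop. The `sleep` parameter is injectable so the sketch can run without actually waiting.

```python
import time


def poll_until_approved(check, interval_seconds=15.0, max_cycles=None, sleep=time.sleep):
    """Call check() once per cycle until it returns True.

    Returns the number of cycles used, or None if max_cycles ran out first.
    """
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        cycle += 1
        if check():
            return cycle
        sleep(interval_seconds)  # wait before re-evaluating workload health
    return None


# Simulated health results: unhealthy twice, then healthy.
results = iter([False, False, True])
cycles = poll_until_approved(lambda: next(results), sleep=lambda _seconds: None)
print(f"approved after {cycles} cycles")  # approved after 3 cycles
```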
+
+**Why cluster-scoped MetricCollector?**
+- Simplifies propagation via CRP (no namespace matching issues)
+- Single resource definition covers all namespaces
+- Consistent with KubeFleet's placement model
+
 ## Prerequisites
 
 - Docker or Podman for building images
@@ -264,22 +351,19 @@ kubectl get clusterapprovalrequest -A
 kubectl describe clusterapprovalrequest <approval-request-name>
```
 
-## How It Works
-
-1. **Metric Collection**: The standalone-metric-collector on each member cluster queries Prometheus for `workload_health` metrics
-2. **Report Creation**: Collectors create MetricCollectorReport resources on the hub cluster
-3. **Health Monitoring**: The approval-request-controller watches ApprovalRequest resources and corresponding MetricCollectorReports
-4. **Automatic Approval**: When all workloads meet health thresholds defined in WorkloadTracker specs, the controller approves the staged update
-
 ## Configuration
 
 ### Approval Request Controller
 - Located in `approval-request-controller/charts/approval-request-controller/values.yaml`
 - Key settings: log level, resource limits, RBAC, CRD installation
+- Default Prometheus URL: `http://prometheus.prometheus.svc.cluster.local:9090`
+- Reconciliation interval: 15 seconds
 
 ### Metric Collector
-- Located in `standalone-metric-collector/charts/metric-collector/values.yaml`
-- Key settings: hub cluster URL, Prometheus URL, member cluster name, sync interval
+- Located in `metric-collector/charts/metric-collector/values.yaml`
+- Key settings: hub cluster URL, Prometheus URL, member cluster name
+- Metric collection interval: 30 seconds
+- Connects to the hub using a service account token
 
 ## Troubleshooting
 