@@ -10,6 +10,93 @@ This directory contains two controllers:
 
 ![Approval Controller and Metric Collector Architecture](./approval-controller-metric-collector.drawio.png)
 
+## How It Works
+
+### Custom Resource Definitions (CRDs)
+
+This solution introduces three new CRDs that work together with KubeFleet's native resources:
+
+#### Hub Cluster CRDs
+
+1. **MetricCollector** (cluster-scoped)
+   - Defines Prometheus connection details and where to report metrics
+   - Gets propagated to member clusters via ClusterResourcePlacement (CRP)
+   - Each member cluster receives a customized version with its specific `reportNamespace`
+
+2. **MetricCollectorReport** (namespaced)
+   - Created on the hub by the metric-collector running on each member cluster
+   - Lives in `fleet-member-<cluster-name>` namespaces on the hub
+   - Contains collected `workload_health` metrics for all workloads in a cluster
+   - Updated every 30 seconds by the metric collector
+
+3. **WorkloadTracker** (cluster-scoped)
+   - Defines which workloads to monitor and their health thresholds
+   - Specifies namespace, workload name, and expected health status
+   - Used by approval-request-controller to determine whether a stage is ready for approval
+
+### Automated Approval Flow
+
+1. **Stage Initialization**
+   - User creates an UpdateRun (`ClusterStagedUpdateRun` or `StagedUpdateRun`) on the hub
+   - KubeFleet creates an ApprovalRequest (`ClusterApprovalRequest` or `ApprovalRequest`) for the first stage
+   - The ApprovalRequest enters "Pending" state, waiting for approval
+
+2. **Metric Collector Deployment**
+   - Approval-request-controller watches the pending ApprovalRequest
+   - Creates a `MetricCollector` resource on the hub (cluster-scoped)
+   - Creates a `ClusterResourceOverride` with per-cluster customization rules
+     - Each cluster gets a unique `reportNamespace`: `fleet-member-<cluster-name>`
+   - Creates a `ClusterResourcePlacement` (CRP) with `PickFixed` policy
+     - Targets all clusters in the current stage
+   - KubeFleet propagates the customized `MetricCollector` to each member cluster
+
+3. **Metric Collection on Member Clusters**
+   - Metric-collector controller runs on each member cluster
+   - Every 30 seconds, it:
+     - Queries local Prometheus with PromQL: `workload_health`
+     - Receives metrics for all pods with the `prometheus.io/scrape: "true"` annotation
+     - Extracts workload health (1.0 = healthy, 0.0 = unhealthy)
+     - Creates/updates the `MetricCollectorReport` on the hub in the `fleet-member-<cluster-name>` namespace
+
+4. **Health Evaluation**
+   - Approval-request-controller monitors `MetricCollectorReport` resources from all stage clusters
+   - Every 15 seconds, it:
+     - Fetches the `WorkloadTracker` to know which workloads to check
+     - For each cluster in the stage:
+       - Reads its `MetricCollectorReport` from the `fleet-member-<cluster-name>` namespace
+       - Verifies all tracked workloads are present and healthy
+     - If any workload is missing or unhealthy, waits for the next cycle
+     - If ALL workloads across ALL clusters are healthy:
+       - Sets ApprovalRequest condition `Approved: True`
+       - KubeFleet proceeds to roll out the stage
+
+5. **Stage Progression**
+   - KubeFleet applies the update to the approved stage clusters
+   - Creates a new ApprovalRequest for the next stage (if any)
+   - The cycle repeats for each stage
+
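
The health evaluation in step 4 can be sketched roughly as follows. This is an illustrative Python model, not the controller's actual code: `TrackedWorkload`, `cluster_is_healthy`, `stage_is_approved`, and the dictionary that stands in for a `MetricCollectorReport` are all hypothetical names and shapes chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedWorkload:
    """One workload listed in the WorkloadTracker (field names are assumed)."""
    namespace: str
    name: str


def cluster_is_healthy(tracked, report):
    """Return True only if every tracked workload is present and healthy.

    `report` maps (namespace, name) -> workload_health value and stands in
    for one cluster's MetricCollectorReport (hypothetical shape).
    """
    for workload in tracked:
        health = report.get((workload.namespace, workload.name))
        if health is None or health < 1.0:
            return False  # missing or unhealthy: wait for the next cycle
    return True


def stage_is_approved(tracked, reports_by_cluster, stage_clusters):
    """Approve only when all tracked workloads are healthy on all stage clusters."""
    for cluster in stage_clusters:
        report = reports_by_cluster.get(cluster)
        if report is None or not cluster_is_healthy(tracked, report):
            return False
    return True


tracked = [TrackedWorkload("test-ns", "sample-app")]
reports = {
    "cluster-1": {("test-ns", "sample-app"): 1.0},
    "cluster-2": {("test-ns", "sample-app"): 0.0},
}
print(stage_is_approved(tracked, reports, ["cluster-1"]))
print(stage_is_approved(tracked, reports, ["cluster-1", "cluster-2"]))
```

In this sketch a cluster with no report at all is treated the same as an unhealthy one, which matches the "missing or unhealthy" rule above.
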
+### Key Design Decisions
+
+**Why ClusterResourceOverride?**
+- Each member cluster needs to report to a different namespace on the hub
+- The override injects the cluster-specific `reportNamespace` before deployment
+- This allows a single MetricCollector definition to work across all clusters
+
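
As a rough illustration of the per-cluster customization, the sketch below derives each cluster's `reportNamespace` from the `fleet-member-<cluster-name>` convention. The function names and the returned dictionary are hypothetical; they do not reflect the real `ClusterResourceOverride` schema, only the mapping it is meant to produce.

```python
def report_namespace(cluster_name: str) -> str:
    """Hub namespace a member cluster reports into (convention from this README)."""
    return f"fleet-member-{cluster_name}"


def per_cluster_overrides(stage_clusters):
    """Map each cluster to the value an override would inject (illustrative shape)."""
    return {
        cluster: {"reportNamespace": report_namespace(cluster)}
        for cluster in stage_clusters
    }


overrides = per_cluster_overrides(["cluster-1", "cluster-2"])
for cluster, patch in overrides.items():
    print(cluster, patch["reportNamespace"])
```
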
+**Why PickFixed Placement Policy?**
+- Stages may target different subsets of clusters
+- PickFixed ensures MetricCollector only deploys to clusters in the current stage
+- Avoids collecting metrics from clusters not involved in the stage
+
+**Why 15-second polling for approval?**
+- Balances responsiveness with control plane load
+- Gives clusters time to stabilize after rollout
+- Allows detection of workload health degradation
+
+**Why cluster-scoped MetricCollector?**
+- Simplifies propagation via CRP (no namespace matching issues)
+- Single resource definition covers all namespaces
+- Consistent with KubeFleet's placement model
+
 ## Prerequisites
 
 - Docker or Podman for building images
@@ -264,22 +351,19 @@ kubectl get clusterapprovalrequest -A
 kubectl describe clusterapprovalrequest <approval-request-name>
 ```
 
-## How It Works
-
-1. **Metric Collection**: The standalone-metric-collector on each member cluster queries Prometheus for `workload_health` metrics
-2. **Report Creation**: Collectors create MetricCollectorReport resources on the hub cluster
-3. **Health Monitoring**: The approval-request-controller watches ApprovalRequest resources and corresponding MetricCollectorReports
-4. **Automatic Approval**: When all workloads meet health thresholds defined in WorkloadTracker specs, the controller approves the staged update
-
 ## Configuration
 
 ### Approval Request Controller
 - Located in `approval-request-controller/charts/approval-request-controller/values.yaml`
 - Key settings: log level, resource limits, RBAC, CRD installation
+- Default Prometheus URL: `http://prometheus.prometheus.svc.cluster.local:9090`
+- Reconciliation interval: 15 seconds
 
 ### Metric Collector
-- Located in `standalone-metric-collector/charts/metric-collector/values.yaml`
-- Key settings: hub cluster URL, Prometheus URL, member cluster name, sync interval
+- Located in `metric-collector/charts/metric-collector/values.yaml`
+- Key settings: hub cluster URL, Prometheus URL, member cluster name
+- Metric collection interval: 30 seconds
+- Connects to hub using service account token
 
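
To make the Prometheus settings more concrete, here is a minimal sketch of how a `workload_health` instant query could be built and its response parsed. It relies on Prometheus's standard `/api/v1/query` HTTP endpoint and response format. The helper names, the sample payload, and the assumption that workloads are identified by `namespace` and `app` labels are illustrative and not taken from the collector's code.

```python
import json
from urllib.parse import urlencode

# Default URL listed in the configuration above.
PROMETHEUS_URL = "http://prometheus.prometheus.svc.cluster.local:9090"


def build_query_url(base_url, promql="workload_health"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_workload_health(body):
    """Turn an instant-query response into {(namespace, app): health}.

    The `namespace` and `app` label names are assumptions for this sketch.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    results = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        key = (labels.get("namespace", ""), labels.get("app", ""))
        results[key] = float(value)
    return results


# Hand-written sample response in the documented instant-vector format.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "workload_health",
                           "namespace": "test-ns", "app": "sample-app"},
                "value": [1700000000.0, "1"],
            }
        ],
    },
})

print(build_query_url(PROMETHEUS_URL))
print(parse_workload_health(sample_body))
```

The sketch parses a canned response rather than calling the network, so it can be run outside a cluster.
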
 ## Troubleshooting
 