@Arvindthiru Arvindthiru commented Dec 6, 2025

This PR introduces a complete solution for automating approval decisions in KubeFleet staged rollouts based on workload health metrics from Prometheus.

What's Added:

Two Standalone Controllers:

  • Approval-Request-Controller (hub cluster): Watches ApprovalRequests/ClusterApprovalRequests, creates MetricCollectorReport resources directly in fleet-member-* namespaces on the hub, evaluates workload health, and auto-approves stages when all tracked workloads are healthy
  • Metric-Collector (member clusters): Connects to the hub cluster to watch the MetricCollectorReport in its fleet-member namespace, queries the local Prometheus every 30 seconds for workload health metrics, and updates the report status on the hub
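
As a rough illustration of the collection step, the sketch below decodes a Prometheus instant-query response (the JSON shape returned by `/api/v1/query`) into a per-workload health map. The convention that a sample value of 1 means healthy, the `namespace` and `app` label names, and the function name `parseHealth` are assumptions made for illustration; they are not taken from this PR's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promResponse mirrors the shape of a Prometheus /api/v1/query
// response for an instant vector.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			// Value is [unix timestamp, "sample value as a string"].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// parseHealth maps "namespace/app" to a healthy flag.
// Assumption: the queried gauge reports 1 for healthy and 0 otherwise.
func parseHealth(body []byte) (map[string]bool, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query status %q", resp.Status)
	}
	health := make(map[string]bool)
	for _, sample := range resp.Data.Result {
		raw, ok := sample.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected sample value %v", sample.Value[1])
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, fmt.Errorf("parse sample value: %w", err)
		}
		// Hypothetical label names; a real query may expose different ones.
		key := sample.Metric["namespace"] + "/" + sample.Metric["app"]
		health[key] = v == 1
	}
	return health, nil
}

func main() {
	// Canned response so the example runs without a Prometheus server.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"namespace":"test-ns","app":"sample-app"},"value":[1765360000.0,"1"]}]}}`)
	health, err := parseHealth(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(health)
}
```

In a collector loop, a function like this would run on each 30-second tick and its result would be written into the report status on the hub.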

Custom Resources:

  • MetricCollectorReport (hub cluster): Created by approval-request-controller in fleet-member-* namespaces; contains the Prometheus URL in its spec and the collected health metrics in its status, which is updated by the metric-collector running on member clusters
  • ClusterStagedWorkloadTracker: Specifies which workloads must be healthy before approving stages in ClusterStagedUpdateRun (cluster-scoped)
  • StagedWorkloadTracker: Specifies which workloads must be healthy before approving stages in StagedUpdateRun (namespace-scoped)
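
The relationship between a tracker and the per-cluster reports can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the type and field names (`WorkloadRef`, `ClusterReport`, `stageHealthy`) are hypothetical stand-ins rather than the CRD schema in this PR, and refusing approval when nothing is tracked or no report exists is a design choice made here, not confirmed behavior.

```go
package main

import "fmt"

// WorkloadRef identifies one workload that a tracker requires to be healthy.
// Hypothetical shape, not the actual tracker spec.
type WorkloadRef struct {
	Namespace string
	Name      string
}

// ClusterReport is a simplified stand-in for the status of one
// MetricCollectorReport, keyed by workload.
type ClusterReport struct {
	Cluster string
	Healthy map[WorkloadRef]bool
}

// stageHealthy returns true only when every tracked workload is reported
// healthy on every cluster in the stage. The string explains a refusal.
func stageHealthy(tracked []WorkloadRef, reports []ClusterReport) (bool, string) {
	if len(tracked) == 0 {
		return false, "no workloads are tracked"
	}
	if len(reports) == 0 {
		return false, "no reports available for the stage"
	}
	for _, r := range reports {
		for _, w := range tracked {
			healthy, found := r.Healthy[w]
			if !found {
				return false, fmt.Sprintf("%s/%s has no metric on %s", w.Namespace, w.Name, r.Cluster)
			}
			if !healthy {
				return false, fmt.Sprintf("%s/%s is unhealthy on %s", w.Namespace, w.Name, r.Cluster)
			}
		}
	}
	return true, ""
}

func main() {
	app := WorkloadRef{Namespace: "test-ns", Name: "sample-app"}
	reports := []ClusterReport{
		{Cluster: "member-1", Healthy: map[WorkloadRef]bool{app: true}},
		{Cluster: "member-2", Healthy: map[WorkloadRef]bool{app: false}},
	}
	ok, reason := stageHealthy([]WorkloadRef{app}, reports)
	fmt.Println(ok, reason)
}
```

Treating a missing metric the same as an unhealthy one keeps the check conservative: a stage is approved only on positive evidence from every cluster.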

Architecture:

  • Approval-request-controller creates MetricCollectorReport resources on the hub (nothing is deployed to members)
  • Metric-collector on each member connects to the hub using a service account token
  • Simple token-based authentication with no certificate or CA verification
  • Approval controller checks health every 15 seconds; metric collector updates every 30 seconds

Build & Deployment:

  • Makefile with commands for building all three Docker images (approval-request-controller, metric-collector, metric-app)
  • Automated installation scripts that can be run from the approval-request-metric-collector directory
  • Scripts handle service account creation, RBAC setup, and Helm deployment

Documentation:

  • Main tutorial with a complete end-to-end setup guide, including ACR setup
  • Controller-specific READMEs
  • Example configurations for Prometheus, staged updates, and workload tracking
  • Detailed architecture diagrams and flow explanations

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Arvind Thirumurugan added 2 commits December 10, 2025 02:03
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
@Arvindthiru Arvindthiru marked this pull request as ready for review December 10, 2025 10:38
Copilot AI review requested due to automatic review settings December 10, 2025 10:38
Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive solution for automating approval decisions in KubeFleet staged rollouts based on workload health metrics from Prometheus. The implementation adds two standalone controllers (approval-request-controller on hub, metric-collector on members) and four custom resources to enable automated staged rollout approvals.

Key Changes:

  • Two standalone Kubernetes controllers for metric-based approval automation
  • Four new CRDs for metric collection and workload tracking
  • Complete documentation and installation scripts for both controllers
  • Integration with KubeFleet v0.1.2 for staged update orchestration

Reviewed changes

Copilot reviewed 64 out of 67 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| approval-request-controller/go.mod | Module definition with invalid Go version 1.24.9 |
| approval-request-controller/pkg/controller/controller.go | Main approval logic that watches ApprovalRequests and auto-approves based on metrics |
| approval-request-controller/apis/metric/v1alpha1/*.go | Custom resource type definitions for MetricCollector, Reports, and WorkloadTrackers |
| metric-collector/go.mod | Module definition with invalid Go version 1.24.9 |
| metric-collector/pkg/controller/*.go | Member cluster controller for collecting Prometheus metrics |
| /docker/.Dockerfile | Container build files using invalid Go 1.24 base images |
| /install-on-.sh | Installation scripts for hub and member cluster deployments |
| /charts/ | Helm charts for deploying both controllers |
| /examples/ | Example configurations for Prometheus, CRPs, and workload trackers |
| README.md | Comprehensive tutorial covering setup, architecture, and usage |

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Arvind Thirumurugan added 5 commits December 10, 2025 16:47
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Contributor

@michaelawyu michaelawyu left a comment

I've added some comments, PTAL

@michaelawyu
Contributor

Hi Arvind! Just my two cents at a high level:

a) Architecture-wise, the design seems a bit too complex. For example, the whole metric data-passing process could be done easily with one API, but it currently uses two separate APIs plus the CRP/override API to complete the job.

@michaelawyu
Contributor

b) I understand that it's demo code, so we want to focus more on the showcasing side, and that's probably why the controller basically expects one static metric (gauge type) from the host cluster. If that's the case, we should be quite straightforward about it in the code and in the doc, and the API should be greatly simplified. Alternatively, we could allow users to specify custom queries, which would make the code more useful (and more complex, of course).

@michaelawyu
Contributor

c) The folder structure could use some work. I feel that an organization like our main repo's would be more comprehensible; currently everything is a bit scattered (with soft links connecting the duplicates), e.g., the APIs are all kept in the approval controller part. Doc-wise, I fear that users without enough context might find it difficult to grasp what the demo is really for.

Arvind Thirumurugan added 5 commits December 11, 2025 15:29
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Arvind Thirumurugan added 3 commits December 14, 2025 23:37
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Arvind Thirumurugan added 18 commits December 15, 2025 11:54
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
4 participants