Description
Summary
We just had an incident in which the ClusterResourcePlacementScheduled condition on our CRP objects flipped to Unknown, probably without good reason. The CRP controller then requeued the objects, which clogged the workqueue.
We were propagating ~10,000 namespaces with the hub agent, and it took the controller, running 40 workers, 3 hours to finish reconciling these objects.
Background
Step 1
We decided to change the CRP.spec.strategy.rollingUpdate.maxUnavailable field of ~10,000 CRPs to a different value (we manage the CRP objects with another in-house controller).
Step 2
This change caused the ClusterResourcePlacementScheduled condition to change:
  conditions:
  - type: ClusterResourcePlacementScheduled
-   lastTransitionTime: "2025-10-08T17:24:04Z"
-   message: found all the clusters needed as specified by the scheduling policy
-   observedGeneration: 6
-   reason: SchedulingPolicyFulfilled
-   status: "True"
+   lastTransitionTime: "2025-10-08T17:55:42Z"
+   message: Scheduling has not completed
+   observedGeneration: 7
+   reason: SchedulePending
+   status: Unknown
This happens at
kubefleet/pkg/controllers/placement/controller.go
Lines 1194 to 1202 in 39ed26b
		latestSchedulingPolicySnapshot.GetPolicySnapshotStatus().ObservedCRPGeneration < placementObj.GetGeneration() ||
		scheduledCondition.Status == metav1.ConditionUnknown {
		return metav1.Condition{
			Status:             metav1.ConditionUnknown,
			Type:               getPlacementScheduledConditionType(placementObj),
			Reason:             condition.SchedulingUnknownReason,
			Message:            "Scheduling has not completed",
			ObservedGeneration: placementObj.GetGeneration(),
		}
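One plausible reading of the two clauses visible in that snippet, offered as an assumption rather than a confirmed diagnosis: any spec edit bumps the object's metadata.generation, so until the scheduler records the new generation on the policy snapshot, the first comparison holds and the condition is reported as Unknown. The sketch below reproduces only that comparison with hypothetical stand-in types (`snapshot`, `placement`, `scheduledStatus`); it is not the kubefleet implementation, and it ignores whatever clauses precede the quoted lines.

```go
package main

import "fmt"

// snapshot and placement are hypothetical stand-ins for the policy snapshot
// and the CRP object; only the generation fields matter for this sketch.
type snapshot struct{ observedCRPGeneration int64 }
type placement struct{ generation int64 }

// scheduledStatus mirrors the two clauses visible in the quoted snippet:
// if the snapshot has not observed the latest CRP generation, or the
// snapshot's own scheduled status is already Unknown, report "Unknown".
// Otherwise it passes the snapshot's status through unchanged.
func scheduledStatus(s snapshot, p placement, snapshotStatus string) string {
	if s.observedCRPGeneration < p.generation || snapshotStatus == "Unknown" {
		return "Unknown"
	}
	return snapshotStatus
}

func main() {
	s := snapshot{observedCRPGeneration: 6}
	p := placement{generation: 6}
	fmt.Println(scheduledStatus(s, p, "True")) // True

	// A spec edit (for example to rollingUpdate.maxUnavailable) bumps the generation.
	p.generation++
	fmt.Println(scheduledStatus(s, p, "True")) // Unknown

	// Once the snapshot catches up, the condition can return to True.
	s.observedCRPGeneration = p.generation
	fmt.Println(scheduledStatus(s, p, "True")) // True
}
```

Under this reading, the placement policy itself (PickFixed or otherwise) would not matter; the flip would follow from the generation bump alone, which matches the observedGeneration going from 6 to 7 in the diff above.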
Mind you, we only have a very basic placement rule, so we don't even know why these are being set to Unknown:
spec:
  policy:
    clusterNames:
    - lit-lca1-1-k8s-1
    placementType: PickFixed
Step 3
The hub agent's controller sees the changed objects and adds them to the queue, and the workqueue starts building up.
Step 4
The logs of the hub agent's cluster-resource-placement-controller-v1beta1 are flooded with the following message, which says the scheduling condition is Unknown:
clusterresourceplacement/controller.go:280] "Scheduler has not scheduled any cluster yet and requeue the request as a backup"
clusterResourcePlacement="proxima-grpc-pod-26526" scheduledCondition="&Condition{Type:ClusterResourcePlacementScheduled,Status:Unknown,ObservedGeneration:4,LastTransitionTime:2025-10-07 23:16:32 +0000 UTC,Reason:SchedulePending,Message:Scheduling has not completed,}"
generation=4
...
The controller then proceeds to requeue everything.
tl;dr
Setting the ClusterResourcePlacementScheduled condition to Unknown on every CRP.spec change essentially leaves the hub controller unable to keep up with 10,000 objects, even with 40 workers (because it uses only 1/10 of them, see #273).

