ClusterResourcePlacementScheduled condition gets reset in CRP spec updates #275

@ahmetb

Description

Summary

We just had an incident in which CRP objects' ClusterResourcePlacementScheduled condition flipped to Unknown for no good reason; the CRP controller then requeued all of the objects, clogging its workqueue.

We were propagating ~10,000 namespaces with the Hub agent, and it took the controller, running 40 workers, 3 hours to finish reconciling these objects.

Background

Step 1

We changed the CRP.spec.strategy.rollingUpdate.maxUnavailable field of ~10,000 CRPs to a different value (we manage the CRP objects with another in-house controller).

Step 2

This change caused the ClusterResourcePlacementScheduled condition to change:

     conditions:
     - type: ClusterResourcePlacementScheduled
-      lastTransitionTime: "2025-10-08T17:24:04Z"
-      message: found all the clusters needed as specified by the scheduling policy
-      observedGeneration: 6
-      reason: SchedulingPolicyFulfilled
-      status: "True"
+      lastTransitionTime: "2025-10-08T17:55:42Z"
+      message: Scheduling has not completed
+      observedGeneration: 7
+      reason: SchedulePending
+      status: Unknown

This happens in the following check (excerpted; the first clause holds right after any spec update, until the scheduler records the new generation in the policy snapshot):

if latestSchedulingPolicySnapshot.GetPolicySnapshotStatus().ObservedCRPGeneration < placementObj.GetGeneration() ||
	scheduledCondition.Status == metav1.ConditionUnknown {
	return metav1.Condition{
		Status:             metav1.ConditionUnknown,
		Type:               getPlacementScheduledConditionType(placementObj),
		Reason:             condition.SchedulingUnknownReason,
		Message:            "Scheduling has not completed",
		ObservedGeneration: placementObj.GetGeneration(),
	}
}

Mind you, we only have a very basic placement policy, so we don't even know why these are being set to Unknown:

spec:
  policy:
    clusterNames:
    - lit-lca1-1-k8s-1
    placementType: PickFixed

Step 3

Hub agent's controller sees the changed items and adds them to the queue, and the workqueue starts building up:

(screenshot: workqueue depth)

Step 4

Hub agent's cluster-resource-placement-controller-v1beta1 logs are flooded with the following message, saying the scheduling condition is Unknown:

clusterresourceplacement/controller.go:280] "Scheduler has not scheduled any cluster yet and requeue the request as a backup"
        clusterResourcePlacement="proxima-grpc-pod-26526"     scheduledCondition="&Condition{Type:ClusterResourcePlacementScheduled,Status:Unknown,ObservedGeneration:4,LastTransitionTime:2025-10-07 23:16:32 +0000 UTC,Reason:SchedulePending,Message:Scheduling has not completed,}"
        generation=4
...

and proceeds to requeue everything:

Requeue outcomes: (screenshot)

tl;dr

Resetting the ClusterResourcePlacementScheduled condition to Unknown on every CRP.spec change effectively makes the Hub controller unable to keep up with 10,000 objects, even with 40 workers (because it uses only 1/10 of them; see #273).

cc: @mikehelmick @ArchanaAnand0212
