Skip to content

Conversation

@jwtty
Copy link
Collaborator

@jwtty jwtty commented Apr 25, 2025

Description of your changes

This PR is the interface update to support showing multiple resourceSnapshot versions across targeted clusters on CRP.

Current behavior:
In current CRP status, there's crp.status.ObservedResourceIndex that is always set to the index of the latest resourceSnapshot currently exists on the hub cluster as current rollingUpdate rollout strategy only supports rolling to the latest resourceSnapshot version. The crp.status.Conditions (CRP status as a whole) as well as crp.status.placementStatuses (each targeted member clsuter status) are both based on this index.

This has several cons:

  1. With staged update run enabled, users are able to specify the resourceSnapshot version to rollout. When the version is not the latest, CRP status always shows as "RolloutNotStarted", because resourceSnapshot version does not match.
  2. During the rollout process, either with updateRun or rollingUpdate, different clusters may observe different resourceSnapshot versions, e.g. when rolling is waiting for resource to be available or just gets stuck due to some failure, current CRP always shows as "RolloutNotStarted", which does not accurately reflect the status. An example with manifestNotAvailable yet:
status:
  conditions:
  - lastTransitionTime: "2025-04-25T21:07:55Z"
    message: found all cluster needed as specified by the scheduling policy, found
      2 cluster(s)
    observedGeneration: 2
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2025-04-25T22:17:12Z"
    message: The rollout is being blocked by the rollout strategy in 1 cluster(s)
    observedGeneration: 2
    reason: RolloutNotStartedYet
    status: "False"
    type: ClusterResourcePlacementRolloutStarted
  observedResourceIndex: "3"
  placementStatuses:
  - clusterName: member1
    conditions:
    - lastTransitionTime: "2025-04-25T21:07:55Z"
      message: 'Successfully scheduled resources for placement in "member1" (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 2
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 2
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 2
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 2
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2025-04-25T22:17:13Z"
      message: All corresponding work objects are applied
      observedGeneration: 2
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2025-04-25T22:17:13Z"
      message: Work object example-placement-work is not yet available
      observedGeneration: 2
      reason: NotAllWorkAreAvailable
      status: "False"
      type: Available
    failedPlacements:
    - condition:
        lastTransitionTime: "2025-04-25T22:17:13Z"
        message: Manifest is not yet available; Fleet will check again later
        reason: ManifestNotAvailableYet
        status: "False"
        type: Available
      kind: Service
      name: nginx
      namespace: test-namespace
      version: v1
  - clusterName: member2
    conditions:
    - lastTransitionTime: "2025-04-25T21:07:55Z"
      message: 'Successfully scheduled resources for placement in "member2" (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 2
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 2
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted

In above example, rollout does happen to member1, it's just waiting for the resource to become available. But in the CRP condition, it still shows "RolloutNotStarted". And in member2, we actually loses more information that it currently is still running on the earlier version.

Proposed API

The proposed solution is to modify the definition of crp.status.ObservedResourceIndex from the index of latest resourceSnapshot observed on hub cluster to the index of the latest resourceSnapshot observed across all targeted memberclusters. For example, when the currently latest index is 4 on hub cluster but 2, 2, 3 on member clusters respectively, current crp.statusObservedResourceIndex is 4, but will be 3 with the proposed changed. This does not break existing rollingUpdate strategy as all member clusters will be updated to 4 eventually and crp will reach the final terminate state. And with updateRun or other external rollout strategy, users can find it more accurate to reflect the current CRP status: CRP now shows the version that user rolls out, not the default-latest version.

There are two options to handle per-cluster status:

  1. Each cluster still reports its status based on crp.status.ObservedResourceIndex. This aligns with current API.
  2. Add an ObservedResourceIndex inside per-cluster placement status. And each cluster's status is based on this version. For example, when current version on each cluster is 2, 3, 3 respectively, the first cluster shows RolloutNotStarted or RolloutUnknown as it's not being updated to version 3 yet, while with this new option, it's RolloutStarted and observedResourceIndex is 2. This somehow breaks current API but it displays more accurate status.

With the proposed solution, we can easily support different rollout strategy, show cluster status more accurately, and can support rollout back in the future.

Revised Apr 28:

When rollout strategy type is set to External, rollout is managed by an external controller and crp is not aware of the actual rollout plan/target. In this case, setting ObservedResourceIndex and RolloutStarted/Applied/Available conditions in the crp whole status does not make sense and can cause confusion. In this case, we only demonstrate the per-cluster observedResourceIndex and conditions in crp.statusResoucePlacementStatuses array. CRP condition will only show Scheduled and RolloutStarted Unknown.

#### Revised Apr 29:

Per offline discussion, we have decided that:

  • Per-cluster resourcePlacement status:
    For both RollingUpdate and External rollout strategies, ObservedResourceIndex and conditions including RolloutStarted/Overridden/WorkSynchronized/Applied/Available should reflect the current observed resource index of the member cluster and be consistent with the corresponding clusterResourceBinding status.
  • CRP status:
    CRP status serves as an aggregate of the per-cluster resourcePlacement status.
    • For RollingUpdate strategy:
      ObservedResourceIndex is the default-latest resourceSnapshot index and the conditions are based on it.
    • For External strategy:
      ObservedResourceIndex is not empty only if all the member clusters observe the same resource index.
      When ObservedResourceIndex is empty, all conditions except Scheduled should be in Unknown state.
      When ObservedResourceIndex is not empty, conditions show the aggregation of the clusters' status:
      • If any one of the cluster is Unknown, aggregated state is Unknown.
      • Else if any one of the cluster is False, aggregated state is False.
      • If all the clusters are True, aggregated state is True.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@codecov
Copy link

codecov bot commented Apr 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

📢 Thoughts on this report? Let us know!

@jwtty jwtty force-pushed the multi-version-api branch from 7a85d9f to e087787 Compare April 25, 2025 23:21
@jwtty jwtty changed the title interface: CRP API change to support multiple resourceSnapshot versio… interface: support multiple resourceSnapshot versions across clusters Apr 26, 2025
ryanzhang-oss
ryanzhang-oss previously approved these changes Apr 28, 2025
@jwtty jwtty requested a review from circy9 April 28, 2025 20:28
Copy link
Contributor

@circy9 circy9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming members are all running snapshot v1 now, if snapshot v2 is discovered and rolled out to member1 (still waiting for availability) and not to member2 yet, how does the per-CRP and per-member status look like?

@jwtty
Copy link
Collaborator Author

jwtty commented Apr 29, 2025

Assuming members are all running snapshot v1 now, if snapshot v2 is discovered and rolled out to member1 (still waiting for availability) and not to member2 yet, how does the per-CRP and per-member status look like?

It's the exact content as the example I show in the PR description. CRP whole status shows RolloutStarted false as our API definition by this type is RolloutStarted set to True when rollout started on all clusters (this is a bit weird, true). And Per-member status is: for member1, it has RolloutStarted, ..., NotAvailableYet; while member2, it shows RolloutNotStartedYet.

jwtty added 3 commits April 30, 2025 00:22
…ns across clusters

Signed-off-by: Wantong Jiang <wantjian@microsoft.com>
Signed-off-by: Wantong Jiang <wantjian@microsoft.com>
Signed-off-by: Wantong Jiang <wantjian@microsoft.com>
@jwtty jwtty force-pushed the multi-version-api branch 2 times, most recently from aa31336 to 215172c Compare April 30, 2025 01:34
Signed-off-by: Wantong Jiang <wantjian@microsoft.com>
@jwtty jwtty force-pushed the multi-version-api branch from 215172c to 1890ad9 Compare April 30, 2025 01:36
Copy link
Contributor

@circy9 circy9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on offline discussions, we agree to do the following:

  1. For all strategies, per-cluster observedresourceindex and RolloutStarted/Overridden/WorkSynchronized/Applied/Available should reflect the current observedresourceindex status of the member cluster (consistent with the corresponding binding status).

  2. For external rollout strategy, per-CRP observedresourceindex and RolloutStarted/Overridden/WorkSynchronized/Applied/Available should be an aggregate of the per-cluster status:

  • observedresourceindex is non-empty only if all the clusters have the same observedresourceindex.
  • When observedresourceindex is empty, all the conditions will be unknonw.
  • When observedresourceindex is set,
    • If any one of the cluster is unknown, the aggregated condition will be unknown.
    • (Else) if any one of the cluster is false, the aggregated condition is false.
    • If all of the clusters are true, the aggregated condition is true.
  1. For rollingupdate strategy, per-CRP observedresourceindex is the lastest resourceIndex and the conditions are for the latest.

Signed-off-by: Wantong Jiang <wantjian@microsoft.com>
@jwtty jwtty merged commit a2190f1 into kubefleet-dev:main Apr 30, 2025
20 of 25 checks passed
@jwtty jwtty deleted the multi-version-api branch April 30, 2025 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants