interface: support multiple resourceSnapshot versions across clusters #33

jwtty · 2025-04-25T22:27:55Z

Description of your changes

This PR is the interface update to support showing multiple resourceSnapshot versions across targeted clusters on CRP.

Current behavior:
In the current CRP status, crp.status.ObservedResourceIndex is always set to the index of the latest resourceSnapshot that currently exists on the hub cluster, because the current rollingUpdate rollout strategy only supports rolling out to the latest resourceSnapshot version. Both crp.status.Conditions (the CRP status as a whole) and crp.status.placementStatuses (the status of each targeted member cluster) are based on this index.

This has several drawbacks:

With staged update run enabled, users can specify the resourceSnapshot version to roll out. When that version is not the latest, the CRP status always shows "RolloutNotStarted" because the resourceSnapshot version does not match.
During the rollout process, with either updateRun or rollingUpdate, different clusters may observe different resourceSnapshot versions, e.g. when the rollout is waiting for resources to become available or is stuck due to some failure. The CRP currently always shows "RolloutNotStarted" in this case, which does not accurately reflect the status. An example where a manifest is not yet available:

status:
  conditions:
  - lastTransitionTime: "2025-04-25T21:07:55Z"
    message: found all cluster needed as specified by the scheduling policy, found
      2 cluster(s)
    observedGeneration: 2
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2025-04-25T22:17:12Z"
    message: The rollout is being blocked by the rollout strategy in 1 cluster(s)
    observedGeneration: 2
    reason: RolloutNotStartedYet
    status: "False"
    type: ClusterResourcePlacementRolloutStarted
  observedResourceIndex: "3"
  placementStatuses:
  - clusterName: member1
    conditions:
    - lastTransitionTime: "2025-04-25T21:07:55Z"
      message: 'Successfully scheduled resources for placement in "member1" (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 2
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 2
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 2
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 2
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2025-04-25T22:17:13Z"
      message: All corresponding work objects are applied
      observedGeneration: 2
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2025-04-25T22:17:13Z"
      message: Work object example-placement-work is not yet available
      observedGeneration: 2
      reason: NotAllWorkAreAvailable
      status: "False"
      type: Available
    failedPlacements:
    - condition:
        lastTransitionTime: "2025-04-25T22:17:13Z"
        message: Manifest is not yet available; Fleet will check again later
        reason: ManifestNotAvailableYet
        status: "False"
        type: Available
      kind: Service
      name: nginx
      namespace: test-namespace
      version: v1
  - clusterName: member2
    conditions:
    - lastTransitionTime: "2025-04-25T21:07:55Z"
      message: 'Successfully scheduled resources for placement in "member2" (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 2
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2025-04-25T22:17:12Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 2
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted

In the above example, the rollout did happen on member1; it is just waiting for the resource to become available. But the CRP condition still shows "RolloutNotStarted". And for member2, we lose even more information: nothing shows that it is still running the earlier version.

Proposed API

The proposed solution is to change the definition of crp.status.ObservedResourceIndex from the index of the latest resourceSnapshot observed on the hub cluster to the index of the latest resourceSnapshot observed across all targeted member clusters. For example, when the latest index is 4 on the hub cluster but 2, 2, 3 on the member clusters respectively, the current crp.status.ObservedResourceIndex is 4, but it will be 3 with the proposed change. This does not break the existing rollingUpdate strategy, as all member clusters will eventually be updated to 4 and the CRP will reach its final terminal state. And with updateRun or another external rollout strategy, the CRP status becomes more accurate: the CRP now shows the version that the user rolls out, not the default latest version.
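The redefinition in this initial proposal can be sketched as taking the highest index observed on any member cluster. This is only an illustration: `latestObservedIndex` is a hypothetical helper, not a function from the repository, and the per-cluster indexes are passed as plain integers for simplicity.

```go
package main

import (
	"fmt"
	"strconv"
)

// latestObservedIndex is a hypothetical helper. It returns the highest
// resourceSnapshot index observed across the given member clusters,
// formatted as a string like the observedResourceIndex field in the status.
// It returns "" when no cluster has reported an index (an assumption).
func latestObservedIndex(perCluster []int) string {
	if len(perCluster) == 0 {
		return ""
	}
	latest := perCluster[0]
	for _, idx := range perCluster[1:] {
		if idx > latest {
			latest = idx
		}
	}
	return strconv.Itoa(latest)
}

func main() {
	// The hub's latest snapshot is 4, but the members observe 2, 2 and 3.
	fmt.Println(latestObservedIndex([]int{2, 2, 3})) // prints 3
}
```

Note that the Apr 29 revision further down replaces this rule with different behavior per rollout strategy.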

There are two options to handle per-cluster status:

Each cluster still reports its status based on crp.status.ObservedResourceIndex. This aligns with the current API.
Add an ObservedResourceIndex inside the per-cluster placement status, and base each cluster's status on this version. For example, when the current version on each cluster is 2, 3, 3 respectively, under the first option the first cluster shows RolloutNotStarted or RolloutUnknown as it has not been updated to version 3 yet, while with this new option it shows RolloutStarted with observedResourceIndex 2. This somewhat breaks the current API, but it displays a more accurate status.
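To make the difference between the two options concrete, here is a rough sketch. The type `clusterPlacementStatus` and the helpers `statusOption1` and `statusOption2` are hypothetical and heavily trimmed; they are not the actual API types. Under option 1 the sketch reports "False" for a cluster that is behind the CRP-level index (the description says it could also be Unknown), and under option 2 it assumes the cluster has started rolling out the index it reports.

```go
package main

import "fmt"

// clusterPlacementStatus is a hypothetical, trimmed-down stand-in for one
// entry of crp.status.placementStatuses.
type clusterPlacementStatus struct {
	ClusterName string
	// ObservedResourceIndex is the per-cluster field that option 2 adds.
	// It stays empty under option 1.
	ObservedResourceIndex string
	// RolloutStarted holds "True", "False" or "Unknown".
	RolloutStarted string
}

// statusOption1 judges the cluster against the CRP-level index.
func statusOption1(name, clusterIndex, crpIndex string) clusterPlacementStatus {
	started := "True"
	if clusterIndex != crpIndex {
		started = "False" // assumption: could also be reported as "Unknown"
	}
	return clusterPlacementStatus{ClusterName: name, RolloutStarted: started}
}

// statusOption2 judges the cluster against the index it actually observes.
func statusOption2(name, clusterIndex string) clusterPlacementStatus {
	return clusterPlacementStatus{
		ClusterName:           name,
		ObservedResourceIndex: clusterIndex,
		RolloutStarted:        "True", // assumption: rollout of clusterIndex has started
	}
}

func main() {
	// Clusters are on versions 2, 3, 3; the CRP-level index is 3.
	fmt.Printf("%+v\n", statusOption1("member1", "2", "3"))
	fmt.Printf("%+v\n", statusOption2("member1", "2"))
}
```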

With the proposed solution, we can easily support different rollout strategies, show cluster status more accurately, and support rollback in the future.

Revised Apr 28:

When the rollout strategy type is set to External, the rollout is managed by an external controller and the CRP is not aware of the actual rollout plan/target. In this case, setting ObservedResourceIndex and the RolloutStarted/Applied/Available conditions in the CRP-level status does not make sense and can cause confusion. Instead, we only show the per-cluster observedResourceIndex and conditions in the crp.status.placementStatuses array. The CRP conditions will only show Scheduled, plus RolloutStarted as Unknown.

Revised Apr 29:

Per offline discussion, we have decided that:

Per-cluster resourcePlacement status:
For both RollingUpdate and External rollout strategies, ObservedResourceIndex and conditions including RolloutStarted/Overridden/WorkSynchronized/Applied/Available should reflect the current observed resource index of the member cluster and be consistent with the corresponding clusterResourceBinding status.

CRP status:
CRP status serves as an aggregate of the per-cluster resourcePlacement status.

For RollingUpdate strategy:
ObservedResourceIndex is the default-latest resourceSnapshot index and the conditions are based on it.

For External strategy:
ObservedResourceIndex is not empty only if all the member clusters observe the same resource index.
When ObservedResourceIndex is empty, all conditions except Scheduled should be in Unknown state.
When ObservedResourceIndex is not empty, conditions show the aggregation of the clusters' status:

If any one of the clusters is Unknown, the aggregated state is Unknown.

Else if any one of the clusters is False, the aggregated state is False.

If all the clusters are True, the aggregated state is True.
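The External-strategy aggregation rules above can be sketched as follows. This is an illustration under stated assumptions, not the controller's implementation: `clusterStatus` and `aggregate` are hypothetical names, the status constants are local copies of the usual "True"/"False"/"Unknown" condition values, the function handles a single condition type at a time, and an empty cluster list is assumed to yield an empty index with Unknown.

```go
package main

import "fmt"

// ConditionStatus is a local copy of the usual condition status values.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// clusterStatus is a hypothetical per-cluster summary for one condition type
// (for example Applied or Available).
type clusterStatus struct {
	ObservedResourceIndex string
	Condition             ConditionStatus
}

// aggregate returns the CRP-level observed index and condition status.
// The index is non-empty only if all clusters observe the same index;
// otherwise the condition is Unknown.
func aggregate(clusters []clusterStatus) (string, ConditionStatus) {
	if len(clusters) == 0 {
		return "", ConditionUnknown // assumption: nothing to aggregate
	}
	index := clusters[0].ObservedResourceIndex
	for _, c := range clusters[1:] {
		if c.ObservedResourceIndex != index {
			return "", ConditionUnknown
		}
	}
	if index == "" {
		return "", ConditionUnknown
	}
	sawFalse := false
	for _, c := range clusters {
		switch c.Condition {
		case ConditionTrue:
			// Nothing to record; keep scanning.
		case ConditionFalse:
			sawFalse = true
		default:
			// Unknown (or any unexpected value) wins over False.
			return index, ConditionUnknown
		}
	}
	if sawFalse {
		return index, ConditionFalse
	}
	return index, ConditionTrue
}

func main() {
	idx, st := aggregate([]clusterStatus{
		{ObservedResourceIndex: "3", Condition: ConditionTrue},
		{ObservedResourceIndex: "3", Condition: ConditionFalse},
	})
	fmt.Printf("index=%q status=%s\n", idx, st)
}
```

Checking for Unknown before False mirrors the ordering of the rules: a single Unknown cluster makes the aggregate Unknown even when another cluster is False.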

Fixes #

I have:

Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

codecov · 2025-04-25T22:50:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

apis/placement/v1beta1/clusterresourceplacement_types.go

circy9

Assuming members are all running snapshot v1 now, if snapshot v2 is discovered and rolled out to member1 (still waiting for availability) and not to member2 yet, what do the per-CRP and per-member statuses look like?

jwtty · 2025-04-29T01:22:04Z

Assuming members are all running snapshot v1 now, if snapshot v2 is discovered and rolled out to member1 (still waiting for availability) and not to member2 yet, what do the per-CRP and per-member statuses look like?

It's exactly the example I show in the PR description. The CRP-level status shows RolloutStarted as False, because our API definition for this condition type sets RolloutStarted to True only when the rollout has started on all clusters (this is a bit weird, true). The per-member status is: member1 has RolloutStarted, ..., NotAvailableYet, while member2 shows RolloutNotStartedYet.
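As a rough illustration of the rule described in this reply (CRP-level RolloutStarted is True only when the rollout has started on all clusters), consider the sketch below. `memberRollout` and `crpRolloutStarted` are hypothetical names, and treating an empty member list as "not started" is an assumption.

```go
package main

import "fmt"

// memberRollout is a hypothetical per-member summary: whether the member's
// RolloutStarted condition is "True".
type memberRollout struct {
	ClusterName    string
	RolloutStarted bool
}

// crpRolloutStarted reports true only when every member has started rollout.
func crpRolloutStarted(members []memberRollout) bool {
	if len(members) == 0 {
		return false // assumption: no members means nothing has started
	}
	for _, m := range members {
		if !m.RolloutStarted {
			return false
		}
	}
	return true
}

func main() {
	// member1 has started rolling out v2; member2 is still blocked.
	members := []memberRollout{
		{ClusterName: "member1", RolloutStarted: true},
		{ClusterName: "member2", RolloutStarted: false},
	}
	fmt.Println(crpRolloutStarted(members)) // prints false
}
```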

apis/placement/v1beta1/clusterresourceplacement_types.go

circy9

Based on offline discussions, we agree to do the following:

For all strategies, the per-cluster observedResourceIndex and RolloutStarted/Overridden/WorkSynchronized/Applied/Available conditions should reflect the current observedResourceIndex status of the member cluster (consistent with the corresponding binding status).
For the External rollout strategy, the per-CRP observedResourceIndex and RolloutStarted/Overridden/WorkSynchronized/Applied/Available conditions should be an aggregate of the per-cluster status:

observedResourceIndex is non-empty only if all the clusters have the same observedResourceIndex.
When observedResourceIndex is empty, all the conditions will be unknown.
When observedResourceIndex is set,
- If any one of the clusters is unknown, the aggregated condition is unknown.
- (Else) if any one of the clusters is false, the aggregated condition is false.
- If all of the clusters are true, the aggregated condition is true.

For the RollingUpdate strategy, the per-CRP observedResourceIndex is the latest resourceIndex and the conditions are for the latest.

jwtty requested review from michaelawyu, ryanzhang-oss and zhiying-lin April 25, 2025 22:28

jwtty force-pushed the multi-version-api branch from c231ce4 to 7a85d9f Compare April 25, 2025 22:29

jwtty force-pushed the multi-version-api branch from 7a85d9f to e087787 Compare April 25, 2025 23:21

jwtty changed the title ~~interface: CRP API change to support multiple resourceSnapshot versio…~~ interface: support multiple resourceSnapshot versions across clusters Apr 26, 2025

ryanzhang-oss reviewed Apr 28, 2025

ryanzhang-oss previously approved these changes Apr 28, 2025

jwtty requested a review from circy9 April 28, 2025 20:28

circy9 reviewed Apr 29, 2025

jwtty dismissed ryanzhang-oss’s stale review via 339275e April 29, 2025 01:27

jwtty force-pushed the multi-version-api branch from 891db47 to 339275e Compare April 29, 2025 01:27

jwtty commented Apr 29, 2025

ryanzhang-oss reviewed Apr 29, 2025

jwtty added 3 commits April 30, 2025 00:22

interface: CRP API change to support multiple resourceSnapshot versio…

def7562

…ns across clusters Signed-off-by: Wantong Jiang <wantjian@microsoft.com>

revise API

80abd22

Signed-off-by: Wantong Jiang <wantjian@microsoft.com>

minor fix

1b8449c

Signed-off-by: Wantong Jiang <wantjian@microsoft.com>

jwtty force-pushed the multi-version-api branch 2 times, most recently from aa31336 to 215172c Compare April 30, 2025 01:34

revise

1890ad9

Signed-off-by: Wantong Jiang <wantjian@microsoft.com>

jwtty force-pushed the multi-version-api branch from 215172c to 1890ad9 Compare April 30, 2025 01:36

circy9 reviewed Apr 30, 2025

circy9 reviewed Apr 30, 2025

fix comments

48993ec

Signed-off-by: Wantong Jiang <wantjian@microsoft.com>

ryanzhang-oss approved these changes Apr 30, 2025

jwtty merged commit a2190f1 into kubefleet-dev:main Apr 30, 2025
20 of 25 checks passed

jwtty deleted the multi-version-api branch April 30, 2025 20:28

audrastump pushed a commit to audrastump/kubefleet that referenced this pull request May 7, 2025

interface: support multiple resourceSnapshot versions across clusters (…

499e676

…kubefleet-dev#33)
