Describe the bug
"kubernetes-fleet.io/last-applied-configuration" includes "deployment.kubernetes.io/revision". If the annotation in the hub deployment goes out of sync with the annotaions on the member clusters, this can cause a behavior where work is never considered available because:
- work applier applies the last-applied-configuration to member cluster deployment with revision from hub cluster, which updates the deployment to a new generation
- the availability check compares generation with observed generation (generation - 1 at this point)
- the deployment is not considered available
- k8s deployment controller updates the revision annotation to a correct number (correct for the member cluster but different from hub cluster deployment)
- in next work applier reconciliation loop, fleet updates last-applied-configuration again, now goes back to step 1
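A minimal sketch of the kind of availability check involved (not the actual fleet code; the function name and the exact conditions are assumptions). Because the API server bumps a Deployment's generation on annotation changes as well as spec changes, rewriting the revision annotation leaves observedGeneration one behind generation, so the check fails:

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// isDeploymentAvailable is a simplified stand-in for the availability check:
// right after the work applier rewrites the revision annotation, the member
// cluster's deployment controller has not yet observed the new generation.
func isDeploymentAvailable(d *appsv1.Deployment) bool {
	if d.Status.ObservedGeneration != d.Generation {
		// observedGeneration == generation - 1 at this point in the loop,
		// so the deployment is reported as not yet available.
		return false
	}
	// Other conditions (replica counts, the Available condition, ...) omitted.
	return true
}
```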
Environment
Please provide the following:
- Hub cluster details
- Member cluster details
To Reproduce
Steps to reproduce the behavior:
These steps drive the hub cluster deployment's revision annotation out of sync with the member cluster deployments' revision annotations (a small sketch after this list shows one way to confirm the divergence):
- kubectl apply a resource placement that uses the PickN strategy with maxUnavailable == 10 (some large number, so the deployment rolls out to all member clusters quickly) together with a valid deployment, targeting 4 member clusters at the same time (the problem is more obvious with more member clusters); apply everything as one yaml with kubectl apply -f
- update the image of the deployment to an invalid tag (e.g. nginx:1.25 -> nginx:1.25.99) with kubectl apply -f
- update the image of the deployment to another invalid tag (e.g. nginx:1.25.99 -> nginx:1.25.999) using kubectl apply -f within 5 seconds of the last update
- do a few more bad updates to the deployment in quick succession to drive the hub revision and member revisions out of sync
- now update the image back to a good one (e.g. nginx:1.25)
- the works and resource bindings will never become available, because generation != observedGeneration due to the mismatched deployment revision in last-applied-configuration
- since the resource bindings are never available, if you change maxUnavailable to 1 and roll out another deployment with a non-existent image, fleet will not respect maxUnavailable == 1 and will roll out to all members immediately, because fleet considers it safe to update all clusters holding bad deployments at once (the assumption seems to be that if the previous deployment is bad, the new deployment cannot be worse)
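To double-check the divergence after these steps, something like the following sketch can compare the revision annotation on the hub copy of the Deployment with a member cluster's copy. The client variables, namespace, and deployment name are placeholders for illustration, not actual fleet code:

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const revisionAnnotation = "deployment.kubernetes.io/revision"

// printRevisions prints "deployment.kubernetes.io/revision" as seen on the hub
// cluster and on one member cluster; after the repro steps the two values differ.
func printRevisions(ctx context.Context, hubClient, memberClient kubernetes.Interface) error {
	hub, err := hubClient.AppsV1().Deployments("test-ns").Get(ctx, "nginx", metav1.GetOptions{})
	if err != nil {
		return err
	}
	member, err := memberClient.AppsV1().Deployments("test-ns").Get(ctx, "nginx", metav1.GetOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("hub revision=%q, member revision=%q\n",
		hub.Annotations[revisionAnnotation], member.Annotations[revisionAnnotation])
	return nil
}
```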
Expected behavior
Once the image is updated back to a valid tag, the works and resource bindings become available again, and subsequent rollouts respect maxUnavailable == 1.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Maybe "deployment.kubernetes.io/revision" should be removed in sanitizeManifestObject so that it is not added to last-applied-configuration; a rough sketch of the idea is below.
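A rough sketch of that idea, assuming the sanitize step works on unstructured manifest objects (the actual sanitizeManifestObject signature in fleet may differ):

```go
package sketch

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

const deploymentRevisionAnnotation = "deployment.kubernetes.io/revision"

// stripRevisionAnnotation drops the controller-managed revision annotation from
// a manifest before it is recorded, so it never ends up in
// "kubernetes-fleet.io/last-applied-configuration".
func stripRevisionAnnotation(obj *unstructured.Unstructured) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		return
	}
	delete(annotations, deploymentRevisionAnnotation)
	obj.SetAnnotations(annotations)
}
```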