Describe the bug
On Kubernetes clusters with a large number of pods (>10k), the PodInfo reflector times out. I'm measuring cluster size only in number of pods because I believe that's the only resource that feeds the PodInfo cache.
We also have a cluster with 30,000 pods and see the same behavior there. On our smaller clusters (~400 pods), we don't see this behavior.
Steps to reproduce
- Create a cluster with 10,000 pods
- Start the aws-load-balancer-controller
- Controller goes into CrashLoopBackOff
Expected outcome
I'd expect there to be some way to increase the timeout so we can give the controller more time to build the PodInfo cache.
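For illustration, here is a minimal sketch, assuming the sync wait is driven by a context deadline, of what a configurable timeout could look like. The `--pod-info-sync-timeout` flag and the `waitForPodInfoRepoSync` helper are hypothetical and do not exist in the controller today:

```go
// Minimal sketch only; --pod-info-sync-timeout is a hypothetical flag, not a
// real aws-load-balancer-controller option.
package main

import (
	"context"
	"flag"
	"log"
	"time"
)

// waitForPodInfoRepoSync stands in for the controller's real sync wait.
func waitForPodInfoRepoSync(ctx context.Context) error {
	<-ctx.Done() // pretend the cache never finishes syncing
	return ctx.Err()
}

func main() {
	syncTimeout := flag.Duration("pod-info-sync-timeout", 60*time.Second,
		"how long to wait for the podInfo repo's initial sync (hypothetical)")
	flag.Parse()

	// Letting the operator raise this deadline would give large clusters
	// enough time to complete the initial pod LIST.
	ctx, cancel := context.WithTimeout(context.Background(), *syncTimeout)
	defer cancel()

	if err := waitForPodInfoRepoSync(ctx); err != nil {
		log.Fatalf("problem wait for podInfo repo sync: %v", err)
	}
}
```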
Environment
We've been running v2.4.4, which works on the same cluster (with more than 10k pods). When we tried to upgrade, we couldn't get the controller to stay online. Newer versions of the controller start, then log:
{"level":"info","msg":"starting podInfo repo"}
⌛ this takes a good amount of time ~60s
{"level":"error","msg":"problem wait for podInfo repo sync", "error": "timed out waiting for the condition"}
- ❌ v2.7.2, v2.8.3, v2.9.0: the versions on which we see this behavior.
- ✅ v2.4.4: this version of the controller works as expected on clusters with more than 30,000 pods.
- Kubernetes version: 1.28
- Not using EKS