Description
For whatever reason, we are observing that the fleet_workload_reconcile_total and controller_runtime_reconcile_total metrics completely disagree with one another.
This caused extra delay in mitigating an incident: company-wide, we look at the standard controller_runtime_reconcile_total metric that the rest of the controller development world uses, but it turns out the signal we should have been looking at was in Fleet's custom metric, which works very differently.
I highly recommend moving away from this custom approach and sticking to controller-runtime's basic reconciler model: not maintaining separate work queues, and not doing interesting things.
Attaching the two metrics for reference from the same timeframe:
Unfortunately, these look nothing alike.
For example, here is a zoomed-in view of a period where Fleet metrics showed "requeues" but controller-runtime metrics didn't (this cost us valuable troubleshooting time during an incident):



