hub agent reconciliation metrics don't match controller-runtime's #274

@ahmetb

For whatever reason, we are observing that the fleet_workload_reconcile_total and controller_runtime_reconcile_total metrics completely disagree with one another.

This delayed the mitigation of an incident: company-wide, we look at the standard controller_runtime_reconcile_total metric that the rest of the controller development world uses, but it turns out the signal we should have been looking at was in Fleet's custom metric, which works very differently.

I highly recommend moving away from these custom mechanisms and just sticking to controller-runtime's basic reconciler model: not maintaining separate work queues, and not doing anything unusual on top of it.

Attaching the two metrics for reference from the same timeframe:

fleet metric:
Image

controller-runtime:
Image

Unfortunately, these look nothing alike.

For example, here is a zoomed-in view of a period where the Fleet metric showed "requeues" but the controller-runtime metric didn't (this cost us valuable troubleshooting time during an incident):

fleet metric:
Image

controller-runtime:
Image

cc: @mikehelmick @ArchanaAnand0212
