Guidance

Here is a list of common customization rules, tips, and recommendations for configuring the detectors. This is not an exhaustive list, but it should cover the key features most users need. More detailed explanations of how it works are available on Templating.

Model

First, you need to understand that one metric can generate lots of different MTS (metric time series), each with its own set of datapoints (depending on the reporting interval of the metric).

An MTS is simply a unique combination of a metric with all of its available metadata attached. The most widespread metadata type is dimensions, and they are central to the detectors configuration because detector behavior can depend heavily on them.

For example, filtering and aggregation use dimensions as their source to operate, and the relevance of alerting on heartbeat detectors depends on these filtering and aggregation "rules".

The metadata available on metrics mainly comes from their data source. For example, all metrics from AWS share some common dimensions like aws_account_id.

The data source also often determines how data is reported. While AWS often does not report data when there is no "change" (e.g. no traffic on an ELB), the SignalFx Smart Agent always sends its metrics at a regular interval. In the same way, agent metrics are real time, while cloud integrations add a delay to metric collection.

So the data source and the metadata are crucial to create fine-grained detectors in this repository, but also very useful to use existing ones and configure them properly. Also, please add as much metadata as you can to your metrics; this will allow you to configure fine-grained monitoring, using the filtering and aggregation capabilities to adapt the behavior of your detectors to each goal.

Filtering

All modules implement a filtering strategy, based on metadata, common to all their detectors. The default behavior is to apply the tagging convention corresponding to the module and its source.

This is how the modules are "per environment" oriented: they filter by default on a specific environment, which allows the user to import the module multiple times, for as many environments as needed. When this convention cannot apply to a module, this is explained in its README.md.

To use this convention you do not have anything to configure at the detectors level, but you will have to add dimensions to your metrics to match the detectors' filtering. For example, add an env dimension to your data source, through globalDimensions for the SignalFx Smart Agent, or as a tag on an AWS service.

However, if this default convention does not fit your requirements, you can override it using the filter_custom_includes and filter_custom_excludes variables to specify your own filtering policy (or none at all), so feel free to change it.
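
For example, a minimal sketch of overriding the convention on a module (the module name, source, and dimension values are placeholders, not real ones):

module "signalfx-detectors-mysql" {
  source = "..." # path to the detectors module

  # Hypothetical policy: only monitor production resources of a "payments" team
  filter_custom_includes = ["env:production", "team:payments"]
  filter_custom_excludes = ["tier:canary"]
}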

More information in Templating filtering section.

Aggregation

The default behavior of SignalFlow is to not aggregate the time series coming from a metric. Indeed, it evaluates every single MTS separately, considering every available combination of metadata values.

As much as possible, detectors in this repository do not use aggregation by default, so that they work in a maximum of scenarios.

Nevertheless, sometimes you want to aggregate at a "higher" level, for example to evaluate an entire cluster and not each of its members separately. In this case, the only way is to aggregate.

The detectors in this repository are generic (at least by default) and it is not possible to know in advance every available metadata, especially since it depends on each environment. This is why they only use "reserved" dimensions which are always available or, in some cases, special ones which are explained in the local README.md of the module.

So, please be careful with detectors which:

  • do not aggregate by default: the detector applies to all MTS, so you may prefer to explicitly aggregate to another level which makes more sense in your environment.
  • do have a default aggregation: it is probably crucial to make the detector work, so if you change the aggregation you should keep every default dimension and only add the ones specific to your environment.

A very good example is the heartbeat based detectors, which are very sensitive to this metadata aggregation, since it determines the scope of the healthcheck. In general, try to define your own groups explicitly thanks to the aggregation_function variable to fully embrace your context, especially for heartbeats, which can easily create many false alerts if their evaluation is based on "dynamic" or frequently changing dimension values.
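
For example, a hedged sketch of evaluating a heartbeat per cluster instead of per member (the variable name and dimension are illustrative; check each module for its real aggregation variables):

module "signalfx-detectors-mysql" {
  source = "..." # path to the detectors module

  # Hypothetical: group all members so the alert fires only when the whole cluster goes silent
  heartbeat_aggregation_function = ".sum(by=['cluster_name'])"
}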

More information in Templating aggregation section.

Heartbeat

Heartbeats are perfect for health monitoring: they fire an alert for every group which stops reporting. In general, each module has its own heartbeat detector which checks the availability of the data source (i.e. does the database respond?).

As seen before, heartbeats highly depend on the aggregation used, which defines the groups to evaluate and consider as "unhealthy":

  • avoid using no aggregation at all, since every change in dimensions can make a group disappear and so raise an alert. For example, if you remove, add, or edit a globalDimensions entry at the agent level, it will probably raise an alert for every heartbeat applied to the corresponding host.
  • ignore any "dynamic" dimensions (like pod_id), either by removing them from the data source or by explicitly defining the aggregation at the detector level.
  • in general, define your own custom dimensions, like the environment level or the "business service", to use them properly in filtering or aggregation.

As you can see, we highly recommend defining an explicit aggregation adapted to your scenario for heartbeat detectors, which are a little special.

Some useful information about this:

  • VM states are filtered out automatically to support downscaling on GCP, AWS, and Azure.
  • when an MTS (without aggregation) or a group of MTS (with aggregation) disappears and leads to a heartbeat alert, you have to wait 36h for SignalFx to consider it inactive and stop raising alerts on it. Use a muting rule during this time.

More information in Templating heartbeat section.

Notifications

Every detector in this repository has at least one rule, and each rule represents a different severity level for an alert on a check done by the detector.

You can check the recommended destination for each severity binding. Then, you just have to define a list of recipients for each one.

locals {
  notification_slack = "Slack,credentialId"
  notification_pager = "PagerDuty,credentialId"
  notifications      = {
    critical = [local.notification_slack, local.notification_pager]
    major    = [local.notification_slack, local.notification_pager]
    minor    = [local.notification_slack]
    warning  = [local.notification_slack]
    info     = []
  }
}

In this example we forward critical and major alerts to both PagerDuty and Slack, minor and warning alerts to Slack only, and info alerts to nothing.

You can use locals and variables to define this binding, and we generally retrieve the integration ID (credentialId) from the output of a configured integration like the PagerDuty integration.
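
For example, a minimal sketch assuming the PagerDuty integration is managed in the same code base with the SignalFx Terraform provider (the API key variable is a placeholder):

resource "signalfx_pagerduty_integration" "pagerduty" {
  name    = "PagerDuty"
  enabled = true
  api_key = var.pagerduty_api_key # hypothetical variable holding the key
}

locals {
  # The resource id is the credentialId expected in the notification string
  notification_pager = "PagerDuty,${signalfx_pagerduty_integration.pagerduty.id}"
}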

In any case, you have to define every possible severity in the object, even if some of them do not interest you; this is for safety purposes. Of course, you can override this binding at the detector or rule level thanks to the notifications variable, but the global binding will apply to all detectors which do not have an overridden value.

More information in Templating notifications section.

Agent configuration

The SignalFx Smart Agent is the source of a lot of the data used as metrics by detectors in this repository. This is why it is crucial to know it well, to understand its deployment model, and to learn some tips to correctly match the detectors' behavior.

Full configuration options are available on the official documentation.

Deployment mode

The standard deployment is the mode where the agent is installed next to the service it monitors, for example collecting metrics from a database like MySQL installed on the virtual machine where the agent runs.

Detectors are configured, by default, to work in this mode in priority (where a choice has to be made, which generally consists of the aggregation configuration).

But sometimes the agent collects metrics from an external service, like an AWS RDS endpoint to keep the database example. In this case, it is generally recommended (as sketched below) to:

  • disable host dimensions using the disableHostDimensions parameter, so the hostname of the virtual machine where the agent runs is not used as the host dimension.
  • override the host dimension value manually with the extraDimensions parameter, setting it to the RDS name in our example.
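
A hedged sketch of an agent configuration for this remote case (the endpoint, credentials, and names are placeholders):

    # Do not attach the agent VM's hostname as the host dimension
    disableHostDimensions: true
    monitors:
      - type: collectd/mysql
        host: my-db.abc123.eu-west-1.rds.amazonaws.com  # hypothetical RDS endpoint
        port: 3306
        username: monitor
        password: secret
        databases:
          - name: mydb
        # Set the host dimension explicitly to the RDS name instead
        extraDimensions:
          host: my-db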

Kubernetes

For Kubernetes, we recommend deploying two different agent workloads from the Helm chart:

  • a mandatory daemonset to monitor each node of the cluster and fetch all internal metrics.
  • a simple, optional deployment which runs its agent on only one node, to monitor external targets only once, like webchecks or managed services such as AWS RDS or GCP Cloud SQL. You have to set the isServerless: true option in the chart for this (it enables disableHostDimensions as explained above); see the values sketch after this list.
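
A minimal values sketch for this second workload, assuming the signalfx-agent Helm chart (only the isServerless key comes from the text above; add your monitors for external targets to this deployment as usual):

    # values.yaml excerpt (hypothetical)
    isServerless: true   # single deployment, host dimensions disabled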

Dimensions

You can add custom dimensions at the global level (applied to all monitors) using globalDimensions, or to the metrics of one monitor only using extraDimensions.

It is also possible to fetch dimensions from endpoints discovered by the service discovery using the extraDimensionsFromEndpoint parameter.

In contrast, you can also remove every dimension coming from service discovery by configuring disableEndpointDimensions, or delete a list of specific undesired dimensions by mapping them to no value with dimensionTransformations.
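
Putting these together, a hedged sketch (the dimension names and values are illustrative):

    globalDimensions:
      env: production          # added to the metrics of all monitors
    monitors:
      - type: cpu
        extraDimensions:
          service: billing     # added to this monitor's metrics only
        dimensionTransformations:
          pod_id: ""           # mapping to an empty value drops the dimension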

Service discovery

If the role of monitors is to collect metrics, the role of observers is to discover endpoints.

It is possible to combine both, automatically configuring a monitor for each endpoint discovered by an observer which matches the defined discovery rule.

This is often used in highly dynamic environments like containers, but it can also be useful to automate configuration based on "rules" when your middleware is always deployed the same way across a fleet of instances.
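
For example, a minimal sketch combining a docker observer with a Redis monitor, following the classic pattern from the official documentation:

    observers:
      - type: docker
    monitors:
      - type: collectd/redis
        discoveryRule: container_image =~ "redis" && port == 6379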

Filtering and extra metrics

Every monitor has its own default metrics which always report (shown in bold in the documentation), but it also proposes non-default metrics which are considered "custom" and need to be explicitly enabled with the extraMetrics or extraGroups parameters. Using extraMetrics: ['*'] will accept all metrics from the monitor.
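
For example, a hedged sketch enabling non-default metrics on a monitor (the metric and group names are illustrative; check the monitor's documentation for the real ones):

    monitors:
      - type: collectd/mysql
        # ...other required monitor options omitted
        extraMetrics:
          - mysql_bpool_pages.free   # a hypothetical non-default metric
        extraGroups:
          - innodb                   # a hypothetical metric group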

In contrast, you may want to filter incoming metrics in or out with datapointsToExclude. See the official dedicated documentation.

For example, it is possible to use a "whitelisting" based filtering policy:

    datapointsToExclude:
      - metricNames:
        - '*'
        - '!i_want_this_metric'
        - '!and_this_one'
        - '!but_no_more'
        - '!than_these_4'

Troubleshooting

  • Check available endpoints and their available dimensions to configure service discovery by defining the right discoveryRule.
$ sudo signalfx-agent status endpoints
  • In case of a collection problem, check that the corresponding monitor is properly configured:
$ sudo signalfx-agent status monitors
  • If it does not appear in this list, check the SignalFx Smart Agent logs:
$ sudo journalctl -u signalfx-agent -f -n 200
  • Otherwise, check whether datapoints are being sent with the following command:
$ sudo signalfx-agent tap-dps
