Guidance

Quentin Manfroi edited this page Dec 2, 2020 · 33 revisions

Here is a list of common customization rules, tips, and recommendations for configuring your detectors. It is not an exhaustive list, but it should cover the key features most users need. More detailed explanations of how this works are available on the Templating page.

Model

First, you need to understand that one metric can generate many different MTS (Metric Time Series), each with its own set of datapoints (depending on the reporting interval of the metric).

An MTS is simply the unique combination of a metric with all of its attached metadata. The most common metadata type is the dimension, and dimensions are central to detector configuration because detector behavior can depend heavily on them.

For example, filtering and aggregation use dimensions as their source, and the relevance of alerting on heartbeat detectors depends on these filtering and aggregation "rules".

Metadata available on metrics mainly comes from their data source. For example, all metrics from AWS share some common dimensions like aws_account_id.

The data source also determines how data is reported. AWS does not report data when there is no "change" (e.g. no traffic on an ELB), whereas the SignalFx Smart Agent always sends its metrics at a regular interval. Similarly, agent metrics arrive in near real time, while cloud integrations add a delay to collect metrics.

So the data source and the metadata are crucial for creating fine-grained detectors in this repository, but also very useful for using existing ones and configuring them properly. Please add as much metadata as you can to your metrics; this will let you configure fine-grained monitoring, using filtering and aggregation capabilities to adapt detector behavior to each goal.

Filtering

All modules implement a filtering strategy, based on metadata, that is common to all of their detectors. The default behavior is to apply the tagging convention corresponding to the module and its source.

The "per environment" oriented filtering, applied by default on a specific dimension, allows you to import the module multiple times, once per environment. If this convention cannot be applied to some modules, this must be explained in their README.md.

To use this convention you do not have to configure anything at the detector level, but you will need to add dimensions to your metrics to match the detector filters. For example, add the env dimension to your data source, using globalDimensions with the SignalFx Smart Agent or adding it as a tag on an AWS service.
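With the Smart Agent, adding the env dimension could look like the following sketch (the environment value is an assumption):

```yaml
# agent.yaml
globalDimensions:
  env: production   # must match the env filter used by the detectors
```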

However, this default convention may not fit your requirements; you can override it using the filter_custom_includes and filter_custom_excludes variables to specify your own filtering policy (or none at all), so feel free to change it.

More information in the Templating filtering section.

Multiple instances

You can also use these variables to import the same module multiple times with different filtering policies to match different resources (e.g. import and filter per host, which duplicates detectors like the Nagios approach does).

In general, we prefer to rely on the "automatic discovery" capability (as Prometheus does), but importing multiple times can be useful to apply fine-grained detector configuration to different resources.

When importing the same module multiple times, it is recommended to use the prefixes variable.

Here is an example which:

  • uses the system module twice: once with a higher threshold for a specific host, and once with the default threshold for all other hosts respecting the default tagging convention.
  • imports the module once with the default filtering in filter_custom_includes, but excludes the host with_different_threshold.
  • imports it a second time, this time filtering only on the host with_different_threshold.

module "signalfx-detectors-smart-agent-system-common" {
  source = "../../modules/smart-agent_system-common"

  environment   = var.environment
  notifications = local.notifications

  filter_custom_includes = [format("env:%s", var.environment), "sfx_monitored:true"]
  filter_custom_excludes = ["host:with_different_threshold"]
}

module "signalfx-detectors-smart-agent-system-common-high-load" {
  source = "../../modules/smart-agent_system-common"

  environment   = var.environment
  notifications = local.notifications

  prefixes               = ["LOADED"]
  filter_custom_includes = ["host:with_different_threshold"]

  load_threshold_critical = 6
  load_threshold_major    = 3
}

Aggregation

By default, SignalFlow does not aggregate the time series coming from a metric: it evaluates every single MTS separately, considering every available combination of metadata values.

Detectors in this repository avoid aggregation by default as much as possible, so that they work in the widest range of scenarios.

Nevertheless, sometimes we want to evaluate at a "higher" level, such as an entire cluster rather than each of its members separately. In this case, the only way is to aggregate.

Detectors in this repository are generic (at least by default), and it is not possible to know in advance every metadata value available, since these depend on each environment. This is why they only use "reserved" dimensions that are always available or, in some cases, special ones explained in the local README.md of the module.

So, please be careful with detectors which:

  • do not have any aggregation by default: they will apply to all MTS, so you will probably prefer to explicitly aggregate at another level that makes more sense in your environment.
  • have a default aggregation: it is probably crucial to make the detector work, and if you change the aggregation you should probably keep every default dimension and only add the ones specific to your environment.

A very good example is the heartbeat detector, which is very sensitive to this metadata aggregation since it determines the scope of the health check. In general, try to explicitly define your own groups with the aggregation_function variable to fully fit your context, especially for heartbeats, which can easily create many false alerts if their evaluation is based on "dynamic" or frequently changing dimension values.
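As a sketch of defining an explicit group (in this repository the variable may be prefixed per detector; the cluster dimension is a hypothetical custom dimension added at agent level):

```hcl
module "signalfx-detectors-smart-agent-system-common" {
  source = "../../modules/smart-agent_system-common"

  environment   = var.environment
  notifications = local.notifications

  # Evaluate per cluster instead of per MTS; "cluster" is a
  # hypothetical custom dimension you would add to your metrics.
  aggregation_function = ".mean(by=['cluster'])"
}
```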

More information in Templating aggregation section.

Heartbeat

Heartbeat detectors are perfect for monitoring availability: they fire an alert for every group that stops reporting. In general, each module has its own heartbeat detector which checks the availability of the data source (e.g. does the database respond?).

As seen before, they highly depend on the aggregation used, which defines the groups to evaluate and consider as "unhealthy":

  • avoid using no aggregation at all: each change in dimensions could make a group disappear and so trigger an alert. For example, if you remove, add, or edit a globalDimensions entry at agent level, it will probably raise an alert for every heartbeat applied to the corresponding host.
  • ignore any "dynamic" dimensions (like pod_id), either by removing them from the data source or by explicitly defining aggregation at detector level.
  • in general, define your own custom dimensions, like the environment level or "business service", so you can use them properly in filtering or aggregation.

As you should now understand, we highly recommend defining an explicit aggregation adapted to your scenario for heartbeat detectors, which are a little special.

Some useful information about this:

  • VM states are filtered out automatically to support downscaling on GCP, AWS, and Azure.
  • when an MTS (without aggregation) or a group of MTS (with aggregation) disappears and triggers a heartbeat alert, you need to wait 36h for SignalFx to consider it inactive and stop alerting on it. Use a muting rule during this time.

More information in Templating heartbeat section.

Notifications

Every detector in this repository has at least one rule, and each rule represents a different severity level for an alert on the check done by the detector.

You can check the recommended destinations for each severity binding. Then, you just have to define a list of recipients for each one.

locals {
  notification_slack = "Slack,credentialId"
  notification_pager = "PagerDuty,credentialId"
  notifications      = {
    critical = [local.notification_slack, local.notification_pager]
    major    = [local.notification_slack, local.notification_pager]
    minor    = [local.notification_slack]
    warning  = [local.notification_slack]
    info     = []
  }
}

In this example we forward critical and major alerts to both PagerDuty and Slack, minor and warning alerts to Slack only, and info alerts to nothing.

You can use locals and variables to define this binding; we generally retrieve the integration id (credentialId) from the output of a configured integration like the PagerDuty integration.

In any case, you have to define every possible severity in the object, even if some of them do not interest you; this is for safety purposes. Of course, you can override this binding at detector or rule level thanks to the notifications variables, but the global value will apply to all detectors that do not have an overridden value.

More information in Templating notifications section.

Agent configuration

The SignalFx Smart Agent is the source of much of the data used as metrics by the detectors in this repository. This is why it is crucial to know it well, understand its deployment model, and learn a few tips to correctly match the detectors' behavior.

Full configuration options are available on the official documentation.

Deployment mode

The standard deployment represents the mode where the agent is installed next to the service it monitors; for example, collecting metrics from a database like MySQL installed on the virtual machine where the agent runs.

Detectors are configured, by default, to work in this mode in priority (where a choice has to be made, which generally concerns the aggregation configuration).

But sometimes the agent collects metrics from an external service, like an AWS RDS endpoint to keep the database example. In this case, it is generally recommended to:

  • disable host dimensions using the disableHostDimensions parameter, so the hostname of the virtual machine where the agent runs is not used as the host dimension.
  • override the host dimension value manually using the extraDimensions parameter, with the RDS name in our example.
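A sketch of the corresponding agent configuration (the endpoint, database, and RDS name below are hypothetical):

```yaml
# agent.yaml
disableHostDimensions: true
monitors:
  - type: collectd/mysql
    host: mydb.example.rds.amazonaws.com   # hypothetical RDS endpoint
    port: 3306
    databases:
      - name: mydb
    # Use the RDS instance name as the host dimension instead of the VM hostname
    extraDimensions:
      host: mydb
```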

Kubernetes

For Kubernetes, we recommend deploying, from the Helm chart, two different agent workloads:

  • a daemonset, mandatory to monitor each node of the cluster and fetch all internal metrics.
  • a simple, optional deployment which runs its agent on only one node, to monitor external targets a single time, like webchecks or managed services such as AWS RDS or GCP Cloud SQL. You have to set the isServerless: true option in the chart for this (it sets disableHostDimensions as explained above).
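A minimal sketch of the chart values for this second workload (the monitor and its target URL are assumptions; only isServerless is taken from the text above):

```yaml
# values.yaml for the optional single-node deployment
isServerless: true
monitors:
  # external targets only; the URL is hypothetical
  - type: http
    url: https://example.com/health
```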

Dimensions

You can add custom dimensions at the global level (applied to all monitors) using globalDimensions, or to the metrics of a single monitor using extraDimensions.

It is also possible to fetch dimensions from endpoints discovered by the service discovery, using the extraDimensionsFromEndpoint parameter.

Conversely, you can remove every dimension coming from service discovery by configuring disableEndpointDimensions, or delete a list of specific unwanted dimensions by mapping them to an empty value with dimensionTransformations.
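A sketch combining these options (the monitor, dimension names, and values are illustrative assumptions):

```yaml
# agent.yaml
globalDimensions:
  env: production            # added to all monitors
monitors:
  - type: collectd/redis
    host: 127.0.0.1
    port: 6379
    extraDimensions:
      service: cache         # added to this monitor's metrics only
    dimensionTransformations:
      pod_id: ""             # map to an empty value to drop the dimension
```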

Service discovery

If the role of monitors is to collect metrics, the role of observers is to discover endpoints.

It is possible to combine both, automatically configuring a monitor for each endpoint discovered by an observer which matches the defined discovery rule.

This is often used in highly dynamic environments like containers, but it can also be useful to automate configuration based on "rules" if your middleware is always deployed the same way across a fleet of instances.
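A sketch of an observer combined with a discovery rule (the rule matching a Redis container here is an illustrative assumption):

```yaml
# agent.yaml
observers:
  - type: docker
monitors:
  # Configured automatically for each discovered endpoint matching the rule
  - type: collectd/redis
    discoveryRule: container_image =~ "redis" && port == 6379
```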

Filtering and extra metrics

Every monitor has its own default metrics which always report (shown in bold in the documentation), but also offers non-default metrics, considered "custom", which need to be explicitly enabled with the extraMetrics or extraGroups parameters. Using extraMetrics: ["*"] will accept all metrics from the monitor.
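For example, to enable specific non-default metrics on a monitor (the metric names below are hypothetical):

```yaml
monitors:
  - type: collectd/redis
    host: 127.0.0.1
    port: 6379
    extraMetrics:
      - gauge.slave_repl_offset   # hypothetical non-default metric names
      - counter.expired_keys
```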

Conversely, you may want to filter incoming metrics in or out with datapointsToExclude. See the official dedicated documentation.

For example, it is possible to use a "whitelisting" based filtering policy:

    datapointsToExclude:
      - metricNames:
        - '*'
        - '!i_want_this_metric'
        - '!and_this_one'
        - '!but_no_more'
        - '!than_these_4'

Troubleshooting

  • Check the available endpoints and their available dimensions, to configure service discovery with the right discoveryRule:
$ sudo signalfx-agent status endpoints
  • In case of a collection problem, check whether the corresponding monitor is properly configured:
$ sudo signalfx-agent status monitors
  • If it does not appear in this list, check the SignalFx Smart Agent logs:
$ sudo journalctl -u signalfx-agent -f -n 200
  • Otherwise, check whether values are being sent, with the following command:
$ sudo signalfx-agent tap-dps
