Guidance
Here is a list of common customization rules, tips and recommendations for configuring the detectors. This is not an exhaustive list but it should cover the key features most users need. More detailed explanations are available on Templating.
First, you need to understand that one metric can generate many different MTS (Metric Time Series), each with its own set of datapoints (depending on the reporting interval of the metric).
An MTS is simply a unique combination of a metric with all of its attached metadata.
The most common metadata type is dimensions, and they are central to the detectors configuration because detector behavior can highly depend on them.
For example, filtering and aggregation use dimensions as their source to operate, and the relevance of alerting on heartbeat detectors depends on these filtering and aggregation "rules".
The metadata available on metrics mainly come from their data source. For example, all metrics coming from AWS share some dimensions in common, like aws_account_id.
The data source also often determines how data are reported. While AWS often does not report data when there is no "change" (e.g. no traffic on an ELB), the SignalFx Smart Agent always reports its metrics at a regular interval. In the same way, agent metrics are real time while cloud integrations add a delay to collect metrics.
So the data source and the metadata are crucial to create fine-grained detectors in this repository, and also very useful to use existing ones and configure them properly. Please add as many metadata as you can to your metrics; this will allow you to configure fine-grained monitoring, using the filtering and aggregation capabilities to adapt the detectors' behavior to each goal.
All modules implement a filtering strategy common to all their detectors based on metadata.
The default behavior is to apply the tagging convention corresponding
to the module and its source.
This is how the modules are "per environment" oriented: they filter by default on a specific environment value, which allows the user to import the module multiple times, once per environment. If this convention cannot apply to some modules, this is explained in their README.md.
To use this convention you do not have anything to configure at the detectors level, but you will
have to add dimensions to your metrics to match the detectors' filtering.
For example, add an env dimension to your data source (like globalDimensions for the
SignalFx Smart Agent, or a tag on an AWS service).
However, this default convention might not fit your requirements; you can override it using the filter_custom_includes and filter_custom_excludes variables and specify your own filtering policy (or none at all), so feel free to change that.
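For example, here is a minimal sketch of overriding the default convention, assuming a hypothetical MySQL module path and version, and that these variables accept a list of "dimension:value" strings (check the module's README.md and the Templating filtering section for the exact format):

```hcl
module "signalfx-detectors-smart-agent-mysql" {
  # Hypothetical module path and version; adapt to the module you actually use.
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mysql?ref=v1.0.0"

  environment   = "prod"
  notifications = local.notifications

  # Replace the default tagging convention with a custom filtering policy.
  filter_custom_includes = ["team:database", "env:prod"]
  filter_custom_excludes = ["sfx_monitored:false"]
}
```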
More information in Templating filtering section.
You can also use these variables to import the same module multiple times with different
filtering policies, to match different resources specifically (i.e. import and filter per host,
which will duplicate detectors in a Nagios-like approach).
In general, we prefer to rely on the "automatic discovery" capability (i.e. like Prometheus), but
it can be useful to apply fine-grained detector configuration to different resources.
If you want to make an exception you can:
- import the module once, reusing the same default filtering in filter_custom_includes but adding a filter in filter_custom_excludes so it does not match your "exception".
- import the module another time, this time matching only your "exception" in filter_custom_includes.
- then apply specific configuration, like thresholds, to the exception while keeping a general "rule" for the others.
When importing the same module multiple times, it is recommended to use the prefixes variable, as in the sketch below.
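As a sketch of this "exception" pattern, assuming the same hypothetical module, path and variable formats as above (the threshold variable name is also hypothetical):

```hcl
# General rule: every production host except the "batch" ones.
module "mysql-detectors-default" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mysql?ref=v1.0.0"

  environment            = "prod"
  notifications          = local.notifications
  filter_custom_includes = ["env:prod"]
  filter_custom_excludes = ["role:batch"]
}

# Exception: only the "batch" hosts, with their own thresholds.
module "mysql-detectors-batch" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mysql?ref=v1.0.0"

  environment            = "prod"
  notifications          = local.notifications
  prefixes               = ["batch"]
  filter_custom_includes = ["env:prod", "role:batch"]

  # Hypothetical threshold variable; the real name depends on the detector.
  threads_connected_threshold_critical = 95
}
```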
The default behavior of SignalFlow is to not aggregate the time series coming from a metric. Indeed, it evaluates every single MTS separately, considering every available combination of metadata values.
Detectors in this repository avoid aggregation by default as much as possible, to work in a maximum of scenarios.
Nevertheless, sometimes we want to aggregate at a "higher" level, for example to evaluate an entire cluster and not each of its members separately. In this case, the only way is to aggregate.
The detectors in this repository are generic (at least by default) and it is not possible to know in advance every metadata available, especially since they depend on each environment. This is why they only use "reserved" dimensions which are always available or, in some cases, special ones which are explained in the module's local README.md.
So, please be careful with a detector which:
- does not have aggregation by default: it will apply to all MTS, so you may prefer to explicitly aggregate at another level which makes more sense in your environment.
- does have a default aggregation: it is probably crucial to make the detector work, and if you change the aggregation you should probably keep every default dimension and only add the others specific to your environment.
A very good example is the heartbeat based detectors, which are very sensitive to this metadata aggregation because it determines the scope of the healthcheck. In general, try to define your own groups explicitly thanks to the aggregation_function variable to fully embrace your context, especially for heartbeats, which can easily create many false alerts if you base their evaluation on "dynamic" or often changing dimension values.
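For example, here is a sketch of explicitly grouping a heartbeat per cluster, assuming a hypothetical Kubernetes module path and a heartbeat_aggregation_function variable name (the real variable names depend on each module; the value is a SignalFlow aggregation string):

```hcl
module "kubernetes-detectors" {
  # Hypothetical module path and version.
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_kubernetes-common?ref=v1.0.0"

  environment   = "prod"
  notifications = local.notifications

  # Evaluate the heartbeat per cluster instead of per MTS, ignoring
  # dynamic dimensions like pod_id.
  heartbeat_aggregation_function = ".mean(by=['kubernetes_cluster'])"
}
```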
More information in Templating aggregation section.
Heartbeat detectors are perfect for monitoring availability: they fire an alert for every group which does not report anymore. In general, each module has its own heartbeat which will check the availability of the data source (i.e. does the database respond?).
As seen before, they highly depend on the aggregation used, which will define the groups to evaluate and consider as "unhealthy":
- avoid using no aggregation at all, because each change on dimensions could lead to a group disappearing and so to an alert. For example, if you remove, add or edit a globalDimensions at agent level, it will probably raise an alert for every heartbeat applied to the corresponding host.
- ignore any "dynamic" dimensions (like pod_id), either by removing them from the data source or by defining the aggregation explicitly at detector level.
- in general, define your custom dimensions, like the level or "business service", to use them properly in filtering or aggregation.
As you can see, we highly recommend defining an explicit aggregation adapted to your scenario for heartbeat detectors, which are a little special.
Some useful information about this:
- VM states are filtered out automatically to support downscaling on GCP, AWS and Azure.
- when an MTS (without aggregation) or a group of MTS (with aggregation) disappears and leads to a heartbeat alert, you need to wait 36h for SignalFx to consider it inactive and stop raising alerts on it. Use a muting rule during this time, as in the sketch below.
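A sketch of such a muting rule, assuming the SignalFx Terraform provider's signalfx_alert_muting_rule resource (the timestamps and the filtered dimension are placeholders; check the provider documentation for the exact arguments):

```hcl
resource "signalfx_alert_muting_rule" "decommissioned_host" {
  description = "Mute alerts for a removed host while its MTS ages out (~36h)"

  # Epoch timestamps in seconds, placeholders here (~36 hours apart).
  start_time = 1700000000
  stop_time  = 1700129600

  filter {
    property       = "host"
    property_value = "old-host-01"
  }
}
```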
More information in Templating heartbeat section.
Every detector in this repository has at least one rule, and every rule represents a different severity level for an alert on the check done by the detector.
You can check the recommended destinations for each severity. Then, you just have to define a list of recipients for each one.
```hcl
locals {
  notification_slack = "Slack,credentialId"
  notification_pager = "PagerDuty,credentialId"

  notifications = {
    critical = [local.notification_slack, local.notification_pager]
    major    = [local.notification_slack, local.notification_pager]
    minor    = [local.notification_slack]
    warning  = [local.notification_slack]
    info     = []
  }
}
```

In this example we forward critical and major alerts to PagerDuty and Slack, minor and warning to Slack only, and info to nothing.
You can use locals and variables to define this binding, and we generally retrieve the
integration id (credentialId) from the output of a configured integration, like the PagerDuty
integration.
In any case, you have to define every possible severity in the object, even if some of them do
not interest you; this is for safety purposes. Of course, you can override this binding at
detector or rule level thanks to the notifications
variable, but the global binding will apply to all detectors which do not have an
overridden value.
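For example, here is a sketch of passing the global binding to a module and overriding it for a single detector, assuming a hypothetical module path and a hypothetical per-detector heartbeat_notifications variable (check the module's README.md for the real names):

```hcl
module "mysql-detectors" {
  # Hypothetical module path and version.
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/smart-agent_mysql?ref=v1.0.0"

  environment = "prod"

  # Global binding applied to every detector of the module.
  notifications = local.notifications

  # Hypothetical override: page only on heartbeat alerts, nothing else.
  heartbeat_notifications = {
    critical = [local.notification_pager]
    major    = []
    minor    = []
    warning  = []
    info     = []
  }
}
```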
More information in Templating notifications section.
The SignalFx Smart Agent is the source of a lot of the data used as metrics by detectors in this repository. This is why it is crucial to know it well, to understand its deployment model, and to apply some tips to match the detectors' expected behavior.
Full configuration options are available on the official documentation.
The standard deployment represents the mode where the agent is installed next to the
service it monitors. For example, collecting metrics from a database like MySQL installed on
the virtual machine where the agent runs.
Detectors are configured, by default, to work in this mode in priority (where a choice has to be made), which generally consists of the aggregation configuration.
But sometimes the agent will collect metrics from an external service, like an AWS RDS endpoint to keep the database example. In this case, it is generally recommended to:
- disable host dimensions using the disableHostDimensions parameter, to not use the hostname of the virtual machine where the agent runs as the host dimension.
- override the host dimension value manually with the extraDimensions parameter, setting it to the RDS name in our example (see the sketch below).
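A sketch of such an agent configuration for the RDS example, using the collectd/mysql monitor (the endpoint, credentials and names are placeholders):

```yaml
# Do not attach the local hostname as the `host` dimension.
disableHostDimensions: true

monitors:
  - type: collectd/mysql
    host: my-db.xxxxxxxx.eu-west-1.rds.amazonaws.com  # placeholder RDS endpoint
    port: 3306
    username: monitoring
    password: changeme  # placeholder
    databases:
      - name: mydb
    extraDimensions:
      # Manually set the `host` dimension to the RDS instance name.
      host: my-db
```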
For Kubernetes we recommend deploying 2 different agent workloads from the Helm chart:
- a mandatory daemonset to monitor each node of the cluster and fetch every internal metric.
- an optional, simple deployment which will run its agent on only one node to monitor, only once, some external targets like webchecks or managed services like AWS RDS or GCP Cloud SQL.
You have to set the isServerless: true option in the chart for this deployment (it will disableHostDimensions as explained above).
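A minimal sketch of the Helm values for that second deployment (only isServerless comes from the text above; where and how monitors are declared depends on the chart version):

```yaml
# values-deployment.yaml: single-replica deployment for external targets.
isServerless: true

# Placeholder: declare here only the monitors targeting external or managed
# services (like the RDS example above), not the node-level ones.
```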
You can add custom dimensions at the global level (applied to all monitors) using
globalDimensions, or to the metrics of the related monitor only using extraDimensions.
It is also possible to fetch dimensions from endpoints discovered by
the service discovery
using the extraDimensionsFromEndpoint parameter.
In contrast, you can also remove every dimension coming from service discovery by configuring
disableEndpointDimensions, or delete a list of specific undesired dimensions by using
dimensionTransformations to map them to no value.
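A sketch combining these options, with placeholder values (the Redis monitor is just an example):

```yaml
# Applied to the metrics of every monitor.
globalDimensions:
  env: prod
  business_service: billing  # placeholder custom dimension

monitors:
  - type: collectd/redis
    host: 127.0.0.1
    port: 6379
    # Applied to the metrics of this monitor only.
    extraDimensions:
      role: cache
    # Drop an undesired dimension by mapping it to no value.
    dimensionTransformations:
      plugin_instance: ""
```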
If the role of monitors is to collect metrics, the role of observers is to discover endpoints.
It is possible to combine both, automatically configuring a monitor for each endpoint discovered by an observer which matches the defined discovery rule.
This is often used in highly dynamic environments like containers, but it can also be useful to automate configuration based on "rules" if your middlewares are always deployed in the same way across a fleet of instances.
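A sketch of this combination, assuming Docker containers and a Redis image (the rule follows the agent's discovery rule syntax):

```yaml
observers:
  # Discover endpoints from the local Docker engine.
  - type: docker

monitors:
  # Automatically configured for every discovered endpoint matching the rule.
  - type: collectd/redis
    discoveryRule: container_image =~ "redis" && port == 6379
```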
Every monitor has its own default metrics which always report (shown in bold in the documentation), but also
proposes non-default metrics which are considered as "custom" and need to be explicitly enabled
from the extraMetrics or extraGroups parameters. Using extraMetrics: ['*'] will accept
all metrics from the monitor.
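For example, here is a sketch enabling a few non-default metrics on a monitor (the metric names are placeholders; check the monitor's documentation for the real ones):

```yaml
monitors:
  - type: collectd/mysql
    host: 127.0.0.1
    port: 3306
    username: monitoring
    password: changeme  # placeholder
    databases:
      - name: mydb
    # Enable specific "custom" metrics on top of the default ones.
    extraMetrics:
      - threads.running      # placeholder metric names
      - mysql_slow_queries
```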
In contrast, you may want to filter incoming metrics in or out with datapointsToExclude.
You can see the official dedicated
documentation.
For example, it is possible to use a "whitelisting" based filtering policy:
```yaml
datapointsToExclude:
  - metricNames:
      - '*'
      - '!i_want_this_metric'
      - '!and_this_one'
      - '!but_no_more'
      - '!than_these_4'
```
- Check available endpoints and their available dimensions to configure the service discovery, defining the right discoveryRule:
$ sudo signalfx-agent status endpoints
- In case of a collection problem, check if the corresponding monitor is properly configured:
$ sudo signalfx-agent status monitors
- If it does not appear in this list, check the SignalFx Smart Agent logs:
$ sudo journalctl -u signalfx-agent -f -n 200
- Else, check if values are sent with the following command:
$ sudo signalfx-agent tap-dps