Guidance
Here is a list of common customization rules, tips, and recommendations for configuring detectors. This is not an exhaustive list, but it should cover the key features most users need. More detailed explanations are available on the Templating page.
First, you need to understand that one metric can generate many different MTS (Metric Time Series), each with its own set of datapoints (depending on the reporting interval of the metric).
An MTS is simply a unique combination of a metric and all the metadata attached to it.
The most widespread type of metadata is dimensions, and they are central to detector configuration because detector behavior can depend heavily on them.
For example, filtering and aggregation operate on dimensions, and the relevance of heartbeat detector alerts depends on these filtering and aggregation "rules".
The metadata available on metrics mainly depends on their data source. For example, every metric from AWS will share some common dimensions such as aws_account_id.
The source also often determines how data is reported. While AWS often does not report data when there is no "change" (e.g. no traffic on an ELB), the SignalFx Smart Agent always sends its metrics at a regular interval. Likewise, agent metrics are close to real time, while cloud integrations add a delay to metric collection.
So the data source and its metadata are crucial for creating fine-grained detectors in this repository, and just as useful for using existing ones and configuring them properly.
- the monitoring configuration should be as generic as possible and rely on metadata from the sources
- every module implements a default tagging convention (generally based on user inputs)
- the usual logic is to filter in on (see the sketch after this list):
  - `sfx_monitored:true`, a common flag to enable alerting on a resource (or to ignore some of them)
  - `env`, set by the user from the module common `environment` variable
- this could differ depending on the source of the metrics, either because dimensions are prefixed (aws_tag...) or not collected / available (NewRelic, AWS VPN, GCP).
- if constraints do not allow you to match this convention, feel free to override it with a custom one.
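As an illustration, a minimal SignalFlow sketch of this filtering convention could look like the following; the metric name, threshold and environment value are placeholders to adapt:

```python
# Hypothetical detector signal: keep only resources flagged for monitoring
# and scoped to the targeted environment (placeholder values).
signal = data('cpu.utilization',
              filter=filter('sfx_monitored', 'true') and filter('env', 'production')
             ).publish('signal')

# Placeholder threshold rule, evaluated on every MTS matching the filters above.
detect(when(signal > 90, lasting='15m')).publish('CPU utilization too high')
```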
- do not aggregate: every single MTS is evaluated separately, considering every available combination of dimension values. Advantage: it applies to every reporting resource, whatever the situation, without having to know them in advance. Drawback: it is sensitive to any dimension change (a new granularity or the disappearance of an MTS).
- aggregate on a set of dimension(s): this "groups" multiple MTS into one, restricting the evaluation to these dimensions only (i.e. MTS without one of the dimension keys will be ignored). Advantage: the behavior is always the same (grouping and granularity do not change). Drawback: you must know a valid and available set of dimensions to define the right group, and these highly depend on the environment, deployment, configuration, etc. (see the sketch after this list).
- ignoring specific dimension(s) is only possible by aggregating on the other ones (it is impossible without aggregation).
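For example, assuming a `cpu.utilization` metric carrying `host` and `cluster` dimensions (assumed names), the two approaches could be sketched in SignalFlow as:

```python
# No aggregation: every MTS (here, every host) is evaluated separately.
per_mts = data('cpu.utilization').publish('per_mts')

# Aggregation on the `cluster` dimension: all MTS sharing the same cluster value
# are grouped into one series; MTS without a `cluster` dimension are ignored.
per_cluster = data('cpu.utilization').mean(by=['cluster']).publish('per_cluster')
```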
In general, not aggregating is the most generic and easiest way, but:
- sometimes we want to evaluate at a "higher" level (e.g. not per host but for the entire cluster)
- some use cases can be very sensitive to dimension / grouping changes (e.g. heartbeat)
- heartbeat detection is perfect for health checks since it fires an alert for every group which stops reporting.
- but it highly depends on the aggregation, which defines the groups to evaluate and consider as "unhealthy".
- not aggregating makes the implementation generic but will lead to an alert for every single disappearing MTS (a simple dimension change removes the old MTS and creates a new one).
- on the other hand, the aggregation group to define is not always the same and cannot be a universal default.
- indeed, dimensions can change depending on environment and configuration, and even with the same dimensions the user may want a different alerting granularity (by host, by cluster, ...).
- as much as possible, modules do not use aggregation, which works for basic scenarios.
- some modules use aggregation because the monitor provides a granularity that is too high and not relevant for heartbeat (e.g. the `database` dimension on `postgresql` would lead to an alert for every dropped database).
- but in both cases we highly recommend defining the aggregation adapted to your scenario, depending on the available dimensions and what you expect to monitor (see the sketch after this list).
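As a sketch only, a heartbeat detector based on the SignalFlow `not_reporting` library could look like the following; the metric name and the `host` grouping are assumptions to adapt to the dimensions actually available in your environment:

```python
from signalfx.detectors.not_reporting import not_reporting

# Placeholder metric; group the heartbeat by host so a disappearing host raises
# one alert, while other dimension changes on that host do not.
signal = data('cpu.utilization').mean(by=['host']).publish('signal')

# Alert when a group stops reporting for the given timeframe (placeholder duration).
not_reporting.detector(stream=signal, resource_identifier=None, duration='25m').publish('heartbeat')
```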
- it is also possible to configure
- VM states are filtered out automatically to support downscaling on GCP, AWS and Azure.
- when an MTS (without aggregation) or a group of MTS (with aggregation) disappears and triggers a heartbeat alert, you need to wait 36h for SignalFx to consider it inactive and stop raising alerts on it.
- severities
- levels definition / best practices
- how to do the mapping, with examples
- standard deployment and others (disableHostDimensions + extraDimensions)
- disableEndpointDimensions
- dimensionTransformations
- datapointsToExclude (whitelist filtering)
- service discovery
- kubernetes 2 deployments
- use `globalDimensions` with caution because it will impact every metric (and could generate a heartbeat alert for each little change)
- use `extraDimensions` to add scope / context to detectors (see the sketch below)
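As an illustration only, a minimal SignalFx Smart Agent configuration sketch combining these two options could look like this (the monitor type and dimension values are placeholders to adapt):

```yaml
# Added to every metric sent by the agent: use sparingly (see the caution above).
globalDimensions:
  env: production

monitors:
  # Placeholder monitor: adapt the type and its options to your own setup.
  - type: cpu
    # Added only to the metrics of this monitor, useful to give scope / context
    # to detectors (e.g. to match the sfx_monitored filtering convention).
    extraDimensions:
      sfx_monitored: "true"
```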