Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,3 +27,29 @@ module "monitor" {
```

## About

<!-- BEGIN_TF_DOCS -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

No modules.

## Resources

No resources.

## Inputs

No inputs.

## Outputs

No outputs.
<!-- END_TF_DOCS -->
11 changes: 9 additions & 2 deletions aws/alb/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ Configures the following for ALBs based on tags matches:

| Name | Version |
|------|---------|
| <a name="provider_datadog"></a> [datadog](#provider\_datadog) | >= 3.37 |
| <a name="provider_datadog"></a> [datadog](#provider\_datadog) | 3.37.0 |

## Modules

Expand All @@ -46,23 +46,26 @@ No modules.
| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br> "resource:alb"<br>]</pre> | no |
| <a name="input_cost_center"></a> [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| <a name="input_dashboard_link"></a> [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| <a name="input_evaluation_delay"></a> [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
| <a name="input_http_5xx_responses_enabled"></a> [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
| <a name="input_http_5xx_responses_evaluation_window"></a> [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_http_5xx_responses_no_data_window"></a> [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_http_5xx_responses_threshold_critical"></a> [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| <a name="input_http_5xx_responses_threshold_warning"></a> [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
| <a name="input_http_5xx_responses_use_message"></a> [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| <a name="input_http_5xx_tg_responses_enabled"></a> [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
| <a name="input_http_5xx_tg_responses_evaluation_window"></a> [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_http_5xx_tg_responses_no_data_window"></a> [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_http_5xx_tg_responses_threshold_critical"></a> [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| <a name="input_http_5xx_tg_responses_threshold_warning"></a> [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
| <a name="input_http_5xx_tg_responses_use_message"></a> [http\_5xx\_tg\_responses\_use\_message](#input\_http\_5xx\_tg\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| <a name="input_latency_enabled"></a> [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
| <a name="input_latency_evaluation_window"></a> [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_latency_no_data_window"></a> [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_latency_threshold_critical"></a> [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
| <a name="input_latency_threshold_warning"></a> [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
| <a name="input_latency_use_message"></a> [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| <a name="input_monitor_exclude_tags"></a> [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| <a name="input_monitor_include_tags"></a> [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
Expand All @@ -71,10 +74,14 @@ No modules.
| <a name="input_no_healthy_instances_no_data_window"></a> [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_no_healthy_instances_threshold_critical"></a> [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Critical threshold (percentage) | `number` | `0` | no |
| <a name="input_no_healthy_instances_threshold_warning"></a> [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
| <a name="input_no_healthy_instances_use_message"></a> [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message | `bool` | `true` | no |
| <a name="input_notify_alert_override"></a> [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_crit_override"></a> [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_default"></a> [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| <a name="input_notify_no_data"></a> [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| <a name="input_notify_nodata_override"></a> [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_nonprod_override"></a> [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_prod_override"></a> [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_recovery_override"></a> [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_warn_override"></a> [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
Expand Down
38 changes: 19 additions & 19 deletions aws/alb/main.tf
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,16 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null

title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}

resource "datadog_monitor" "http_5xx_responses" {
count = var.http_5xx_responses_enabled ? 1 : 0

name = join("", [local.title_prefix, "ALB 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
include_tags = true
message = local.query_alert_base_message
include_tags = false
message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

Expand All @@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" {

query = <<END
min(${var.http_5xx_responses_evaluation_window}):
default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {aws_account,env,loadbalancer,region}.as_rate(), 0) / (
default(avg:aws.applicationelb.request_count${local.query_filter} by {aws_account,env,loadbalancer,region}.as_rate(), 1)
default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}.as_rate(), 0) / (
default(avg:aws.applicationelb.request_count${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}.as_rate(), 1)
) * 100 > ${var.http_5xx_responses_threshold_critical}
END

Expand All @@ -42,8 +42,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {
count = var.http_5xx_tg_responses_enabled ? 1 : 0

name = join("", [local.title_prefix, "ALB Target Group 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
include_tags = true
message = local.query_alert_base_message
include_tags = false
message = var.http_5xx_tg_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

Expand All @@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {

query = <<END
min(${var.http_5xx_tg_responses_evaluation_window}):
default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env}.as_rate(), 0) / (
default(avg:aws.applicationelb.request_count${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env}.as_rate(), 1)
default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env,datadog_managed}.as_rate(), 0) / (
default(avg:aws.applicationelb.request_count${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env,datadog_managed}.as_rate(), 1)
) * 100 > ${var.http_5xx_tg_responses_threshold_critical}
END

Expand All @@ -72,9 +72,9 @@ END
resource "datadog_monitor" "latency" {
count = var.latency_enabled ? 1 : 0

name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB latency - {{value}}s ", local.title_suffix])
include_tags = true
message = local.query_alert_base_message
name = join("", [local.title_prefix, "ALB latency - {{loadbalancer.name}} {{value}}s", local.title_suffix])
include_tags = false
message = var.latency_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

Expand All @@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" {

query = <<END
avg(${var.latency_evaluation_window}):
default(avg:aws.applicationelb.target_response_time.average${local.query_filter} by {aws_account,env,loadbalancer,region}, 0
default(avg:aws.applicationelb.target_response_time.average${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}, 0
) > ${var.latency_threshold_critical}
END

Expand All @@ -101,9 +101,9 @@ END
resource "datadog_monitor" "no_healthy_instances" {
count = var.no_healthy_instances_enabled ? 1 : 0

name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB healthy instances is at {{value}}%", local.title_suffix])
include_tags = true
message = local.query_alert_base_message
name = join("", [local.title_prefix, "ALB available healthy instances - {{loadbalancer.name}} {{value}}%", local.title_suffix])
include_tags = false
message = var.no_healthy_instances_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

Expand All @@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" {

query = <<END
min(${var.no_healthy_instances_evaluation_window}): (
sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,region,loadbalancer} / (
sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,region,loadbalancer} +
sum:aws.applicationelb.un_healthy_host_count.maximum${local.query_filter} by {aws_account,env,region,loadbalancer} )
sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} / (
sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} +
sum:aws.applicationelb.un_healthy_host_count.maximum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} )
) * 100 <= ${var.no_healthy_instances_threshold_critical}
END

Expand Down
32 changes: 28 additions & 4 deletions aws/alb/variables.tf
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ variable "base_tags" {
# HTTP 5xx Response Codes (ALB)
########################################
variable "http_5xx_responses_enabled" {
default = false
default = true
description = "Enable HTTP 5xx response monitor"
type = bool
}
Expand Down Expand Up @@ -46,11 +46,17 @@ variable "http_5xx_responses_threshold_warning" {
type = number
}

variable "http_5xx_responses_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = false
}

########################################
# HTTP 5xx Response Codes (Target Group)
########################################
variable "http_5xx_tg_responses_enabled" {
default = false
default = true
description = "Enable HTTP 5xx response monitor (target group)"
type = bool
}
Expand Down Expand Up @@ -79,11 +85,17 @@ variable "http_5xx_tg_responses_threshold_warning" {
type = number
}

variable "http_5xx_tg_responses_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = false
}

########################################
# Latency Instances
########################################
variable "latency_enabled" {
default = false
default = true
description = "Enable latency monitor"
type = bool
}
Expand All @@ -101,7 +113,7 @@ variable "latency_no_data_window" {
}

variable "latency_threshold_critical" {
default = null
default = 3
description = "Critical threshold (seconds)"
type = number
}
Expand All @@ -112,6 +124,12 @@ variable "latency_threshold_warning" {
type = number
}

variable "latency_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = false
}

########################################
# No Healthy Instances
########################################
Expand Down Expand Up @@ -144,3 +162,9 @@ variable "no_healthy_instances_threshold_warning" {
description = "Warning threshold (percentage)"
type = number
}

variable "no_healthy_instances_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = true
}
Loading
Loading