diff --git a/README.md b/README.md
index 62593d8..80c209f 100644
--- a/README.md
+++ b/README.md
@@ -27,3 +27,29 @@ module "monitor" {
```
## About
+
+
+## Requirements
+
+No requirements.
+
+## Providers
+
+No providers.
+
+## Modules
+
+No modules.
+
+## Resources
+
+No resources.
+
+## Inputs
+
+No inputs.
+
+## Outputs
+
+No outputs.
+
diff --git a/aws/alb/README.md b/aws/alb/README.md
index ec3899f..039a2bb 100644
--- a/aws/alb/README.md
+++ b/aws/alb/README.md
@@ -20,7 +20,7 @@ Configures the following for ALBs based on tags matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -46,23 +46,26 @@ No modules.
| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:alb"]` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [http\_5xx\_tg\_responses\_use\_message](#input\_http\_5xx\_tg\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
@@ -71,10 +74,14 @@ No modules.
| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Critical threshold (percentage) | `number` | `0` | no |
| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
+| [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message | `bool` | `true` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
diff --git a/aws/alb/main.tf b/aws/alb/main.tf
index 30458f7..e5449ca 100644
--- a/aws/alb/main.tf
+++ b/aws/alb/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -12,8 +12,8 @@ resource "datadog_monitor" "http_5xx_responses" {
count = var.http_5xx_responses_enabled ? 1 : 0
name = join("", [local.title_prefix, "ALB 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" {
query = < ${var.http_5xx_responses_threshold_critical}
END
@@ -42,8 +42,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {
count = var.http_5xx_tg_responses_enabled ? 1 : 0
name = join("", [local.title_prefix, "ALB Target Group 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.http_5xx_tg_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {
query = < ${var.http_5xx_tg_responses_threshold_critical}
END
@@ -72,9 +72,9 @@ END
resource "datadog_monitor" "latency" {
count = var.latency_enabled ? 1 : 0
- name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB latency - {{value}}s ", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ALB latency - {{loadbalancer.name}} {{value}}s", local.title_suffix])
+ include_tags = false
+ message = var.latency_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" {
query = < ${var.latency_threshold_critical}
END
@@ -101,9 +101,9 @@ END
resource "datadog_monitor" "no_healthy_instances" {
count = var.no_healthy_instances_enabled ? 1 : 0
- name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB healthy instances is at {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ALB available healthy instances - {{loadbalancer.name}} {{value}}%", local.title_suffix])
+ include_tags = false
+ message = var.no_healthy_instances_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" {
query = < [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -42,25 +42,30 @@ No modules.
| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:apigateway"]` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `0.75` | no |
| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `0.25` | no |
+| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx responses monitor | `bool` | `false` | no |
| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message for the latency monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
diff --git a/aws/apigateway/main.tf b/aws/apigateway/main.tf
index 02033c6..f624851 100644
--- a/aws/apigateway/main.tf
+++ b/aws/apigateway/main.tf
@@ -4,16 +4,16 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
resource "datadog_monitor" "http_5xx_responses" {
count = var.http_5xx_responses_enabled ? 1 : 0
- name = join("", [local.title_prefix, "API Gateway 5xx Responses - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "API Gateway 5xx Responses - {{apiname.name}}", local.title_suffix])
+ include_tags = false
+ message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" {
query = < ${var.http_5xx_responses_threshold_critical}
END
@@ -41,9 +41,9 @@ END
resource "datadog_monitor" "latency" {
count = var.latency_enabled ? 1 : 0
- name = join("", [local.title_prefix, "API Gateway latency - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "API Gateway latency - {{apiname.name}}", local.title_suffix])
+ include_tags = false
+ message = var.latency_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -57,7 +57,7 @@ resource "datadog_monitor" "latency" {
query = < ${var.latency_threshold_critical}
END
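Reviewer note: the two patterns this change repeats in every `main.tf` (the simplified `title_prefix` conditional and the opt-in message) can be tried in isolation. The sketch below is a minimal, provider-free configuration and not the module's actual code. `use_message`, `base_message`, and the "Example Monitor" title are hypothetical stand-ins. In the module, the real message comes from `local.query_alert_base_message`, which is defined outside this diff.

```hcl
# Standalone sketch; all names below are illustrative, not taken from the module.
variable "title_prefix" {
  description = "Optional prefix wrapped in square brackets"
  type        = string
  default     = null
}

variable "use_message" {
  description = "Hypothetical stand-in for the per-monitor *_use_message inputs"
  type        = bool
  default     = false
}

locals {
  # Placeholder for local.query_alert_base_message, which this diff does not show.
  base_message = "placeholder base message"

  # A bare conditional expression; wrapping it in "${...}" is redundant.
  title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"

  # Opt-in message: an empty string unless the flag is set.
  message = var.use_message ? local.base_message : ""
}

output "monitor_name" {
  value = join("", [local.title_prefix, "Example Monitor", ""])
}

output "monitor_message" {
  value = local.message
}
```

With the defaults, `monitor_name` evaluates to `Example Monitor` and `monitor_message` to an empty string. Passing `-var 'use_message=true'` switches the message to the placeholder text.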
diff --git a/aws/apigateway/variables.tf b/aws/apigateway/variables.tf
index 14d282a..d5eb215 100644
--- a/aws/apigateway/variables.tf
+++ b/aws/apigateway/variables.tf
@@ -46,6 +46,12 @@ variable "http_5xx_responses_threshold_warning" {
type = number
}
+variable "http_5xx_responses_use_message" {
+ description = "Whether to use the query alert base message for HTTP 5xx responses monitor"
+ type = bool
+ default = false
+}
+
########################################
# Latency Instances
########################################
@@ -78,3 +84,9 @@ variable "latency_threshold_warning" {
description = "Warning threshold (seconds)"
type = number
}
+
+variable "latency_use_message" {
+ description = "Whether to use the query alert base message for the latency monitor"
+ type = bool
+ default = false
+}
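Reviewer note: a caller might opt in to the base message per monitor roughly as follows. This is a sketch under assumptions. The `source` path and the notification handle are placeholders, and the module may require inputs that this diff does not show.

```hcl
# Hypothetical caller of the API Gateway module.
module "apigateway_monitors" {
  source = "./aws/apigateway" # assumed relative path to the module

  # Listed as required in the README table; the handle is a placeholder.
  notify_default = ["@slack-example-alerts"]

  # Both monitors default to disabled.
  http_5xx_responses_enabled = true
  latency_enabled            = true

  # The README lists this default as null, so a value is likely needed
  # once the latency monitor is enabled. The number is illustrative.
  latency_threshold_critical = 2

  # Added in this change; both default to false, which leaves the
  # monitor message as an empty string.
  http_5xx_responses_use_message = true
  latency_use_message            = false
}
```

Existing callers that relied on the base message being attached unconditionally would need to set the relevant `*_use_message` inputs to `true` to keep the previous behavior.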
diff --git a/aws/beanstalk/README.md b/aws/beanstalk/README.md
index 007fd00..84f314b 100644
--- a/aws/beanstalk/README.md
+++ b/aws/beanstalk/README.md
@@ -20,7 +20,7 @@ Configures the following for Beanstalk environments based on tags matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -46,31 +46,37 @@ No modules.
| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:beanstalk"]` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [health\_enabled](#input\_health\_enabled) | Enable Beanstalk health monitor (requires enhanced metrics) | `bool` | `false` | no |
| [health\_evaluation\_window](#input\_health\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [health\_no\_data\_window](#input\_health\_no\_data\_window) | No data threshold (minutes) | `number` | `20` | no |
| [health\_threshold\_critical](#input\_health\_threshold\_critical) | Critical threshold (0 = OK, 1 = Info, 5 = Unknown, 10 = No data, 15 = Warning, 20 = Degraded, 25 = Severe) | `number` | `25` | no |
| [health\_threshold\_warning](#input\_health\_threshold\_warning) | Warning threshold (0 = OK, 1 = Info, 5 = Unknown, 10 = No data, 15 = Warning, 20 = Degraded, 25 = Severe) | `number` | `20` | no |
+| [health\_use\_message](#input\_health\_use\_message) | Whether to use the query alert base message for health monitor | `bool` | `false` | no |
| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage) | `number` | `75` | no |
| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage) | `number` | `25` | no |
+| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx responses monitor | `bool` | `false` | no |
| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [latency\_measurement](#input\_latency\_measurement) | Latency Measurement. Valid options: p10, p50, p75, p85, p90, p95, p99, p99\_9 | `string` | `"p50"` | no |
| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message for latency monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
@@ -79,6 +85,7 @@ No modules.
| [root\_disk\_usage\_no\_data\_window](#input\_root\_disk\_usage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [root\_disk\_usage\_threshold\_critical](#input\_root\_disk\_usage\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
| [root\_disk\_usage\_threshold\_warning](#input\_root\_disk\_usage\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [root\_disk\_usage\_use\_message](#input\_root\_disk\_usage\_use\_message) | Whether to use the query alert base message for root disk usage monitor | `bool` | `false` | no |
| [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
| [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
diff --git a/aws/beanstalk/main.tf b/aws/beanstalk/main.tf
index f55018b..7fe3814 100644
--- a/aws/beanstalk/main.tf
+++ b/aws/beanstalk/main.tf
@@ -17,16 +17,16 @@ locals {
latency_metric = local.latency_metric_map[var.latency_measurement]
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
resource "datadog_monitor" "health" {
count = var.health_enabled ? 1 : 0
- name = join("", [local.title_prefix, "Beanstalk Health Events - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "Beanstalk Health Events - {{environmentname.name}}", local.title_suffix])
+ include_tags = false
+ message = var.health_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "metric alert"
@@ -40,7 +40,7 @@ resource "datadog_monitor" "health" {
query = <= ${var.health_threshold_critical}
END
@@ -53,9 +53,9 @@ END
resource "datadog_monitor" "http_5xx_responses" {
count = var.http_5xx_responses_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ALB 5xx Responses - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ALB 5xx Responses - {{environmentname.name}}", local.title_suffix])
+ include_tags = false
+ message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -69,8 +69,8 @@ resource "datadog_monitor" "http_5xx_responses" {
query = < ${var.http_5xx_responses_threshold_critical}
END
@@ -83,9 +83,9 @@ END
resource "datadog_monitor" "latency" {
count = var.latency_enabled ? 1 : 0
- name = join("", [local.title_prefix, "Beanstalk Latency - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "Beanstalk Latency - {{environmentname.name}}", local.title_suffix])
+ include_tags = false
+ message = var.latency_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -98,7 +98,7 @@ resource "datadog_monitor" "latency" {
timeout_h = var.timeout_h
query = <= ${var.latency_threshold_critical}
END
@@ -111,9 +111,9 @@ END
resource "datadog_monitor" "root_disk_usage" {
count = var.root_disk_usage_enabled ? 1 : 0
- name = join("", [local.title_prefix, "Beanstalk Instance Root Disk Usage - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "Beanstalk Instance Root Disk Usage - {{environmentname.name}}", local.title_suffix])
+ include_tags = false
+ message = var.root_disk_usage_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -127,7 +127,7 @@ resource "datadog_monitor" "root_disk_usage" {
query = <= ${var.root_disk_usage_threshold_critical}
END
diff --git a/aws/beanstalk/variables.tf b/aws/beanstalk/variables.tf
index 451d74d..c537346 100644
--- a/aws/beanstalk/variables.tf
+++ b/aws/beanstalk/variables.tf
@@ -68,6 +68,12 @@ Warning threshold (
END
}
+variable "health_use_message" {
+ description = "Whether to use the query alert base message for health monitor"
+ type = bool
+ default = false
+}
+
########################################
# HTTP 5xx Responses
########################################
@@ -101,6 +107,12 @@ variable "http_5xx_responses_threshold_warning" {
type = number
}
+variable "http_5xx_responses_use_message" {
+ description = "Whether to use the query alert base message for HTTP 5xx responses monitor"
+ type = bool
+ default = false
+}
+
########################################
# Latency Instances
########################################
@@ -153,6 +165,12 @@ variable "latency_threshold_warning" {
type = number
}
+variable "latency_use_message" {
+ description = "Whether to use the query alert base message for latency monitor"
+ type = bool
+ default = false
+}
+
########################################
# Root FS Disk Usage
########################################
@@ -185,3 +203,9 @@ variable "root_disk_usage_threshold_warning" {
description = "Warning threshold (percent)"
type = number
}
+
+variable "root_disk_usage_use_message" {
+ description = "Whether to use the query alert base message for root disk usage monitor"
+ type = bool
+ default = false
+}
diff --git a/aws/ec2/README.md b/aws/ec2/README.md
index 9feda50..7679e19 100644
--- a/aws/ec2/README.md
+++ b/aws/ec2/README.md
@@ -17,7 +17,7 @@ All checks are enabled by default.
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -43,40 +43,39 @@ No modules.
| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:ec2"]` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
| [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
| [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
-| [status\_failed\_check\_enabled](#input\_status\_failed\_check\_enabled) | Enable ec2 instance status check monitor | `bool` | `false` | no |
+| [status\_failed\_check\_enabled](#input\_status\_failed\_check\_enabled) | Enable ec2 instance status check monitor | `bool` | `true` | no |
| [status\_failed\_check\_evaluation\_window](#input\_status\_failed\_check\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [status\_failed\_check\_no\_data\_window](#input\_status\_failed\_check\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [status\_failed\_check\_threshold\_critical](#input\_status\_failed\_check\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [status\_failed\_check\_threshold\_warning](#input\_status\_failed\_check\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [status\_failed\_instance\_enabled](#input\_status\_failed\_instance\_enabled) | Enable instance status check monitor | `bool` | `false` | no |
+| [status\_failed\_check\_use\_message](#input\_status\_failed\_check\_use\_message) | Whether to use the query alert base message for ec2 instance status check monitor | `bool` | `false` | no |
+| [status\_failed\_instance\_enabled](#input\_status\_failed\_instance\_enabled) | Enable instance status check monitor | `bool` | `true` | no |
| [status\_failed\_instance\_evaluation\_window](#input\_status\_failed\_instance\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [status\_failed\_instance\_no\_data\_window](#input\_status\_failed\_instance\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [status\_failed\_instance\_threshold\_critical](#input\_status\_failed\_instance\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [status\_failed\_instance\_threshold\_warning](#input\_status\_failed\_instance\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [status\_failed\_system\_enabled](#input\_status\_failed\_system\_enabled) | Enable instance system failure monitor | `bool` | `false` | no |
+| [status\_failed\_instance\_use\_message](#input\_status\_failed\_instance\_use\_message) | Whether to use the query alert base message for instance status check monitor | `bool` | `false` | no |
+| [status\_failed\_system\_enabled](#input\_status\_failed\_system\_enabled) | Enable instance system failure monitor | `bool` | `true` | no |
| [status\_failed\_system\_evaluation\_window](#input\_status\_failed\_system\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [status\_failed\_system\_no\_data\_window](#input\_status\_failed\_system\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [status\_failed\_system\_threshold\_critical](#input\_status\_failed\_system\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [status\_failed\_system\_threshold\_warning](#input\_status\_failed\_system\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [status\_failed\_volume\_enabled](#input\_status\_failed\_volume\_enabled) | Enable attached volume status monitor | `bool` | `false` | no |
+| [status\_failed\_system\_use\_message](#input\_status\_failed\_system\_use\_message) | Whether to use the query alert base message for instance system failure monitor | `bool` | `false` | no |
+| [status\_failed\_volume\_enabled](#input\_status\_failed\_volume\_enabled) | Enable attached volume status monitor | `bool` | `true` | no |
| [status\_failed\_volume\_evaluation\_window](#input\_status\_failed\_volume\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [status\_failed\_volume\_no\_data\_window](#input\_status\_failed\_volume\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [status\_failed\_volume\_threshold\_critical](#input\_status\_failed\_volume\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [status\_failed\_volume\_threshold\_warning](#input\_status\_failed\_volume\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [status\_failed\_volume\_use\_message](#input\_status\_failed\_volume\_use\_message) | Whether to use the query alert base message for attached volume status monitor | `bool` | `false` | no |
| [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
| [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |
diff --git a/aws/ec2/main.tf b/aws/ec2/main.tf
index 3a75582..337c979 100644
--- a/aws/ec2/main.tf
+++ b/aws/ec2/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -12,8 +12,8 @@ resource "datadog_monitor" "status_failed_check" {
count = var.status_failed_check_enabled ? 1 : 0
name = join("", [local.title_prefix, "EC2 instance status - status check failure - {{name.name}}({{instance_id.name}})", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.status_failed_check_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -26,7 +26,7 @@ resource "datadog_monitor" "status_failed_check" {
query = <= 1
END
@@ -39,8 +39,8 @@ resource "datadog_monitor" "status_failed_instance" {
count = var.status_failed_instance_enabled ? 1 : 0
name = join("", [local.title_prefix, "EC2 instance status - instance failure - {{name.name}}({{instance_id.name}})", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.status_failed_instance_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -53,7 +53,7 @@ resource "datadog_monitor" "status_failed_instance" {
query = <= 1
END
@@ -66,8 +66,8 @@ resource "datadog_monitor" "status_failed_system" {
count = var.status_failed_system_enabled ? 1 : 0
name = join("", [local.title_prefix, "EC2 instance status - host failure - {{name.name}}({{instance_id.name}})", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.status_failed_system_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -80,7 +80,7 @@ resource "datadog_monitor" "status_failed_system" {
query = <= 1
END
@@ -93,8 +93,8 @@ resource "datadog_monitor" "status_failed_volume" {
count = var.status_failed_volume_enabled ? 1 : 0
name = join("", [local.title_prefix, "EC2 instance status - volume failure - {{name.name}}({{instance_id.name}})", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.status_failed_volume_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -107,7 +107,7 @@ resource "datadog_monitor" "status_failed_volume" {
query = <= 1
END
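
The `message` change repeated across the four monitors above follows one pattern: a per-monitor boolean selects between the shared base message and an empty string. A minimal standalone sketch of that pattern, assuming a placeholder value for `local.query_alert_base_message` (its real definition is not shown in this diff) and a hypothetical variable named `example_use_message`:

```hcl
# Hypothetical toggle, mirroring the *_use_message variables added in this change.
variable "example_use_message" {
  description = "Whether to use the query alert base message for the example monitor"
  type        = bool
  default     = false
}

locals {
  # Placeholder: the real base message is defined elsewhere in the module.
  query_alert_base_message = "Example base message"

  # Same conditional as the monitors above:
  # base message when the toggle is true, empty string otherwise.
  example_message = var.example_use_message ? local.query_alert_base_message : ""
}

output "example_message" {
  value = local.example_message
}
```

With the default of `false` the output should be an empty string; passing `-var example_use_message=true` should surface the placeholder text instead.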
diff --git a/aws/ec2/variables.tf b/aws/ec2/variables.tf
index b27bf0d..6aaed78 100644
--- a/aws/ec2/variables.tf
+++ b/aws/ec2/variables.tf
@@ -34,6 +34,12 @@ variable "status_failed_check_no_data_window" {
type = number
}
+variable "status_failed_check_use_message" {
+ description = "Whether to use the query alert base message for ec2 instance status check monitor"
+ type = bool
+ default = false
+}
+
########################################
# Instance status check
########################################
@@ -55,6 +61,12 @@ variable "status_failed_instance_no_data_window" {
type = number
}
+variable "status_failed_instance_use_message" {
+ description = "Whether to use the query alert base message for instance status check monitor"
+ type = bool
+ default = false
+}
+
#####################################
# system host status check
########################################
@@ -76,6 +88,12 @@ variable "status_failed_system_no_data_window" {
type = number
}
+variable "status_failed_system_use_message" {
+ description = "Whether to use the query alert base message for instance system failure monitor"
+ type = bool
+ default = false
+}
+
#####################################
# Attached volume status check
########################################
@@ -96,3 +114,9 @@ variable "status_failed_volume_no_data_window" {
description = "No data threshold (in minutes, 0 to disable)"
type = number
}
+
+variable "status_failed_volume_use_message" {
+ description = "Whether to use the query alert base message for attached volume status monitor"
+ type = bool
+ default = false
+}
diff --git a/aws/ecs-cluster/README.md b/aws/ecs-cluster/README.md
index 479ac5c..cdbab68 100644
--- a/aws/ecs-cluster/README.md
+++ b/aws/ecs-cluster/README.md
@@ -19,7 +19,7 @@ Configures the following for ECS clusters based on tags matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -44,6 +44,7 @@ No modules.
| [agent\_status\_no\_data\_window](#input\_agent\_status\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [agent\_status\_threshold\_critical](#input\_agent\_status\_threshold\_critical) | Critical threshold | `number` | `5` | no |
| [agent\_status\_threshold\_warning](#input\_agent\_status\_threshold\_warning) | Warning threshold | `number` | `3` | no |
+| [agent\_status\_use\_message](#input\_agent\_status\_use\_message) | Whether to use the query alert base message for agent status monitor | `bool` | `false` | no |
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
@@ -59,26 +60,32 @@ No modules.
| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no |
| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable cluster CPU utilization monitor | `bool` | `false` | no |
| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [memory\_reservation\_enabled](#input\_memory\_reservation\_enabled) | Enable cluster memory reservation monitor | `bool` | `false` | no |
| [memory\_reservation\_evaluation\_window](#input\_memory\_reservation\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_15m"` | no |
| [memory\_reservation\_no\_data\_window](#input\_memory\_reservation\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [memory\_reservation\_threshold\_critical](#input\_memory\_reservation\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
| [memory\_reservation\_threshold\_warning](#input\_memory\_reservation\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [memory\_reservation\_use\_message](#input\_memory\_reservation\_use\_message) | Whether to use the query alert base message for memory reservation monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
diff --git a/aws/ecs-cluster/main.tf b/aws/ecs-cluster/main.tf
index 82da113..60e5208 100644
--- a/aws/ecs-cluster/main.tf
+++ b/aws/ecs-cluster/main.tf
@@ -5,7 +5,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -13,10 +13,10 @@ resource "datadog_monitor" "agent_status" {
count = var.agent_status_enabled ? 1 : 0
name = join("", [local.title_prefix, "ECS Agent disconnected - {{clustername.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.agent_status_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
- type = "service check"
+ type = "service check"
evaluation_delay = var.evaluation_delay
new_group_delay = var.new_group_delay
@@ -27,7 +27,7 @@ resource "datadog_monitor" "agent_status" {
timeout_h = var.timeout_h
query = < ${var.cpu_utilization_threshold_critical}
END
@@ -69,8 +69,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
count = var.cpu_utilization_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "ECS cluster CPU utilization anomalous activity - {{clustername.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -84,7 +84,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
query = <= ${var.cpu_utilization_anomaly_threshold_critical}
@@ -105,8 +105,8 @@ resource "datadog_monitor" "memory_reservation" {
count = var.memory_reservation_enabled ? 1 : 0
name = join("", [local.title_prefix, "ECS Cluster Memory Reservation High - {{clustername.name}} - {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.memory_reservation_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -120,7 +120,7 @@ resource "datadog_monitor" "memory_reservation" {
query = < ${var.memory_reservation_threshold_critical}
END
diff --git a/aws/ecs-cluster/variables.tf b/aws/ecs-cluster/variables.tf
index e6cd277..6671c12 100644
--- a/aws/ecs-cluster/variables.tf
+++ b/aws/ecs-cluster/variables.tf
@@ -46,6 +46,12 @@ variable "agent_status_threshold_warning" {
type = number
}
+variable "agent_status_use_message" {
+ description = "Whether to use the query alert base message for agent status monitor"
+ type = bool
+ default = false
+}
+
########################################
# Cluster CPU Utilization
########################################
@@ -79,6 +85,12 @@ variable "cpu_utilization_threshold_warning" {
type = number
}
+variable "cpu_utilization_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization monitor"
+ type = bool
+ default = false
+}
+
########################################
# CPU Utilization (anomaly detection)
########################################
@@ -142,6 +154,12 @@ variable "cpu_utilization_anomaly_threshold_warning" {
type = number
}
+variable "cpu_utilization_anomaly_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization anomaly monitor"
+ type = bool
+ default = false
+}
+
########################################
# Cluster Memory Reservation
########################################
@@ -173,3 +191,9 @@ variable "memory_reservation_threshold_warning" {
description = "Warning threshold (percent)"
type = number
}
+
+variable "memory_reservation_use_message" {
+ description = "Whether to use the query alert base message for memory reservation monitor"
+ type = bool
+ default = false
+}
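
As a rough illustration of how a caller might use the new toggles, the sketch below enables two ECS cluster monitors and attaches the base message to only one of them. The module label, `source` path, and notification handle are hypothetical; the input names come from the README table in this change, and any other required inputs not visible in this excerpt would still need to be supplied.

```hcl
module "ecs_cluster_monitors" {
  # Hypothetical source path; point this at wherever the module actually lives.
  source = "./aws/ecs-cluster"

  # Required input per the README; the handle below is a placeholder.
  notify_default = ["@example-team"]

  # Enable two monitors, but attach the base message to only one of them.
  cpu_utilization_enabled        = true
  cpu_utilization_use_message    = true
  memory_reservation_enabled     = true
  memory_reservation_use_message = false
}
```

Because every `*_use_message` variable defaults to `false`, callers who relied on the base message appearing in every alert would need to opt in per monitor after this change.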
diff --git a/aws/ecs-fargate/README.md b/aws/ecs-fargate/README.md
index fc4875e..9977961 100644
--- a/aws/ecs-fargate/README.md
+++ b/aws/ecs-fargate/README.md
@@ -19,7 +19,7 @@ Configures the following for ECS Fargate tasks based on tag matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -54,32 +54,39 @@ No modules.
| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no |
| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable Fargate task CPU utilization monitor | `bool` | `false` | no |
| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
-| [fargate\_check\_enabled](#input\_fargate\_check\_enabled) | Enable Fargate check monitor | `bool` | `false` | no |
+| [fargate\_check\_enabled](#input\_fargate\_check\_enabled) | Enable Fargate check monitor | `bool` | `true` | no |
| [fargate\_check\_evaluation\_window](#input\_fargate\_check\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
| [fargate\_check\_group\_by](#input\_fargate\_check\_group\_by) | Tag to group alerts by (will result in multiple alerts being generated based on tag cardinality) | `string` | `"*"` | no |
| [fargate\_check\_no\_data\_window](#input\_fargate\_check\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [fargate\_check\_threshold\_critical](#input\_fargate\_check\_threshold\_critical) | Critical threshold | `number` | `5` | no |
| [fargate\_check\_threshold\_warning](#input\_fargate\_check\_threshold\_warning) | Warning threshold | `number` | `3` | no |
+| [fargate\_check\_use\_message](#input\_fargate\_check\_use\_message) | Whether to use the query alert base message for Fargate check monitor | `bool` | `false` | no |
| [memory\_utilization\_enabled](#input\_memory\_utilization\_enabled) | Enable Fargate task memory utilization monitor | `bool` | `false` | no |
| [memory\_utilization\_evaluation\_window](#input\_memory\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_15m"` | no |
| [memory\_utilization\_no\_data\_window](#input\_memory\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [memory\_utilization\_threshold\_critical](#input\_memory\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
| [memory\_utilization\_threshold\_warning](#input\_memory\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [memory\_utilization\_use\_message](#input\_memory\_utilization\_use\_message) | Whether to use the query alert base message for memory utilization monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
diff --git a/aws/ecs-fargate/main.tf b/aws/ecs-fargate/main.tf
index 7bd1431..5b192a1 100644
--- a/aws/ecs-fargate/main.tf
+++ b/aws/ecs-fargate/main.tf
@@ -5,7 +5,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -13,8 +13,8 @@ resource "datadog_monitor" "fargate_check" {
count = var.fargate_check_enabled ? 1 : 0
name = join("", [local.title_prefix, "Fargate service not responding", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.fargate_check_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "service check"
@@ -40,9 +40,9 @@ END
resource "datadog_monitor" "cpu_utilization" {
count = var.cpu_utilization_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ECS Fargate task CPU utilization", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ECS Fargate task CPU utilization - {{ecs_cluster}} ({{task_family}})", local.title_suffix])
+ include_tags = false
+ message = var.cpu_utilization_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -56,7 +56,7 @@ resource "datadog_monitor" "cpu_utilization" {
query = < ${var.cpu_utilization_threshold_critical}
END
@@ -69,9 +69,9 @@ END
resource "datadog_monitor" "cpu_utilization_anomaly" {
count = var.cpu_utilization_anomaly_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ECS service CPU utilization anomalous activity", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ECS service CPU utilization anomalous activity - {{ecs_cluster}} ({{task_family}})", local.title_suffix])
+ include_tags = false
+ message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -85,7 +85,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
query = <= ${var.cpu_utilization_anomaly_threshold_critical}
@@ -105,9 +105,9 @@ END
resource "datadog_monitor" "memory_utilization" {
count = var.memory_utilization_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ECS Fargate task memory utilization", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ECS Fargate task memory utilization - {{ecs_cluster}} ({{task_family}})", local.title_suffix])
+ include_tags = false
+ message = var.memory_utilization_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -121,8 +121,8 @@ resource "datadog_monitor" "memory_utilization" {
query = <= ${var.memory_utilization_threshold_critical}
END
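Review note: the `title_prefix` change above drops a redundant outer `"${...}"` wrapper around the conditional. A minimal standalone sketch of the pattern follows, assuming two nullable string variables like the ones the module references; the `example_name` output is hypothetical and exists only to show how the pieces are joined.

```hcl
variable "title_prefix" {
  type    = string
  default = null
}

variable "title_suffix" {
  type    = string
  default = null
}

locals {
  # Bare conditional, as in the updated line: no outer "${...}" is needed,
  # because the expression already evaluates to a string.
  title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
  title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}

# Hypothetical output, for illustration only.
output "example_name" {
  value = join("", [local.title_prefix, "Fargate service not responding", local.title_suffix])
}
```

Both forms should produce the same string; the bare form simply avoids the interpolation-only wrapper.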
diff --git a/aws/ecs-fargate/variables.tf b/aws/ecs-fargate/variables.tf
index 844ddb0..272f46e 100644
--- a/aws/ecs-fargate/variables.tf
+++ b/aws/ecs-fargate/variables.tf
@@ -17,7 +17,7 @@ variable "base_tags" {
# Fargate Agent Status
########################################
variable "fargate_check_enabled" {
- default = false
+ default = true
description = "Enable Fargate check monitor"
type = bool
}
@@ -52,6 +52,12 @@ variable "fargate_check_threshold_warning" {
type = number
}
+variable "fargate_check_use_message" {
+ description = "Whether to use the query alert base message for Fargate check monitor"
+ type = bool
+ default = false
+}
+
########################################
# Fargate Task CPU Utilization
########################################
@@ -85,6 +91,12 @@ variable "cpu_utilization_threshold_warning" {
type = number
}
+variable "cpu_utilization_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization monitor"
+ type = bool
+ default = false
+}
+
########################################
# CPU Utilization (anomaly detection)
########################################
@@ -148,6 +160,12 @@ variable "cpu_utilization_anomaly_threshold_warning" {
type = number
}
+variable "cpu_utilization_anomaly_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization anomaly monitor"
+ type = bool
+ default = false
+}
+
########################################
# Fargate Task Memory Reservation
########################################
@@ -179,3 +197,9 @@ variable "memory_utilization_threshold_warning" {
description = "Warning threshold (percent)"
type = number
}
+
+variable "memory_utilization_use_message" {
+ description = "Whether to use the query alert base message for memory utilization monitor"
+ type = bool
+ default = false
+}
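Review note: each new `*_use_message` variable defaults to `false`, and `main.tf` then sets `message = ""` for that monitor. A caller who wants the base message could opt back in along these lines. This is a sketch only: the `source` path and the notification handle are placeholders, and `notify_default` is assumed to be required here as it is in the ecs-service README.

```hcl
module "ecs_fargate_monitors" {
  # Hypothetical source path; point this at wherever the module lives.
  source = "./aws/ecs-fargate"

  # Assumed required input; the handle below is a placeholder.
  notify_default = ["@example-team-alerts"]

  # New toggles added in this change. Each defaults to false,
  # which makes the corresponding monitor's message an empty string.
  fargate_check_use_message           = true
  cpu_utilization_use_message         = true
  cpu_utilization_anomaly_use_message = true
  memory_utilization_use_message      = true
}
```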
diff --git a/aws/ecs-service/README.md b/aws/ecs-service/README.md
index f11e074..c7db7ba 100644
--- a/aws/ecs-service/README.md
+++ b/aws/ecs-service/README.md
@@ -19,7 +19,7 @@ Configures the following for ECS services based on tag matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -51,38 +51,45 @@ No modules.
| [cpu\_utilization\_anomaly\_recovery\_window](#input\_cpu\_utilization\_anomaly\_recovery\_window) | Recovery window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no |
| [cpu\_utilization\_anomaly\_rollup](#input\_cpu\_utilization\_anomaly\_rollup) | Rollup interval (must be sized based on evaluation window/span and seasonality) | `number` | `60` | no |
| [cpu\_utilization\_anomaly\_seasonality](#input\_cpu\_utilization\_anomaly\_seasonality) | Seasonality (hourly, daily, weekly) | `string` | `"weekly"` | no |
-| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no |
+| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `0.75` | no |
| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no |
-| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable Fargate task CPU utilization monitor | `bool` | `false` | no |
+| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no |
+| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable Fargate task CPU utilization monitor | `bool` | `true` | no |
| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `string` | `90` | no |
| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [memory\_utilization\_enabled](#input\_memory\_utilization\_enabled) | Enable Fargate task memory utilization monitor | `bool` | `false` | no |
| [memory\_utilization\_evaluation\_window](#input\_memory\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no |
| [memory\_utilization\_no\_data\_window](#input\_memory\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [memory\_utilization\_threshold\_critical](#input\_memory\_utilization\_threshold\_critical) | Critical threshold (percent) | `string` | `0.9` | no |
| [memory\_utilization\_threshold\_warning](#input\_memory\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `0.8` | no |
+| [memory\_utilization\_use\_message](#input\_memory\_utilization\_use\_message) | Whether to use the query alert base message for memory utilization monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
| [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
-| [running\_tasks\_enabled](#input\_running\_tasks\_enabled) | Enable running tasks monitor | `bool` | `false` | no |
+| [running\_tasks\_enabled](#input\_running\_tasks\_enabled) | Enable running tasks monitor | `bool` | `true` | no |
| [running\_tasks\_evaluation\_window](#input\_running\_tasks\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [running\_tasks\_no\_data\_window](#input\_running\_tasks\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [running\_tasks\_threshold\_critical](#input\_running\_tasks\_threshold\_critical) | Critical threshold (percentage) | `number` | `0.25` | no |
+| [running\_tasks\_threshold\_critical](#input\_running\_tasks\_threshold\_critical) | Critical threshold (percentage) | `number` | `0.5` | no |
| [running\_tasks\_threshold\_warning](#input\_running\_tasks\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
+| [running\_tasks\_use\_message](#input\_running\_tasks\_use\_message) | Whether to use the query alert base message for running tasks monitor | `bool` | `true` | no |
| [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
diff --git a/aws/ecs-service/main.tf b/aws/ecs-service/main.tf
index 0365e9b..677893b 100644
--- a/aws/ecs-service/main.tf
+++ b/aws/ecs-service/main.tf
@@ -5,7 +5,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -13,8 +13,8 @@ resource "datadog_monitor" "running_tasks" {
count = var.running_tasks_enabled ? 1 : 0
name = join("", [local.title_prefix, "ECS service failed tasks - {{servicename.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.running_tasks_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -28,8 +28,8 @@ resource "datadog_monitor" "running_tasks" {
query = <= ${var.cpu_utilization_threshold_critical}
END
@@ -72,8 +72,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
count = var.cpu_utilization_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "ECS service CPU utilization anomalous activity - {{servicename.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -87,7 +87,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
query = <= ${var.cpu_utilization_anomaly_threshold_critical}
@@ -108,8 +108,8 @@ resource "datadog_monitor" "memory_utilization" {
count = var.memory_utilization_enabled ? 1 : 0
name = join("", [local.title_prefix, "ECS Service memory utilization - {{servicename.name}} - {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.memory_utilization_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -123,7 +123,7 @@ resource "datadog_monitor" "memory_utilization" {
query = <= ${var.memory_utilization_threshold_critical}
END
diff --git a/aws/ecs-service/variables.tf b/aws/ecs-service/variables.tf
index ba8fd6e..0c7baef 100644
--- a/aws/ecs-service/variables.tf
+++ b/aws/ecs-service/variables.tf
@@ -17,7 +17,7 @@ variable "base_tags" {
# ECS service running tasks
########################################
variable "running_tasks_enabled" {
- default = false
+ default = true
description = "Enable running tasks monitor"
type = bool
}
@@ -35,7 +35,7 @@ variable "running_tasks_no_data_window" {
}
variable "running_tasks_threshold_critical" {
- default = 0.25
+ default = 0.50
description = "Critical threshold (percentage)"
type = number
}
@@ -46,11 +46,17 @@ variable "running_tasks_threshold_warning" {
type = number
}
+variable "running_tasks_use_message" {
+ description = "Whether to use the query alert base message for running tasks monitor"
+ type = bool
+ default = true
+}
+
########################################
# Service CPU Utilization
########################################
variable "cpu_utilization_enabled" {
- default = false
+ default = true
description = "Enable Fargate task CPU utilization monitor"
type = bool
}
@@ -79,6 +85,12 @@ variable "cpu_utilization_threshold_warning" {
type = number
}
+variable "cpu_utilization_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization monitor"
+ type = bool
+ default = false
+}
+
########################################
# CPU Utilization (anomaly detection)
########################################
@@ -131,7 +143,7 @@ variable "cpu_utilization_anomaly_trigger_window" {
}
variable "cpu_utilization_anomaly_threshold_critical" {
- default = null
+ default = 0.75
description = "Critical threshold (percent)"
type = number
}
@@ -142,6 +154,13 @@ variable "cpu_utilization_anomaly_threshold_warning" {
type = number
}
+
+variable "cpu_utilization_anomaly_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization anomaly monitor"
+ type = bool
+ default = false
+}
+
########################################
# Service Memory Reservation
########################################
@@ -173,3 +192,9 @@ variable "memory_utilization_threshold_warning" {
description = "Warning threshold (percent)"
type = number
}
+
+variable "memory_utilization_use_message" {
+ description = "Whether to use the query alert base message for memory utilization monitor"
+ type = bool
+ default = false
+}
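Review note: this diff changes several ecs-service defaults (`running_tasks_enabled` and `cpu_utilization_enabled` become `true`, `running_tasks_threshold_critical` goes from `0.25` to `0.50`, and `cpu_utilization_anomaly_threshold_critical` goes from `null` to `0.75`). A caller who wants the previous behavior could pin the old values explicitly. The sketch below assumes a hypothetical `source` path and a placeholder notification handle; the variable names and old values are taken from the hunks above.

```hcl
module "ecs_service_monitors" {
  # Hypothetical source path; adjust to your layout.
  source = "./aws/ecs-service"

  # Required per the README table; the handle is a placeholder.
  notify_default = ["@example-team-alerts"]

  # Pin the defaults that were in place before this change.
  running_tasks_enabled                      = false
  running_tasks_threshold_critical           = 0.25
  cpu_utilization_enabled                    = false
  cpu_utilization_anomaly_threshold_critical = null
}
```

Note that `include_tags` is hard-coded in `main.tf`, so its switch from `true` to `false` cannot be reverted from the caller's side this way.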
diff --git a/aws/elasticache/README.md b/aws/elasticache/README.md
index 55933f8..67890f6 100644
--- a/aws/elasticache/README.md
+++ b/aws/elasticache/README.md
@@ -24,7 +24,7 @@ Configures the following for ElastiCache clusters based on tag matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -62,42 +62,51 @@ No modules.
| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no |
+| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no |
| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable CPU utilization monitor | `bool` | `false` | no |
| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
| [evictions\_enabled](#input\_evictions\_enabled) | Enable eviction rate monitor | `bool` | `false` | no |
| [evictions\_evaluation\_window](#input\_evictions\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [evictions\_no\_data\_window](#input\_evictions\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [evictions\_threshold\_critical](#input\_evictions\_threshold\_critical) | Critical threshold (count) | `number` | `null` | no |
| [evictions\_threshold\_warning](#input\_evictions\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no |
+| [evictions\_use\_message](#input\_evictions\_use\_message) | Whether to use the query alert base message for evictions monitor | `bool` | `false` | no |
| [hit\_rate\_anomaly\_deviations](#input\_hit\_rate\_anomaly\_deviations) | Standard deviations | `number` | `2` | no |
| [hit\_rate\_anomaly\_enabled](#input\_hit\_rate\_anomaly\_enabled) | Enable cache hit rate anomaly monitor | `bool` | `false` | no |
| [hit\_rate\_anomaly\_evaluation\_window](#input\_hit\_rate\_anomaly\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no |
| [hit\_rate\_anomaly\_no\_data\_window](#input\_hit\_rate\_anomaly\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [hit\_rate\_anomaly\_seasonality](#input\_hit\_rate\_anomaly\_seasonality) | Seasonality (hourly, daily, weekly) | `string` | `"daily"` | no |
| [hit\_rate\_anomaly\_threshold\_critical](#input\_hit\_rate\_anomaly\_threshold\_critical) | Critical threshold (percentage) | `number` | `null` | no |
+| [hit\_rate\_anomaly\_use\_message](#input\_hit\_rate\_anomaly\_use\_message) | Whether to use the query alert base message for hit rate anomaly monitor | `bool` | `false` | no |
| [hit\_rate\_enabled](#input\_hit\_rate\_enabled) | Enable cache hit rate monitor | `bool` | `false` | no |
| [hit\_rate\_evaluation\_window](#input\_hit\_rate\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [hit\_rate\_no\_data\_window](#input\_hit\_rate\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [hit\_rate\_threshold\_critical](#input\_hit\_rate\_threshold\_critical) | Critical threshold (percentage) | `number` | `null` | no |
| [hit\_rate\_threshold\_warning](#input\_hit\_rate\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
+| [hit\_rate\_use\_message](#input\_hit\_rate\_use\_message) | Whether to use the query alert base message for hit rate monitor | `bool` | `false` | no |
| [max\_connections\_enabled](#input\_max\_connections\_enabled) | Enable max connections monitor | `bool` | `false` | no |
| [max\_connections\_evaluation\_window](#input\_max\_connections\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [max\_connections\_no\_data\_window](#input\_max\_connections\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [max\_connections\_threshold\_critical](#input\_max\_connections\_threshold\_critical) | Critical threshold (connections) | `number` | `64000` | no |
| [max\_connections\_threshold\_warning](#input\_max\_connections\_threshold\_warning) | Warning threshold (connections) | `number` | `60000` | no |
+| [max\_connections\_use\_message](#input\_max\_connections\_use\_message) | Whether to use the query alert base message for max connections monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
@@ -108,6 +117,7 @@ No modules.
| [swap\_usage\_no\_data\_window](#input\_swap\_usage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [swap\_usage\_threshold\_critical](#input\_swap\_usage\_threshold\_critical) | Critical threshold (bytes) | `number` | `52428800` | no |
| [swap\_usage\_threshold\_warning](#input\_swap\_usage\_threshold\_warning) | Warning threshold (bytes) | `number` | `null` | no |
+| [swap\_usage\_use\_message](#input\_swap\_usage\_use\_message) | Whether to use the query alert base message for swap usage monitor | `bool` | `false` | no |
| [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
| [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |
diff --git a/aws/elasticache/main.tf b/aws/elasticache/main.tf
index 3f7c8a5..2ad69b1 100644
--- a/aws/elasticache/main.tf
+++ b/aws/elasticache/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -12,8 +12,8 @@ resource "datadog_monitor" "cpu_utilization" {
count = var.cpu_utilization_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache CPU Utilization - {{cacheclusterid.name}} - {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,7 +27,7 @@ resource "datadog_monitor" "cpu_utilization" {
query = <= ${var.cpu_utilization_threshold_critical}
END
@@ -41,8 +41,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
count = var.cpu_utilization_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache CPU utilization anomalous activity - {{cacheclusterid.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -56,7 +56,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
query = <= ${var.cpu_utilization_anomaly_threshold_critical}
@@ -71,8 +71,8 @@ resource "datadog_monitor" "evictions" {
count = var.evictions_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache evictions - {{cacheclusterid.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.evictions_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -86,7 +86,7 @@ resource "datadog_monitor" "evictions" {
query = <= ${var.evictions_threshold_critical}
END
@@ -100,8 +100,8 @@ resource "datadog_monitor" "hit_rate" {
count = var.hit_rate_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache cache hit rate - {{cacheclusterid.name}} - {{value}}% ", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.hit_rate_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -115,7 +115,7 @@ resource "datadog_monitor" "hit_rate" {
query = <= ${var.hit_rate_threshold_critical}
END
@@ -129,8 +129,8 @@ resource "datadog_monitor" "hit_rate_anomaly" {
count = var.hit_rate_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache cache hit rate anomalous activity - {{cacheclusterid.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.hit_rate_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -144,7 +144,7 @@ resource "datadog_monitor" "hit_rate_anomaly" {
query = <= ${var.hit_rate_anomaly_threshold_critical}
@@ -159,8 +159,8 @@ resource "datadog_monitor" "max_connections" {
count = var.max_connections_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache max connections reached - {{cacheclusterid.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.max_connections_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -174,7 +174,7 @@ resource "datadog_monitor" "max_connections" {
query = <= ${var.max_connections_threshold_critical}
END
@@ -188,8 +188,8 @@ resource "datadog_monitor" "swap_usage" {
count = var.swap_usage_enabled ? 1 : 0
name = join("", [local.title_prefix, "Elasticache swap usage - {{cacheclusterid.name}} - {{value}}MB", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.swap_usage_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -203,7 +203,7 @@ resource "datadog_monitor" "swap_usage" {
query = <
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -45,12 +45,14 @@ No modules.
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:elasticsearch"]` | no |
-| [cluster\_health\_red\_enabled](#input\_cluster\_health\_red\_enabled) | Enable cluster health\_red monitor | `bool` | `false` | no |
+| [cluster\_health\_red\_enabled](#input\_cluster\_health\_red\_enabled) | Enable cluster health\_red monitor | `bool` | `true` | no |
| [cluster\_health\_red\_evaluation\_window](#input\_cluster\_health\_red\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [cluster\_health\_red\_no\_data\_window](#input\_cluster\_health\_red\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [cluster\_health\_yellow\_enabled](#input\_cluster\_health\_yellow\_enabled) | Enable cluster health monitor | `bool` | `false` | no |
+| [cluster\_health\_red\_use\_message](#input\_cluster\_health\_red\_use\_message) | Whether to use the query alert base message for cluster health red monitor | `bool` | `true` | no |
+| [cluster\_health\_yellow\_enabled](#input\_cluster\_health\_yellow\_enabled) | Enable cluster health monitor | `bool` | `true` | no |
| [cluster\_health\_yellow\_evaluation\_window](#input\_cluster\_health\_yellow\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [cluster\_health\_yellow\_no\_data\_window](#input\_cluster\_health\_yellow\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [cluster\_health\_yellow\_use\_message](#input\_cluster\_health\_yellow\_use\_message) | Whether to use the query alert base message for cluster health yellow monitor | `bool` | `false` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [cpu\_utilization\_anomaly\_deviations](#input\_cpu\_utilization\_anomaly\_deviations) | Standard deviations | `number` | `4` | no |
| [cpu\_utilization\_anomaly\_enabled](#input\_cpu\_utilization\_anomaly\_enabled) | Enable CPU utilization anomaly monitor | `bool` | `false` | no |
@@ -62,26 +64,32 @@ No modules.
| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no |
+| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no |
| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable CPU utilization monitor | `bool` | `false` | no |
| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `0.9` | no |
| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `0.8` | no |
+| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
-| [free\_storage\_enabled](#input\_free\_storage\_enabled) | Enable free storage monitor | `bool` | `false` | no |
+| [free\_storage\_enabled](#input\_free\_storage\_enabled) | Enable free storage monitor | `bool` | `true` | no |
| [free\_storage\_evaluation\_window](#input\_free\_storage\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [free\_storage\_no\_data\_window](#input\_free\_storage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [free\_storage\_threshold\_critical](#input\_free\_storage\_threshold\_critical) | Critical threshold (GB) | `number` | `null` | no |
-| [free\_storage\_threshold\_warning](#input\_free\_storage\_threshold\_warning) | Warning threshold (GB) | `number` | `null` | no |
+| [free\_storage\_threshold\_critical](#input\_free\_storage\_threshold\_critical) | Critical threshold for used disk space (%) | `number` | `90` | no |
+| [free\_storage\_threshold\_warning](#input\_free\_storage\_threshold\_warning) | Warning threshold for used disk space (%) | `number` | `80` | no |
+| [free\_storage\_use\_message](#input\_free\_storage\_use\_message) | Whether to use the query alert base message for free storage monitor | `bool` | `true` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
diff --git a/aws/elasticsearch/main.tf b/aws/elasticsearch/main.tf
index 632e503..479754c 100644
--- a/aws/elasticsearch/main.tf
+++ b/aws/elasticsearch/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -12,8 +12,8 @@ resource "datadog_monitor" "cluster_health_red" {
count = var.cluster_health_red_enabled ? 1 : 0
name = join("", [local.title_prefix, "ElasticSearch cluster health red - {{name.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cluster_health_red_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,7 +27,7 @@ resource "datadog_monitor" "cluster_health_red" {
query = <= 1
END
@@ -40,8 +40,8 @@ resource "datadog_monitor" "cluster_health_yellow" {
count = var.cluster_health_yellow_enabled ? 1 : 0
name = join("", [local.title_prefix, "ElasticSearch cluster health yellow - {{name.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cluster_health_yellow_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -55,7 +55,7 @@ resource "datadog_monitor" "cluster_health_yellow" {
query = <= 1
END
@@ -68,8 +68,8 @@ resource "datadog_monitor" "cpu_utilization" {
count = var.cpu_utilization_enabled ? 1 : 0
name = join("", [local.title_prefix, "ElasticSearch CPU Utilization - {{name.name}} - {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -83,7 +83,7 @@ resource "datadog_monitor" "cpu_utilization" {
query = <= ${var.cpu_utilization_threshold_critical}
END
@@ -97,8 +97,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
count = var.cpu_utilization_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "ElasticSearch CPU utilization anomalous activity - {{name.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -133,8 +133,8 @@ resource "datadog_monitor" "free_storage" {
count = var.free_storage_enabled ? 1 : 0
name = join("", [local.title_prefix, "ElasticSearch cluster storage - {{name.name}} - {{value}}% used", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.free_storage_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -148,9 +148,9 @@ resource "datadog_monitor" "free_storage" {
query = < ${var.free_storage_threshold_critical}
EOQ
diff --git a/aws/elasticsearch/variables.tf b/aws/elasticsearch/variables.tf
index 971cdd4..d251705 100644
--- a/aws/elasticsearch/variables.tf
+++ b/aws/elasticsearch/variables.tf
@@ -17,7 +17,7 @@ variable "base_tags" {
# ElasticSearch cluster health (red)
########################################
variable "cluster_health_red_enabled" {
- default = false
+ default = true
description = "Enable cluster health_red monitor"
type = bool
}
@@ -34,11 +34,17 @@ variable "cluster_health_red_no_data_window" {
type = number
}
+variable "cluster_health_red_use_message" {
+ description = "Whether to use the query alert base message for cluster health red monitor"
+ type = bool
+ default = true
+}
+
#######################################
# ElasticSearch cluster health (yellow)
########################################
variable "cluster_health_yellow_enabled" {
- default = false
+ default = true
description = "Enable cluster health monitor"
type = bool
}
@@ -55,11 +61,17 @@ variable "cluster_health_yellow_no_data_window" {
type = number
}
+variable "cluster_health_yellow_use_message" {
+ description = "Whether to use the query alert base message for cluster health yellow monitor"
+ type = bool
+ default = false
+}
+
########################################
# Node CPU Utilization
########################################
variable "cpu_utilization_enabled" {
- default = false
+ default = true
description = "Enable CPU utilization monitor"
type = bool
}
@@ -88,6 +100,12 @@ variable "cpu_utilization_threshold_warning" {
type = number
}
+variable "cpu_utilization_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization monitor"
+ type = bool
+ default = false
+}
+
########################################
# CPU Utilization (anomaly detection)
########################################
@@ -151,6 +169,12 @@ variable "cpu_utilization_anomaly_threshold_warning" {
type = number
}
+variable "cpu_utilization_anomaly_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization anomaly monitor"
+ type = bool
+ default = false
+}
+
########################################
# ElasticSearch cluster free storage
########################################
@@ -173,13 +197,19 @@ variable "free_storage_evaluation_window" {
}
variable "free_storage_threshold_critical" {
- default = null
- description = "Critical threshold (GB)"
+ default = 90
+ description = "Critical threshold for used disk space (%)"
type = number
}
variable "free_storage_threshold_warning" {
- default = null
- description = "Warning threshold (GB)"
+ default = 80
+ description = "Warning threshold for used disk space (%)"
type = number
}
+
+variable "free_storage_use_message" {
+ description = "Whether to use the query alert base message for free storage monitor"
+ type = bool
+ default = true
+}
diff --git a/aws/elb/README.md b/aws/elb/README.md
index 9063d12..a0edca2 100644
--- a/aws/elb/README.md
+++ b/aws/elb/README.md
@@ -20,7 +20,7 @@ Configures the following for Classic ELBs based on tag matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -30,8 +30,8 @@ No modules.
| Name | Type |
|------|------|
+| [datadog_monitor.http_5xx_backend_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
@@ -43,37 +43,45 @@ No modules.
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
-| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | [
"resource:alb"
]
| no |
+| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | [
"resource:lb"
]
| no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
+| [http\_5xx\_backend\_responses\_enabled](#input\_http\_5xx\_backend\_responses\_enabled) | Enable HTTP 5xx response monitor (backend) | `bool` | `false` | no |
+| [http\_5xx\_backend\_responses\_evaluation\_window](#input\_http\_5xx\_backend\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [http\_5xx\_backend\_responses\_no\_data\_window](#input\_http\_5xx\_backend\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [http\_5xx\_backend\_responses\_threshold\_critical](#input\_http\_5xx\_backend\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
+| [http\_5xx\_backend\_responses\_threshold\_warning](#input\_http\_5xx\_backend\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [http\_5xx\_backend\_responses\_use\_message](#input\_http\_5xx\_backend\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx backend responses monitor | `bool` | `false` | no |
| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
-| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx responses monitor | `bool` | `false` | no |
| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message for latency monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no |
| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no |
+| [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Critical threshold (percentage) | `number` | `0` | no |
+| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
+| [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message for no healthy instances monitor | `bool` | `true` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
diff --git a/aws/elb/main.tf b/aws/elb/main.tf
index 182c7e2..dfce887 100644
--- a/aws/elb/main.tf
+++ b/aws/elb/main.tf
@@ -4,16 +4,16 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
resource "datadog_monitor" "http_5xx_responses" {
count = var.http_5xx_responses_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ELB 5xx Responses - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ELB 5xx Responses - {{loadbalancername.name}}", local.title_suffix])
+ include_tags = false
+ message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" {
query = < ${var.http_5xx_responses_threshold_critical}
END
@@ -41,9 +41,9 @@ END
resource "datadog_monitor" "http_5xx_backend_responses" {
count = var.http_5xx_backend_responses_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ELB Backend 5xx Responses - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ELB Backend 5xx Responses - {{loadbalancername.name}}", local.title_suffix])
+ include_tags = false
+ message = var.http_5xx_backend_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_backend_responses" {
query = < ${var.http_5xx_backend_responses_threshold_critical}
END
@@ -72,9 +72,9 @@ END
resource "datadog_monitor" "latency" {
count = var.latency_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ELB backend latency - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ELB backend latency - {{loadbalancername.name}}", local.title_suffix])
+ include_tags = false
+ message = var.latency_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" {
query = < ${var.latency_threshold_critical}
END
@@ -101,9 +101,9 @@ END
resource "datadog_monitor" "no_healthy_instances" {
count = var.no_healthy_instances_enabled ? 1 : 0
- name = join("", [local.title_prefix, "ALB healthy instances - {{host.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ELB healthy instances - {{loadbalancername.name}}", local.title_suffix])
+ include_tags = false
+ message = var.no_healthy_instances_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" {
query = <
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -33,10 +33,13 @@ No modules.
| Name | Type |
|------|------|
-| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.cold_starts](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.error_rate](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.iterator_age](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.iterator_age_forecast](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.out_of_memory](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.throttle_rate](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.timeouts](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
## Inputs
@@ -46,44 +49,68 @@ No modules.
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
-| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | [
"resource:alb"
]
| no |
+| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | [
"resource:lambda"
]
| no |
+| [cold\_starts\_enabled](#input\_cold\_starts\_enabled) | Enable cold starts monitor (requires enhanced metrics) | `bool` | `false` | no |
+| [cold\_starts\_evaluation\_window](#input\_cold\_starts\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_4h"` | no |
+| [cold\_starts\_no\_data\_window](#input\_cold\_starts\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no |
+| [cold\_starts\_threshold\_critical](#input\_cold\_starts\_threshold\_critical) | Critical threshold (count) | `number` | `null` | no |
+| [cold\_starts\_threshold\_warning](#input\_cold\_starts\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no |
+| [cold\_starts\_use\_message](#input\_cold\_starts\_use\_message) | Whether to use the query alert base message for cold starts monitor | `bool` | `false` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
+| [error\_rate\_enabled](#input\_error\_rate\_enabled) | Enable Lambda error rate monitor | `bool` | `true` | no |
+| [error\_rate\_evaluation\_window](#input\_error\_rate\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [error\_rate\_no\_data\_window](#input\_error\_rate\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [error\_rate\_threshold\_critical](#input\_error\_rate\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
+| [error\_rate\_threshold\_warning](#input\_error\_rate\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [error\_rate\_use\_message](#input\_error\_rate\_use\_message) | Whether to use the query alert base message for error rate monitor | `bool` | `true` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
-| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
-| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
-| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
-| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
-| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| [iterator\_age\_enabled](#input\_iterator\_age\_enabled) | Enable iterator age monitor | `bool` | `false` | no |
+| [iterator\_age\_evaluation\_window](#input\_iterator\_age\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [iterator\_age\_forecast\_enabled](#input\_iterator\_age\_forecast\_enabled) | Enable iterator age forecast monitor | `bool` | `false` | no |
+| [iterator\_age\_forecast\_evaluation\_window](#input\_iterator\_age\_forecast\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1d"` | no |
+| [iterator\_age\_forecast\_no\_data\_window](#input\_iterator\_age\_forecast\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no |
+| [iterator\_age\_forecast\_use\_message](#input\_iterator\_age\_forecast\_use\_message) | Whether to use the query alert base message for iterator age forecast monitor | `bool` | `false` | no |
+| [iterator\_age\_no\_data\_window](#input\_iterator\_age\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no |
+| [iterator\_age\_threshold\_critical](#input\_iterator\_age\_threshold\_critical) | Critical threshold (milliseconds) | `number` | `86400000` | no |
+| [iterator\_age\_threshold\_warning](#input\_iterator\_age\_threshold\_warning) | Warning threshold (milliseconds) | `number` | `null` | no |
+| [iterator\_age\_use\_message](#input\_iterator\_age\_use\_message) | Whether to use the query alert base message for iterator age monitor | `bool` | `false` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
-| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no |
-| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [out\_of\_memory\_enabled](#input\_out\_of\_memory\_enabled) | Enable out of memory monitor (requires enhanced metrics) | `bool` | `true` | no |
+| [out\_of\_memory\_evaluation\_window](#input\_out\_of\_memory\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_4h"` | no |
+| [out\_of\_memory\_no\_data\_window](#input\_out\_of\_memory\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no |
+| [out\_of\_memory\_threshold\_critical](#input\_out\_of\_memory\_threshold\_critical) | Critical threshold (count) | `number` | `5` | no |
+| [out\_of\_memory\_threshold\_warning](#input\_out\_of\_memory\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no |
+| [out\_of\_memory\_use\_message](#input\_out\_of\_memory\_use\_message) | Whether to use the query alert base message for out of memory monitor | `bool` | `false` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
| [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
| [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
+| [throttle\_rate\_enabled](#input\_throttle\_rate\_enabled) | Enable Lambda throttle rate monitor | `bool` | `true` | no |
+| [throttle\_rate\_evaluation\_window](#input\_throttle\_rate\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [throttle\_rate\_no\_data\_window](#input\_throttle\_rate\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [throttle\_rate\_threshold\_critical](#input\_throttle\_rate\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
+| [throttle\_rate\_threshold\_warning](#input\_throttle\_rate\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| [throttle\_rate\_use\_message](#input\_throttle\_rate\_use\_message) | Whether to use the query alert base message for throttle rate monitor | `bool` | `false` | no |
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
+| [timeouts\_enabled](#input\_timeouts\_enabled) | Enable timeout count monitor | `bool` | `true` | no |
+| [timeouts\_evaluation\_window](#input\_timeouts\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [timeouts\_no\_data\_window](#input\_timeouts\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [timeouts\_threshold\_critical](#input\_timeouts\_threshold\_critical) | Critical threshold (count) | `number` | `75` | no |
+| [timeouts\_threshold\_warning](#input\_timeouts\_threshold\_warning) | Warning threshold (count) | `number` | `25` | no |
+| [timeouts\_use\_message](#input\_timeouts\_use\_message) | Whether to use the query alert base message for timeouts monitor | `bool` | `false` | no |
| [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |
| [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parentheses | `string` | `null` | no |
| [warn\_priority](#input\_warn\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
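
As a rough illustration of how the inputs in the table above fit together, the following sketch calls the Lambda module with a few of the new `*_use_message` flags. The module `source` path and the notification handle are assumptions made for the example, and it assumes `notify_default` is the only required input, as the table suggests.

```hcl
# Illustrative sketch only; not taken from the repository.
module "lambda_monitors" {
  source = "./aws/lambda" # hypothetical path; adjust to how the module is consumed

  # The only input the table marks as required.
  notify_default = ["@slack-example-alerts"] # hypothetical notification handle

  # Optional after this change (defaults to null, which omits the tag).
  env = "prod"

  # The error rate monitor keeps the shared base message by default (true);
  # the timeouts monitor has to opt in (default false).
  error_rate_use_message = true
  timeouts_use_message   = true

  # The out of memory monitor is enabled by default with a critical threshold of 5.
  out_of_memory_threshold_critical = 10
}
```

Any monitor whose `*_use_message` flag is left at `false` is created with an empty `message`.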
diff --git a/aws/lambda/main.tf b/aws/lambda/main.tf
index 1eb0d13..e37a8f4 100644
--- a/aws/lambda/main.tf
+++ b/aws/lambda/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
cold_start_query_filter = local.query_filter == "{*}" ? "{cold_start:true}" : replace(local.query_filter, "{", "{cold_start:true,")
@@ -14,8 +14,8 @@ resource "datadog_monitor" "error_rate" {
count = var.error_rate_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda error rate - {{functionname.name}} - {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.error_rate_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -29,8 +29,8 @@ resource "datadog_monitor" "error_rate" {
query = < ${var.error_rate_threshold_critical}
END
@@ -44,8 +44,8 @@ resource "datadog_monitor" "timeouts" {
count = var.timeouts_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda timeouts - {{functionname.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.timeouts_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -59,8 +59,8 @@ resource "datadog_monitor" "timeouts" {
query = < ${var.timeouts_threshold_critical}
END
@@ -74,8 +74,8 @@ resource "datadog_monitor" "cold_starts" {
count = var.cold_starts_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda cold starts - {{functionname.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cold_starts_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -89,8 +89,8 @@ resource "datadog_monitor" "cold_starts" {
query = < ${var.cold_starts_threshold_critical}
END
@@ -104,8 +104,8 @@ resource "datadog_monitor" "out_of_memory" {
count = var.out_of_memory_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda out of memory - {{functionname.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.out_of_memory_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -134,8 +134,8 @@ resource "datadog_monitor" "iterator_age" {
count = var.iterator_age_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda iterator age - {{functionname.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.iterator_age_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -149,7 +149,7 @@ resource "datadog_monitor" "iterator_age" {
query = < ${var.iterator_age_threshold_critical}
END
@@ -163,8 +163,8 @@ resource "datadog_monitor" "iterator_age_forecast" {
count = var.iterator_age_forecast_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda stream data loss forecasted - {{functionname.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.iterator_age_forecast_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -191,8 +191,8 @@ resource "datadog_monitor" "throttle_rate" {
count = var.throttle_rate_enabled ? 1 : 0
name = join("", [local.title_prefix, "Lambda throttle rate - {{functionname.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.throttle_rate_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
diff --git a/aws/lambda/variables.tf b/aws/lambda/variables.tf
index 8aa64cc..4332d90 100644
--- a/aws/lambda/variables.tf
+++ b/aws/lambda/variables.tf
@@ -17,7 +17,7 @@ variable "base_tags" {
# Lambda error rate
########################################
variable "error_rate_enabled" {
- default = false
+ default = true
description = "Enable Lambda error rate monitor"
type = bool
}
@@ -46,11 +46,17 @@ variable "error_rate_threshold_warning" {
type = number
}
+variable "error_rate_use_message" {
+ description = "Whether to use the query alert base message for error rate monitor"
+ type = bool
+ default = true
+}
+
########################################
# Lambda timeouts
########################################
variable "timeouts_enabled" {
- default = false
+ default = true
description = "Enable timeout count monitor"
type = bool
}
@@ -79,6 +85,12 @@ variable "timeouts_threshold_warning" {
type = number
}
+variable "timeouts_use_message" {
+ description = "Whether to use the query alert base message for timeouts monitor"
+ type = bool
+ default = false
+}
+
########################################
# Cold start monitor
########################################
@@ -112,11 +124,17 @@ variable "cold_starts_threshold_warning" {
type = number
}
+variable "cold_starts_use_message" {
+ description = "Whether to use the query alert base message for cold starts monitor"
+ type = bool
+ default = false
+}
+
########################################
# OOM monitor
########################################
variable "out_of_memory_enabled" {
- default = false
+ default = true
description = "Enable out of memory monitor (requires enhanced metrics)"
type = bool
}
@@ -134,7 +152,7 @@ variable "out_of_memory_no_data_window" {
}
variable "out_of_memory_threshold_critical" {
- default = null
+ default = 5
description = "Critical threshold (count)"
type = number
}
@@ -145,6 +163,12 @@ variable "out_of_memory_threshold_warning" {
type = number
}
+variable "out_of_memory_use_message" {
+ description = "Whether to use the query alert base message for out of memory monitor"
+ type = bool
+ default = false
+}
+
########################################
# Iterator Age monitor
########################################
@@ -178,6 +202,12 @@ variable "iterator_age_threshold_warning" {
type = number
}
+variable "iterator_age_use_message" {
+ description = "Whether to use the query alert base message for iterator age monitor"
+ type = bool
+ default = false
+}
+
########################################
# Iterator Age forecast data loss
########################################
@@ -199,11 +229,17 @@ variable "iterator_age_forecast_no_data_window" {
type = number
}
+variable "iterator_age_forecast_use_message" {
+ description = "Whether to use the query alert base message for iterator age forecast monitor"
+ type = bool
+ default = false
+}
+
########################################
# Lambda throttle rate
########################################
variable "throttle_rate_enabled" {
- default = false
+ default = true
description = "Enable Lambda throttle rate monitor"
type = bool
}
@@ -231,3 +267,9 @@ variable "throttle_rate_threshold_warning" {
description = "Warning threshold (percentage, 0-100)"
type = number
}
+
+variable "throttle_rate_use_message" {
+ description = "Whether to use the query alert base message for throttle rate monitor"
+ type = bool
+ default = false
+}
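
This change flips several `*_enabled` defaults from `false` to `true` and makes the base message opt-in for most monitors. A caller who wants to keep the earlier behavior could pin the values explicitly, roughly as sketched below. The module `source` path and the notification handle are assumptions; the variable names come from the diff above.

```hcl
# Illustrative sketch only; pins the pre-change defaults shown in this diff.
module "lambda_monitors_legacy" {
  source = "./aws/lambda" # hypothetical path

  notify_default = ["@slack-example-alerts"] # hypothetical notification handle

  # These defaulted to false before this change.
  error_rate_enabled    = false
  timeouts_enabled      = false
  out_of_memory_enabled = false
  throttle_rate_enabled = false

  # out_of_memory_threshold_critical defaulted to null before this change.
  out_of_memory_threshold_critical = null

  # Previously every monitor used local.query_alert_base_message unconditionally;
  # error_rate_use_message already defaults to true, the others must opt in.
  timeouts_use_message              = true
  cold_starts_use_message           = true
  out_of_memory_use_message         = true
  iterator_age_use_message          = true
  iterator_age_forecast_use_message = true
  throttle_rate_use_message         = true
}
```

As far as this diff shows, `include_tags` is now hard-coded to `false` in `main.tf` and has no corresponding input, so that part of the old behavior cannot be restored from the caller side.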
diff --git a/aws/rds/README.md b/aws/rds/README.md
index cc05203..130995c 100644
--- a/aws/rds/README.md
+++ b/aws/rds/README.md
@@ -21,7 +21,7 @@ Configures the following for RDS databases based on tag matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
## Modules
@@ -31,10 +31,10 @@ No modules.
| Name | Type |
|------|------|
-| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.connection_count_anomaly](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.cpu_utilization](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.cpu_utilization_anomaly](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.used_storage](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
## Inputs
@@ -44,37 +44,49 @@ No modules.
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
-| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:alb"]` | no |
+| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:rds"]` | no |
+| [connection\_count\_anomaly\_deviations](#input\_connection\_count\_anomaly\_deviations) | Standard deviations | `number` | `3` | no |
+| [connection\_count\_anomaly\_enabled](#input\_connection\_count\_anomaly\_enabled) | Enable connection count anomaly monitor | `bool` | `true` | no |
+| [connection\_count\_anomaly\_evaluation\_window](#input\_connection\_count\_anomaly\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [connection\_count\_anomaly\_no\_data\_window](#input\_connection\_count\_anomaly\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [connection\_count\_anomaly\_recovery\_window](#input\_connection\_count\_anomaly\_recovery\_window) | Recovery window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_15m"` | no |
+| [connection\_count\_anomaly\_rollup](#input\_connection\_count\_anomaly\_rollup) | Rollup interval (must be sized based on evaluation window/span and seasonality) | `number` | `60` | no |
+| [connection\_count\_anomaly\_seasonality](#input\_connection\_count\_anomaly\_seasonality) | Seasonality (hourly, daily, weekly) | `string` | `"weekly"` | no |
+| [connection\_count\_anomaly\_threshold\_critical](#input\_connection\_count\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `0.75` | no |
+| [connection\_count\_anomaly\_threshold\_warning](#input\_connection\_count\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
+| [connection\_count\_anomaly\_trigger\_window](#input\_connection\_count\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [connection\_count\_anomaly\_use\_message](#input\_connection\_count\_anomaly\_use\_message) | Whether to use the query alert base message for connection count anomaly monitor | `bool` | `true` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
+| [cpu\_utilization\_anomaly\_deviations](#input\_cpu\_utilization\_anomaly\_deviations) | Standard deviations | `number` | `4` | no |
+| [cpu\_utilization\_anomaly\_enabled](#input\_cpu\_utilization\_anomaly\_enabled) | Enable CPU utilization anomaly monitor | `bool` | `false` | no |
+| [cpu\_utilization\_anomaly\_evaluation\_window](#input\_cpu\_utilization\_anomaly\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [cpu\_utilization\_anomaly\_no\_data\_window](#input\_cpu\_utilization\_anomaly\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [cpu\_utilization\_anomaly\_recovery\_window](#input\_cpu\_utilization\_anomaly\_recovery\_window) | Recovery window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_15m"` | no |
+| [cpu\_utilization\_anomaly\_rollup](#input\_cpu\_utilization\_anomaly\_rollup) | Rollup interval (must be sized based on evaluation window/span and seasonality) | `number` | `60` | no |
+| [cpu\_utilization\_anomaly\_seasonality](#input\_cpu\_utilization\_anomaly\_seasonality) | Seasonality (hourly, daily, weekly) | `string` | `"weekly"` | no |
+| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no |
+| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no |
+| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_1h"` | no |
+| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no |
+| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable CPU utilization monitor | `bool` | `true` | no |
+| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no |
+| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no |
+| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
-| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
-| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
-| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
-| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
-| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
-| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no |
-| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
@@ -84,6 +96,12 @@ No modules.
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
| [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |
| [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parenthesis | `string` | `null` | no |
+| [used\_storage\_enabled](#input\_used\_storage\_enabled) | Enable used storage monitor | `bool` | `true` | no |
+| [used\_storage\_evaluation\_window](#input\_used\_storage\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_15m"` | no |
+| [used\_storage\_no\_data\_window](#input\_used\_storage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [used\_storage\_threshold\_critical](#input\_used\_storage\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `90` | no |
+| [used\_storage\_threshold\_warning](#input\_used\_storage\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `80` | no |
+| [used\_storage\_use\_message](#input\_used\_storage\_use\_message) | Whether to use the query alert base message for used storage monitor | `bool` | `true` | no |
| [warn\_priority](#input\_warn\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
## Outputs
diff --git a/aws/rds/main.tf b/aws/rds/main.tf
index bbb3292..c64956c 100644
--- a/aws/rds/main.tf
+++ b/aws/rds/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -12,8 +12,8 @@ resource "datadog_monitor" "connection_count_anomaly" {
count = var.connection_count_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "RDS connection count anomalous activity - {{dbinstanceidentifier.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.connection_count_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,7 +27,7 @@ resource "datadog_monitor" "connection_count_anomaly" {
query = <= ${var.connection_count_anomaly_threshold_critical}
@@ -48,8 +48,8 @@ resource "datadog_monitor" "cpu_utilization" {
count = var.cpu_utilization_enabled ? 1 : 0
name = join("", [local.title_prefix, "RDS CPU Utilization - {{dbinstanceidentifier.name}} - {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -63,7 +63,7 @@ resource "datadog_monitor" "cpu_utilization" {
query = <= ${var.cpu_utilization_threshold_critical}
END
@@ -77,8 +77,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
count = var.cpu_utilization_anomaly_enabled ? 1 : 0
name = join("", [local.title_prefix, "RDS CPU utilization anomalous activity - {{dbinstanceidentifier.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -92,7 +92,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" {
query = <= ${var.cpu_utilization_anomaly_threshold_critical}
@@ -113,8 +113,8 @@ resource "datadog_monitor" "used_storage" {
count = var.used_storage_enabled ? 1 : 0
name = join("", [local.title_prefix, "RDS instance storage - {{dbinstanceidentifier.name}} - {{value}}% used", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.used_storage_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -129,8 +129,8 @@ resource "datadog_monitor" "used_storage" {
query = <= ${var.used_storage_threshold_critical}
END
diff --git a/aws/rds/variables.tf b/aws/rds/variables.tf
index 6e74aa4..64f2191 100644
--- a/aws/rds/variables.tf
+++ b/aws/rds/variables.tf
@@ -17,7 +17,7 @@ variable "base_tags" {
# Connection Rate (anomaly detection)
########################################
variable "connection_count_anomaly_enabled" {
- default = false
+ default = true
description = "Enable CPU utilization anomaly monitor"
type = bool
}
@@ -65,7 +65,7 @@ variable "connection_count_anomaly_trigger_window" {
}
variable "connection_count_anomaly_threshold_critical" {
- default = null
+ default = 0.75
description = "Critical threshold (percent)"
type = number
}
@@ -76,11 +76,17 @@ variable "connection_count_anomaly_threshold_warning" {
type = number
}
+variable "connection_count_anomaly_use_message" {
+ description = "Whether to use the query alert base message for connection count anomaly monitor"
+ type = bool
+ default = true
+}
+
########################################
# Node CPU Utilization
########################################
variable "cpu_utilization_enabled" {
- default = false
+ default = true
description = "Enable CPU utilization monitor"
type = bool
}
@@ -109,6 +115,12 @@ variable "cpu_utilization_threshold_warning" {
type = number
}
+variable "cpu_utilization_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization monitor"
+ type = bool
+ default = false
+}
+
########################################
# CPU Utilization (anomaly detection)
########################################
@@ -172,6 +184,12 @@ variable "cpu_utilization_anomaly_threshold_warning" {
type = number
}
+variable "cpu_utilization_anomaly_use_message" {
+ description = "Whether to use the query alert base message for CPU utilization anomaly monitor"
+ type = bool
+ default = false
+}
+
########################################
# ElasticSearch cluster used storage
########################################
@@ -204,3 +222,9 @@ variable "used_storage_threshold_warning" {
description = "Warning threshold (percentage, 0-100)"
type = number
}
+
+variable "used_storage_use_message" {
+ description = "Whether to use the query alert base message for used storage monitor"
+ type = bool
+ default = true
+}
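The new `*_use_message` variables gate each monitor's message on a boolean. A minimal sketch of how a caller might set them follows. The module source path and the notification handle are assumptions for illustration and do not appear in this diff; other inputs are omitted.

```hcl
# Hypothetical caller configuration. The source path and the
# notification handle are placeholders, not taken from this change.
module "rds_monitors" {
  source = "../../aws/rds" # assumed relative path to the module

  # Listed as a required input in the module's README table
  notify_default = ["@example-team"]

  # Keep the shared base message on the used storage monitor (default: true)
  used_storage_use_message = true

  # Create the CPU utilization monitor with an empty message (default: false)
  cpu_utilization_use_message = false
}
```

With `cpu_utilization_use_message = false`, the conditional in `main.tf` evaluates to `""`, so that monitor is created without the query alert base message.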
diff --git a/aws/sqs/.terraform.lock.hcl b/aws/sqs/.terraform.lock.hcl
index 5fa8913..f4429ee 100644
--- a/aws/sqs/.terraform.lock.hcl
+++ b/aws/sqs/.terraform.lock.hcl
@@ -5,6 +5,7 @@ provider "registry.terraform.io/datadog/datadog" {
version = "3.44.0"
constraints = ">= 3.37.0"
hashes = [
+ "h1:gapxzCRcnTGm4HLO1zuoelGC15+0LEYceGNWGh69JLE=",
"h1:neJ/si/8CotiW8ulfjU6dFmb1bpzbTjhfHLTlCvdynw=",
"zh:12119fe0cafbe7e05c32d4101a804d479ae756e19512c789c67cb3c51420ac98",
"zh:35267ecc27de00e449893df9a37481f38b8fe24d14fe94198cd68966f1aa586f",
@@ -27,6 +28,7 @@ provider "registry.terraform.io/hashicorp/null" {
version = "3.2.2"
constraints = ">= 3.1.0"
hashes = [
+ "h1:IMVAUHKoydFrlPrl9OzasDnw/8ntZFerCC9iXw1rXQY=",
"h1:vWAsYRd7MjYr3adj8BVKRohVfHpWQdvkIwUQ2Jf5FVM=",
"zh:3248aae6a2198f3ec8394218d05bd5e42be59f43a3a7c0b71c66ec0df08b69e7",
"zh:32b1aaa1c3013d33c245493f4a65465eab9436b454d250102729321a44c8ab9a",
diff --git a/aws/sqs/README.md b/aws/sqs/README.md
index 78b8d6e..2d27fa4 100644
--- a/aws/sqs/README.md
+++ b/aws/sqs/README.md
@@ -18,7 +18,7 @@ Configures the following for Lambda functions based on tag matches:
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.44.0 |
## Modules
@@ -28,10 +28,8 @@ No modules.
| Name | Type |
|------|------|
-| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.oldest_message](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.queue_depth](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
## Inputs
@@ -41,39 +39,35 @@ No modules.
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
-| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:alb"]` | no |
+| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:queue"]` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
-| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
-| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
-| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
-| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
-| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
-| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no |
-| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [oldest\_message\_enabled](#input\_oldest\_message\_enabled) | Enable oldest queued message monitor | `bool` | `false` | no |
+| [oldest\_message\_evaluation\_window](#input\_oldest\_message\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [oldest\_message\_no\_data\_window](#input\_oldest\_message\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [oldest\_message\_threshold\_critical](#input\_oldest\_message\_threshold\_critical) | Critical threshold (seconds) | `number` | `75` | no |
+| [oldest\_message\_threshold\_warning](#input\_oldest\_message\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| [oldest\_message\_use\_message](#input\_oldest\_message\_use\_message) | Whether to use the query alert base message for oldest message monitor | `bool` | `false` | no |
+| [queue\_depth\_enabled](#input\_queue\_depth\_enabled) | Enable queue depth count monitor | `bool` | `false` | no |
+| [queue\_depth\_evaluation\_window](#input\_queue\_depth\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [queue\_depth\_no\_data\_window](#input\_queue\_depth\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
+| [queue\_depth\_threshold\_critical](#input\_queue\_depth\_threshold\_critical) | Critical threshold (count) | `number` | `null` | no |
+| [queue\_depth\_threshold\_warning](#input\_queue\_depth\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no |
+| [queue\_depth\_use\_message](#input\_queue\_depth\_use\_message) | Whether to use the query alert base message for queue depth monitor | `bool` | `false` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
| [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
| [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
diff --git a/aws/sqs/main.tf b/aws/sqs/main.tf
index edbfc91..6c98447 100644
--- a/aws/sqs/main.tf
+++ b/aws/sqs/main.tf
@@ -4,7 +4,7 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null
- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}
@@ -12,8 +12,8 @@ resource "datadog_monitor" "oldest_message" {
count = var.oldest_message_enabled ? 1 : 0
name = join("", [local.title_prefix, "Oldest queued message - {{queuename.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.oldest_message_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,7 +27,7 @@ resource "datadog_monitor" "oldest_message" {
query = < ${var.oldest_message_threshold_critical}
END
@@ -41,8 +41,8 @@ resource "datadog_monitor" "queue_depth" {
count = var.queue_depth_enabled ? 1 : 0
name = join("", [local.title_prefix, "Queue depth - {{queuename.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.queue_depth_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -56,7 +56,7 @@ resource "datadog_monitor" "queue_depth" {
query = < ${var.queue_depth_threshold_critical}
END
diff --git a/aws/sqs/variables.tf b/aws/sqs/variables.tf
index 0a4b1c5..4bb5de0 100644
--- a/aws/sqs/variables.tf
+++ b/aws/sqs/variables.tf
@@ -46,6 +46,12 @@ variable "oldest_message_threshold_warning" {
type = number
}
+variable "oldest_message_use_message" {
+ description = "Whether to use the query alert base message for oldest message monitor"
+ type = bool
+ default = false
+}
+
########################################
# Lambda queue_depth
########################################
@@ -78,3 +84,9 @@ variable "queue_depth_threshold_warning" {
description = "Warning threshold (count)"
type = number
}
+
+variable "queue_depth_use_message" {
+ description = "Whether to use the query alert base message for queue depth monitor"
+ type = bool
+ default = false
+}
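Both SQS monitors are disabled by default, and `queue_depth_threshold_critical` defaults to `null`, so a caller has to opt in and supply a threshold. The sketch below shows one way that could look. The source path, notification handle, and threshold value are illustrative assumptions, not values from this change.

```hcl
# Hypothetical caller configuration for the SQS module.
# Path, handle, and threshold are placeholders for illustration only.
module "sqs_monitors" {
  source = "../../aws/sqs" # assumed relative path to the module

  # Listed as a required input in the module's README table
  notify_default = ["@example-team"]

  # Opt in to the queue depth monitor (default: false)
  queue_depth_enabled = true

  # No default is provided, so pick a count that suits the queue
  queue_depth_threshold_critical = 1000

  # Include the shared base message in notifications (default: false)
  queue_depth_use_message = true
}
```

Unlike the RDS `used_storage_use_message` variable, the SQS `*_use_message` variables default to `false`, so monitors enabled without this flag are created with an empty message.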
diff --git a/aws/vpn/.terraform.lock.hcl b/aws/vpn/.terraform.lock.hcl
index 5fa8913..f4429ee 100644
--- a/aws/vpn/.terraform.lock.hcl
+++ b/aws/vpn/.terraform.lock.hcl
@@ -5,6 +5,7 @@ provider "registry.terraform.io/datadog/datadog" {
version = "3.44.0"
constraints = ">= 3.37.0"
hashes = [
+ "h1:gapxzCRcnTGm4HLO1zuoelGC15+0LEYceGNWGh69JLE=",
"h1:neJ/si/8CotiW8ulfjU6dFmb1bpzbTjhfHLTlCvdynw=",
"zh:12119fe0cafbe7e05c32d4101a804d479ae756e19512c789c67cb3c51420ac98",
"zh:35267ecc27de00e449893df9a37481f38b8fe24d14fe94198cd68966f1aa586f",
@@ -27,6 +28,7 @@ provider "registry.terraform.io/hashicorp/null" {
version = "3.2.2"
constraints = ">= 3.1.0"
hashes = [
+ "h1:IMVAUHKoydFrlPrl9OzasDnw/8ntZFerCC9iXw1rXQY=",
"h1:vWAsYRd7MjYr3adj8BVKRohVfHpWQdvkIwUQ2Jf5FVM=",
"zh:3248aae6a2198f3ec8394218d05bd5e42be59f43a3a7c0b71c66ec0df08b69e7",
"zh:32b1aaa1c3013d33c245493f4a65465eab9436b454d250102729321a44c8ab9a",
diff --git a/aws/vpn/README.md b/aws/vpn/README.md
index 06a3bb5..662a44a 100644
--- a/aws/vpn/README.md
+++ b/aws/vpn/README.md
@@ -15,7 +15,7 @@ Configures up/down monitoring for VPN tunnels
| Name | Version |
|------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.44.0 |
## Modules
@@ -25,10 +25,7 @@ No modules.
| Name | Type |
|------|------|
-| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
-| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.tunnel_state](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
## Inputs
@@ -38,37 +35,21 @@ No modules.
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
| [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
| [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
-| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:alb"]` | no |
+| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | `["resource:vpn"]` | no |
| [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
-| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
-| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
-| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
-| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
-| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
-| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
-| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
| [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
-| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no |
-| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
-| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
-| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no |
| [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
@@ -78,6 +59,9 @@ No modules.
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
| [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |
| [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parenthesis | `string` | `null` | no |
+| [tunnel\_state\_enabled](#input\_tunnel\_state\_enabled) | Enable VPN tunnel state monitor | `bool` | `false` | no |
+| [tunnel\_state\_evaluation\_window](#input\_tunnel\_state\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no |
+| [tunnel\_state\_no\_data\_window](#input\_tunnel\_state\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| [warn\_priority](#input\_warn\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
## Outputs
diff --git a/aws/vpn/main.tf b/aws/vpn/main.tf
index 304e91b..bd4df6a 100644
--- a/aws/vpn/main.tf
+++ b/aws/vpn/main.tf
@@ -12,7 +12,7 @@ resource "datadog_monitor" "tunnel_state" {
count = var.tunnel_state_enabled ? 1 : 0
name = join("", [local.title_prefix, "VPN tunnel state - {{host.name}}", local.title_suffix])
- include_tags = true
+ include_tags = false
message = local.query_alert_base_message
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"
@@ -27,7 +27,7 @@ resource "datadog_monitor" "tunnel_state" {
query = <