diff --git a/README.md b/README.md
index 62593d8..80c209f 100644
--- a/README.md
+++ b/README.md
@@ -27,3 +27,29 @@ module "monitor" {
 ```
 
 ## About
+
+
+## Requirements
+
+No requirements.
+
+## Providers
+
+No providers.
+
+## Modules
+
+No modules.
+
+## Resources
+
+No resources.
+
+## Inputs
+
+No inputs.
+
+## Outputs
+
+No outputs.
+
diff --git a/aws/alb/README.md b/aws/alb/README.md
index ec3899f..039a2bb 100644
--- a/aws/alb/README.md
+++ b/aws/alb/README.md
@@ -20,7 +20,7 @@ Configures the following for ALBs based on tags matches:
 
 | Name | Version |
 |------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
 
 ## Modules
 
@@ -46,23 +46,26 @@ No modules.
 | [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
[
"resource:alb"
]
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | | [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | | [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no | | [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no | | [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor 
(`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | | [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [http\_5xx\_tg\_responses\_use\_message](#input\_http\_5xx\_tg\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no | | [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | | [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | | [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | +| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | @@ -71,10 +74,14 @@ No modules. | [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Critical threshold (percentage) | `number` | `0` | no | | [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no | +| [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message | `bool` | `true` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses 
`notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | diff --git a/aws/alb/main.tf b/aws/alb/main.tf index 30458f7..e5449ca 100644 --- a/aws/alb/main.tf +++ b/aws/alb/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -12,8 +12,8 @@ resource "datadog_monitor" "http_5xx_responses" { count = var.http_5xx_responses_enabled ? 1 : 0 name = join("", [local.title_prefix, "ALB 5xx Responses - {{loadbalancer.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.http_5xx_responses_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" { query = < ${var.http_5xx_responses_threshold_critical} END @@ -42,8 +42,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" { count = var.http_5xx_tg_responses_enabled ? 1 : 0 name = join("", [local.title_prefix, "ALB Target Group 5xx Responses - {{loadbalancer.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.http_5xx_tg_responses_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" { query = < ${var.http_5xx_tg_responses_threshold_critical} END @@ -72,9 +72,9 @@ END resource "datadog_monitor" "latency" { count = var.latency_enabled ? 1 : 0 - name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB latency - {{value}}s ", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ALB latency - {{loadbalancer.name}} {{value}}s", local.title_suffix]) + include_tags = false + message = var.latency_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" { query = < ${var.latency_threshold_critical} END @@ -101,9 +101,9 @@ END resource "datadog_monitor" "no_healthy_instances" { count = var.no_healthy_instances_enabled ? 1 : 0 - name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB healthy instances is at {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ALB available healthy instances - {{loadbalancer.name}} {{value}}%", local.title_suffix]) + include_tags = false + message = var.no_healthy_instances_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" { query = < [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -42,25 +42,30 @@ No modules. | [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
[
"resource:apigateway"
]
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | | [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `0.75` | no | | [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `0.25` | no | +| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx responses monitor | `bool` | `false` | no | | [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | | [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), 
or `last_1d`] | `string` | `"last_5m"` | no | | [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | | [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | +| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message for the latency monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses 
`notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | diff --git a/aws/apigateway/main.tf b/aws/apigateway/main.tf index 02033c6..f624851 100644 --- a/aws/apigateway/main.tf +++ b/aws/apigateway/main.tf @@ -4,16 +4,16 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } resource "datadog_monitor" "http_5xx_responses" { count = var.http_5xx_responses_enabled ? 1 : 0 - name = join("", [local.title_prefix, "API Gateway 5xx Responses - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "API Gateway 5xx Responses - {{apiname.name}}", local.title_suffix]) + include_tags = false + message = var.http_5xx_responses_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" { query = < ${var.http_5xx_responses_threshold_critical} END @@ -41,9 +41,9 @@ END resource "datadog_monitor" "latency" { count = var.latency_enabled ? 1 : 0 - name = join("", [local.title_prefix, "API Gateway latency - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "API Gateway latency - {{apiname.name}}", local.title_suffix]) + include_tags = false + message = var.latency_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -57,7 +57,7 @@ resource "datadog_monitor" "latency" { query = < ${var.latency_threshold_critical} END diff --git a/aws/apigateway/variables.tf b/aws/apigateway/variables.tf index 14d282a..d5eb215 100644 --- a/aws/apigateway/variables.tf +++ b/aws/apigateway/variables.tf @@ -46,6 +46,12 @@ variable "http_5xx_responses_threshold_warning" { type = number } +variable "http_5xx_responses_use_message" { + description = "Whether to use the query alert base message for HTTP 5xx responses monitor" + type = bool + default = false +} + ######################################## # Latency Instances ######################################## @@ -78,3 +84,9 @@ variable "latency_threshold_warning" { description = "Warning threshold (seconds)" type = number } + +variable "latency_use_message" { + description = "Whether to use the query alert base message for the latency monitor" + type = bool + default = false +} diff --git a/aws/beanstalk/README.md b/aws/beanstalk/README.md index 007fd00..84f314b 100644 --- a/aws/beanstalk/README.md +++ b/aws/beanstalk/README.md @@ -20,7 +20,7 @@ Configures the following for Beanstalk environments based on tags matches: | Name | Version | |------|---------| -| 
[datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -46,31 +46,37 @@ No modules. | [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
[
"resource:beanstalk"
]
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [health\_enabled](#input\_health\_enabled) | Enable Beanstalk health monitor (requires enhanced metrics) | `bool` | `false` | no | | [health\_evaluation\_window](#input\_health\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`) | `string` | `"last_5m"` | no | | [health\_no\_data\_window](#input\_health\_no\_data\_window) | No date threshold (minutes) | `number` | `20` | no | | [health\_threshold\_critical](#input\_health\_threshold\_critical) | Critical threshold (
0 = OK
1 = Info
5 = Unknown
10 = No data
15 = Warning
20 = Degraded
25 = Severe
) | `number` | `25` | no | | [health\_threshold\_warning](#input\_health\_threshold\_warning) | Warning threshold (
0 = OK
1 = Info
5 = Unknown
10 = No data
15 = Warning
20 = Degraded
25 = Severe
) | `number` | `20` | no | +| [health\_use\_message](#input\_health\_use\_message) | Whether to use the query alert base message for health monitor | `bool` | `false` | no | | [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | | [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage) | `number` | `75` | no | | [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage) | `number` | `25` | no | +| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx responses monitor | `bool` | `false` | no | | [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | | [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [latency\_measurement](#input\_latency\_measurement) | Latency Measurement

Valid options:
* p10
* p50
* p75
* p85
* p90
* p95
* p99
* p99\_9 | `string` | `"p50"` | no | | [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | | [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | +| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message for latency monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` 
otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | @@ -79,6 +85,7 @@ No modules. | [root\_disk\_usage\_no\_data\_window](#input\_root\_disk\_usage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [root\_disk\_usage\_threshold\_critical](#input\_root\_disk\_usage\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | | [root\_disk\_usage\_threshold\_warning](#input\_root\_disk\_usage\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [root\_disk\_usage\_use\_message](#input\_root\_disk\_usage\_use\_message) | Whether to use the query alert base message for root disk usage monitor | `bool` | `false` | no | | [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no | | [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no | diff --git a/aws/beanstalk/main.tf b/aws/beanstalk/main.tf index f55018b..7fe3814 100644 --- a/aws/beanstalk/main.tf +++ b/aws/beanstalk/main.tf @@ -17,16 +17,16 @@ locals { latency_metric = local.latency_metric_map[var.latency_measurement] - title_prefix = 
"${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } resource "datadog_monitor" "health" { count = var.health_enabled ? 1 : 0 - name = join("", [local.title_prefix, "Beanstalk Health Events - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "Beanstalk Health Events - {{environmentname.name}}", local.title_suffix]) + include_tags = false + message = var.health_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "metric alert" @@ -40,7 +40,7 @@ resource "datadog_monitor" "health" { query = <= ${var.health_threshold_critical} END @@ -53,9 +53,9 @@ END resource "datadog_monitor" "http_5xx_responses" { count = var.http_5xx_responses_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ALB 5xx Responses - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ALB 5xx Responses - {{environmentname.name}}", local.title_suffix]) + include_tags = false + message = var.http_5xx_responses_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -69,8 +69,8 @@ resource "datadog_monitor" "http_5xx_responses" { query = < ${var.http_5xx_responses_threshold_critical} END @@ -83,9 +83,9 @@ END resource "datadog_monitor" "latency" { count = var.latency_enabled ? 1 : 0 - name = join("", [local.title_prefix, "Beanstalk Latency - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "Beanstalk Latency - {{environmentname.name}}", local.title_suffix]) + include_tags = false + message = var.latency_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -98,7 +98,7 @@ resource "datadog_monitor" "latency" { timeout_h = var.timeout_h query = <= ${var.latency_threshold_critical} END @@ -111,9 +111,9 @@ END resource "datadog_monitor" "root_disk_usage" { count = var.root_disk_usage_enabled ? 1 : 0 - name = join("", [local.title_prefix, "Beanstalk Instance Root Disk Usage - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "Beanstalk Instance Root Disk Usage - {{environmentname.name}}", local.title_suffix]) + include_tags = false + message = var.root_disk_usage_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -127,7 +127,7 @@ resource "datadog_monitor" "root_disk_usage" { query = <= ${var.root_disk_usage_threshold_critical} END diff --git a/aws/beanstalk/variables.tf b/aws/beanstalk/variables.tf index 451d74d..c537346 100644 --- a/aws/beanstalk/variables.tf +++ b/aws/beanstalk/variables.tf @@ -68,6 +68,12 @@ Warning threshold ( END } +variable "health_use_message" { + description = "Whether to use the query alert base message for health monitor" + type = bool + default = false +} + ######################################## # HTTP 5xx Responses ######################################## @@ -101,6 +107,12 @@ variable "http_5xx_responses_threshold_warning" { type = number } +variable "http_5xx_responses_use_message" { + description = "Whether to use the query alert base message for HTTP 5xx responses monitor" + type = bool + default = false +} + ######################################## # Latency Instances ######################################## @@ -153,6 +165,12 @@ variable "latency_threshold_warning" { type = number } +variable "latency_use_message" { + description = "Whether to use the query alert base message for 
latency monitor"
+  type        = bool
+  default     = false
+}
+
 ########################################
 # Root FS Disk Usage
 ########################################
@@ -185,3 +203,9 @@ variable "root_disk_usage_threshold_warning" {
   description = "Warning threshold (percent)"
   type        = number
 }
+
+variable "root_disk_usage_use_message" {
+  description = "Whether to use the query alert base message for root disk usage monitor"
+  type        = bool
+  default     = false
+}
diff --git a/aws/ec2/README.md b/aws/ec2/README.md
index 9feda50..7679e19 100644
--- a/aws/ec2/README.md
+++ b/aws/ec2/README.md
@@ -17,7 +17,7 @@ All checks are enabled by default.
 
 | Name | Version |
 |------|---------|
-| [datadog](#provider\_datadog) | >= 3.37 |
+| [datadog](#provider\_datadog) | 3.37.0 |
 
 ## Modules
 
@@ -43,40 +43,39 @@ No modules.
 | [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
[
"resource:ec2"
]
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| 
[notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | | [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no | | [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no | -| [status\_failed\_check\_enabled](#input\_status\_failed\_check\_enabled) | Enable ec2 instance status check monitor | `bool` | `false` | no | +| [status\_failed\_check\_enabled](#input\_status\_failed\_check\_enabled) | Enable ec2 instance status check monitor | `bool` | `true` | no | | [status\_failed\_check\_evaluation\_window](#input\_status\_failed\_check\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [status\_failed\_check\_no\_data\_window](#input\_status\_failed\_check\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [status\_failed\_check\_threshold\_critical](#input\_status\_failed\_check\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| 
[status\_failed\_check\_threshold\_warning](#input\_status\_failed\_check\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [status\_failed\_instance\_enabled](#input\_status\_failed\_instance\_enabled) | Enable instance status check monitor | `bool` | `false` | no | +| [status\_failed\_check\_use\_message](#input\_status\_failed\_check\_use\_message) | Whether to use the query alert base message for ec2 instance status check monitor | `bool` | `false` | no | +| [status\_failed\_instance\_enabled](#input\_status\_failed\_instance\_enabled) | Enable instance status check monitor | `bool` | `true` | no | | [status\_failed\_instance\_evaluation\_window](#input\_status\_failed\_instance\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [status\_failed\_instance\_no\_data\_window](#input\_status\_failed\_instance\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [status\_failed\_instance\_threshold\_critical](#input\_status\_failed\_instance\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [status\_failed\_instance\_threshold\_warning](#input\_status\_failed\_instance\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [status\_failed\_system\_enabled](#input\_status\_failed\_system\_enabled) | Enable instance system failure monitor | `bool` | `false` | no | +| [status\_failed\_instance\_use\_message](#input\_status\_failed\_instance\_use\_message) | Whether to use the query alert base message for instance status check monitor | `bool` | `false` | no | +| [status\_failed\_system\_enabled](#input\_status\_failed\_system\_enabled) | Enable instance system failure monitor | `bool` | `true` | no | | [status\_failed\_system\_evaluation\_window](#input\_status\_failed\_system\_evaluation\_window) | Evaluation 
window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [status\_failed\_system\_no\_data\_window](#input\_status\_failed\_system\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [status\_failed\_system\_threshold\_critical](#input\_status\_failed\_system\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [status\_failed\_system\_threshold\_warning](#input\_status\_failed\_system\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [status\_failed\_volume\_enabled](#input\_status\_failed\_volume\_enabled) | Enable attached volume status monitor | `bool` | `false` | no | +| [status\_failed\_system\_use\_message](#input\_status\_failed\_system\_use\_message) | Whether to use the query alert base message for instance system failure monitor | `bool` | `false` | no | +| [status\_failed\_volume\_enabled](#input\_status\_failed\_volume\_enabled) | Enable attached volume status monitor | `bool` | `true` | no | | [status\_failed\_volume\_evaluation\_window](#input\_status\_failed\_volume\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [status\_failed\_volume\_no\_data\_window](#input\_status\_failed\_volume\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [status\_failed\_volume\_threshold\_critical](#input\_status\_failed\_volume\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [status\_failed\_volume\_threshold\_warning](#input\_status\_failed\_volume\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [status\_failed\_volume\_use\_message](#input\_status\_failed\_volume\_use\_message) | Whether to use the query alert base message for attached volume 
status monitor | `bool` | `false` | no | | [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no | | [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no | diff --git a/aws/ec2/main.tf b/aws/ec2/main.tf index 3a75582..337c979 100644 --- a/aws/ec2/main.tf +++ b/aws/ec2/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -12,8 +12,8 @@ resource "datadog_monitor" "status_failed_check" { count = var.status_failed_check_enabled ? 1 : 0 name = join("", [local.title_prefix, "EC2 instance status - status check failure - {{name.name}}({{instance_id.name}})", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.status_failed_check_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -26,7 +26,7 @@ resource "datadog_monitor" "status_failed_check" { query = <= 1 END @@ -39,8 +39,8 @@ resource "datadog_monitor" "status_failed_instance" { count = var.status_failed_instance_enabled ? 1 : 0 name = join("", [local.title_prefix, "EC2 instance status - instance failure - {{name.name}}({{instance_id.name}})", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.status_failed_instance_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -53,7 +53,7 @@ resource "datadog_monitor" "status_failed_instance" { query = <= 1 END @@ -66,8 +66,8 @@ resource "datadog_monitor" "status_failed_system" { count = var.status_failed_system_enabled ? 1 : 0 name = join("", [local.title_prefix, "EC2 instance status - host failure - {{name.name}}({{instance_id.name}})", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.status_failed_system_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -80,7 +80,7 @@ resource "datadog_monitor" "status_failed_system" { query = <= 1 END @@ -93,8 +93,8 @@ resource "datadog_monitor" "status_failed_volume" { count = var.status_failed_volume_enabled ? 1 : 0 name = join("", [local.title_prefix, "EC2 instance status - volume failure - {{name.name}}({{instance_id.name}})", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.status_failed_volume_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -107,7 +107,7 @@ resource "datadog_monitor" "status_failed_volume" { query = <= 1 END diff --git a/aws/ec2/variables.tf b/aws/ec2/variables.tf index b27bf0d..6aaed78 100644 --- a/aws/ec2/variables.tf +++ b/aws/ec2/variables.tf @@ -34,6 +34,12 @@ variable "status_failed_check_no_data_window" { type = number } +variable "status_failed_check_use_message" { + description = "Whether to use the query alert base message for ec2 instance status check monitor" + type = bool + default = false +} + ######################################## # Instance status check ######################################## @@ -55,6 +61,12 @@ variable "status_failed_instance_no_data_window" { type = number } +variable "status_failed_instance_use_message" { + description = "Whether to use the query alert base message for instance status check monitor" + type = bool + default = false +} + ##################################### # system host status check ######################################## @@ -76,6 +88,12 @@ variable "status_failed_system_no_data_window" { type = number } +variable "status_failed_system_use_message" { + description = "Whether to use the query alert base message for instance system failure monitor" + type = bool + default = false +} + ##################################### # Attached volume status check ######################################## @@ -96,3 +114,9 @@ variable "status_failed_volume_no_data_window" { description = "No data threshold (in minutes, 0 to disable)" type = number } + +variable "status_failed_volume_use_message" { + description = "Whether to use the query alert base message for attached volume status monitor" + type = bool + default = false +} diff --git a/aws/ecs-cluster/README.md b/aws/ecs-cluster/README.md index 479ac5c..cdbab68 100644 --- a/aws/ecs-cluster/README.md +++ b/aws/ecs-cluster/README.md @@ -19,7 +19,7 @@ Configures the 
following for ECS clusters based on tags matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -44,6 +44,7 @@ No modules. | [agent\_status\_no\_data\_window](#input\_agent\_status\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [agent\_status\_threshold\_critical](#input\_agent\_status\_threshold\_critical) | Critical threshold | `number` | `5` | no | | [agent\_status\_threshold\_warning](#input\_agent\_status\_threshold\_warning) | Warning threshold | `number` | `3` | no | +| [agent\_status\_use\_message](#input\_agent\_status\_use\_message) | Whether to use the query alert base message for agent status monitor | `bool` | `false` | no | | [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | @@ -59,26 +60,32 @@ No modules. 
| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no | | [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable cluster CPU utilization monitor | `bool` | `false` | no | | [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | | [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | 
Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [memory\_reservation\_enabled](#input\_memory\_reservation\_enabled) | Enable cluster memory reservation monitor | `bool` | `false` | no | | [memory\_reservation\_evaluation\_window](#input\_memory\_reservation\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | | [memory\_reservation\_no\_data\_window](#input\_memory\_reservation\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [memory\_reservation\_threshold\_critical](#input\_memory\_reservation\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | | [memory\_reservation\_threshold\_warning](#input\_memory\_reservation\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [memory\_reservation\_use\_message](#input\_memory\_reservation\_use\_message) | Whether to use the query alert base message for memory reservation monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | diff --git a/aws/ecs-cluster/main.tf b/aws/ecs-cluster/main.tf index 82da113..60e5208 100644 --- a/aws/ecs-cluster/main.tf +++ 
b/aws/ecs-cluster/main.tf @@ -5,7 +5,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -13,10 +13,10 @@ resource "datadog_monitor" "agent_status" { count = var.agent_status_enabled ? 1 : 0 name = join("", [local.title_prefix, "ECS Agent disconnected - {{clustername.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.agent_status_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) - type = "service check" + type = "service check" evaluation_delay = var.evaluation_delay new_group_delay = var.new_group_delay @@ -27,7 +27,7 @@ resource "datadog_monitor" "agent_status" { timeout_h = var.timeout_h query = < ${var.cpu_utilization_threshold_critical} END @@ -69,8 +69,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { count = var.cpu_utilization_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "ECS cluster CPU utilization anomalous activity - {{clustername.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -84,7 +84,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { query = <= ${var.cpu_utilization_anomaly_threshold_critical} @@ -105,8 +105,8 @@ resource "datadog_monitor" "memory_reservation" { count = var.memory_reservation_enabled ? 
1 : 0 name = join("", [local.title_prefix, "ECS Cluster Memory Reservation High - {{clustername.name}} - {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.memory_reservation_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -120,7 +120,7 @@ resource "datadog_monitor" "memory_reservation" { query = < ${var.memory_reservation_threshold_critical} END diff --git a/aws/ecs-cluster/variables.tf b/aws/ecs-cluster/variables.tf index e6cd277..6671c12 100644 --- a/aws/ecs-cluster/variables.tf +++ b/aws/ecs-cluster/variables.tf @@ -46,6 +46,12 @@ variable "agent_status_threshold_warning" { type = number } +variable "agent_status_use_message" { + description = "Whether to use the query alert base message for agent status monitor" + type = bool + default = false +} + ######################################## # Cluster CPU Utilization ######################################## @@ -79,6 +85,12 @@ variable "cpu_utilization_threshold_warning" { type = number } +variable "cpu_utilization_use_message" { + description = "Whether to use the query alert base message for CPU utilization monitor" + type = bool + default = false +} + ######################################## # CPU Utilization (anomaly detection) ######################################## @@ -142,6 +154,12 @@ variable "cpu_utilization_anomaly_threshold_warning" { type = number } +variable "cpu_utilization_anomaly_use_message" { + description = "Whether to use the query alert base message for CPU utilization anomaly monitor" + type = bool + default = false +} + ######################################## # Cluster Memory Reservation ######################################## @@ -173,3 +191,9 @@ variable "memory_reservation_threshold_warning" { description = "Warning threshold (percent)" type = number } + +variable "memory_reservation_use_message" { + 
description = "Whether to use the query alert base message for memory reservation monitor" + type = bool + default = false +} diff --git a/aws/ecs-fargate/README.md b/aws/ecs-fargate/README.md index fc4875e..9977961 100644 --- a/aws/ecs-fargate/README.md +++ b/aws/ecs-fargate/README.md @@ -19,7 +19,7 @@ Configures the following for ECS Fargate tasks based on tag matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -54,32 +54,39 @@ No modules. | [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no | | [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable Fargate task CPU utilization monitor | `bool` | `false` | no | | [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | | 
[cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | -| [fargate\_check\_enabled](#input\_fargate\_check\_enabled) | Enable Fargate check monitor | `bool` | `false` | no | +| [fargate\_check\_enabled](#input\_fargate\_check\_enabled) | Enable Fargate check monitor | `bool` | `true` | no | | [fargate\_check\_evaluation\_window](#input\_fargate\_check\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [fargate\_check\_group\_by](#input\_fargate\_check\_group\_by) | Tag to group alerts by (will result in multiple alerts being generated based on tag cardinality) | `string` | `"*"` | no | | [fargate\_check\_no\_data\_window](#input\_fargate\_check\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [fargate\_check\_threshold\_critical](#input\_fargate\_check\_threshold\_critical) | Critical threshold | `number` | `5` | no | | [fargate\_check\_threshold\_warning](#input\_fargate\_check\_threshold\_warning) | Warning threshold | `number` | `3` | no | +| 
[fargate\_check\_use\_message](#input\_fargate\_check\_use\_message) | Whether to use the query alert base message for Fargate check monitor | `bool` | `false` | no | | [memory\_utilization\_enabled](#input\_memory\_utilization\_enabled) | Enable Fargate task memory utilization monitor | `bool` | `false` | no | | [memory\_utilization\_evaluation\_window](#input\_memory\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | | [memory\_utilization\_no\_data\_window](#input\_memory\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [memory\_utilization\_threshold\_critical](#input\_memory\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | | [memory\_utilization\_threshold\_warning](#input\_memory\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [memory\_utilization\_use\_message](#input\_memory\_utilization\_use\_message) | Whether to use the query alert base message for memory utilization monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | diff --git a/aws/ecs-fargate/main.tf b/aws/ecs-fargate/main.tf index 7bd1431..5b192a1 100644 --- a/aws/ecs-fargate/main.tf +++ 
b/aws/ecs-fargate/main.tf @@ -5,7 +5,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -13,8 +13,8 @@ resource "datadog_monitor" "fargate_check" { count = var.fargate_check_enabled ? 1 : 0 name = join("", [local.title_prefix, "Fargate service not responding", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.fargate_check_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "service check" @@ -40,9 +40,9 @@ END resource "datadog_monitor" "cpu_utilization" { count = var.cpu_utilization_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ECS Fargate task CPU utilization", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ECS Fargate task CPU utilization - {{ecs_cluster}} ({{task_family}})", local.title_suffix]) + include_tags = false + message = var.cpu_utilization_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -56,7 +56,7 @@ resource "datadog_monitor" "cpu_utilization" { query = < ${var.cpu_utilization_threshold_critical} END @@ -69,9 +69,9 @@ END resource "datadog_monitor" "cpu_utilization_anomaly" { count = var.cpu_utilization_anomaly_enabled ? 
1 : 0 - name = join("", [local.title_prefix, "ECS service CPU utilization anomalous activity", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ECS service CPU utilization anomalous activity - {{ecs_cluster}} ({{task_family}})", local.title_suffix]) + include_tags = false + message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -85,7 +85,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { query = <= ${var.cpu_utilization_anomaly_threshold_critical} @@ -105,9 +105,9 @@ END resource "datadog_monitor" "memory_utilization" { count = var.memory_utilization_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ECS Fargate task memory utilization", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ECS Fargate task memory utilization - {{ecs_cluster}} ({{task_family}})", local.title_suffix]) + include_tags = false + message = var.memory_utilization_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -121,8 +121,8 @@ resource "datadog_monitor" "memory_utilization" { query = <= ${var.memory_utilization_threshold_critical} END diff --git a/aws/ecs-fargate/variables.tf b/aws/ecs-fargate/variables.tf index 844ddb0..272f46e 100644 --- a/aws/ecs-fargate/variables.tf +++ b/aws/ecs-fargate/variables.tf @@ -17,7 +17,7 @@ variable "base_tags" { # Fargate Agent Status ######################################## variable "fargate_check_enabled" { - default = false + default = true description = "Enable Fargate check monitor" type = bool } @@ -52,6 +52,12 @@ variable "fargate_check_threshold_warning" { type = number } +variable "fargate_check_use_message" { + description = "Whether to use the query alert base message for Fargate check monitor" + type = bool + default = false +} + ######################################## # Fargate Task CPU Utilization ######################################## @@ -85,6 +91,12 @@ variable "cpu_utilization_threshold_warning" { type = number } +variable "cpu_utilization_use_message" { + description = "Whether to use the query alert base message for CPU utilization monitor" + type = bool + default = false +} + ######################################## # CPU Utilization (anomaly detection) ######################################## @@ -148,6 +160,12 @@ variable "cpu_utilization_anomaly_threshold_warning" { type = number } +variable "cpu_utilization_anomaly_use_message" { + description = "Whether to use the query alert base message for CPU utilization anomaly monitor" + type = bool + default = false +} + ######################################## # Fargate Task Memory Reservation ######################################## @@ -179,3 +197,9 @@ variable "memory_utilization_threshold_warning" { description = "Warning threshold (percent)" type = number } + +variable "memory_utilization_use_message" { + description = "Whether to use the 
query alert base message for memory utilization monitor" + type = bool + default = false +} diff --git a/aws/ecs-service/README.md b/aws/ecs-service/README.md index f11e074..c7db7ba 100644 --- a/aws/ecs-service/README.md +++ b/aws/ecs-service/README.md @@ -19,7 +19,7 @@ Configures the following for ECS services based on tag matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -51,38 +51,45 @@ No modules. | [cpu\_utilization\_anomaly\_recovery\_window](#input\_cpu\_utilization\_anomaly\_recovery\_window) | Recovery window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | | [cpu\_utilization\_anomaly\_rollup](#input\_cpu\_utilization\_anomaly\_rollup) | Rollup interval (must be sized based on evaluation window/span and seasonaility) | `number` | `60` | no | | [cpu\_utilization\_anomaly\_seasonality](#input\_cpu\_utilization\_anomaly\_seasonality) | Seasonaility (hourly, daily, weekly) | `string` | `"weekly"` | no | -| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no | +| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `0.75` | no | | [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | -| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable Fargate task CPU utilization monitor | `bool` | `false` | no | +| 
[cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no | +| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable Fargate task CPU utilization monitor | `bool` | `true` | no | | [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `string` | `90` | no | | [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [memory\_utilization\_enabled](#input\_memory\_utilization\_enabled) | Enable Fargate task memory utilization monitor | `bool` | `false` | no | | [memory\_utilization\_evaluation\_window](#input\_memory\_utilization\_evaluation\_window) | Evaluation window for 
monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | | [memory\_utilization\_no\_data\_window](#input\_memory\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [memory\_utilization\_threshold\_critical](#input\_memory\_utilization\_threshold\_critical) | Critical threshold (percent) | `string` | `0.9` | no | | [memory\_utilization\_threshold\_warning](#input\_memory\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `0.8` | no | +| [memory\_utilization\_use\_message](#input\_memory\_utilization\_use\_message) | Whether to use the query alert base message for memory utilization monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | | [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no | -| 
[running\_tasks\_enabled](#input\_running\_tasks\_enabled) | Enable running tasks monitor | `bool` | `false` | no | +| [running\_tasks\_enabled](#input\_running\_tasks\_enabled) | Enable running tasks monitor | `bool` | `true` | no | | [running\_tasks\_evaluation\_window](#input\_running\_tasks\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [running\_tasks\_no\_data\_window](#input\_running\_tasks\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [running\_tasks\_threshold\_critical](#input\_running\_tasks\_threshold\_critical) | Critical threshold (percentage) | `number` | `0.25` | no | +| [running\_tasks\_threshold\_critical](#input\_running\_tasks\_threshold\_critical) | Critical threshold (percentage) | `number` | `0.5` | no | | [running\_tasks\_threshold\_warning](#input\_running\_tasks\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no | +| [running\_tasks\_use\_message](#input\_running\_tasks\_use\_message) | Whether to use the query alert base message for running tasks monitor | `bool` | `true` | no | | [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no | diff --git a/aws/ecs-service/main.tf b/aws/ecs-service/main.tf index 0365e9b..677893b 100644 --- a/aws/ecs-service/main.tf +++ b/aws/ecs-service/main.tf @@ -5,7 +5,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? 
"" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -13,8 +13,8 @@ resource "datadog_monitor" "running_tasks" { count = var.running_tasks_enabled ? 1 : 0 name = join("", [local.title_prefix, "ECS service failed tasks - {{servicename.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.running_tasks_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -28,8 +28,8 @@ resource "datadog_monitor" "running_tasks" { query = <= ${var.cpu_utilization_threshold_critical} END @@ -72,8 +72,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { count = var.cpu_utilization_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "ECS service CPU utilization anomalous activity - {{servicename.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -87,7 +87,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { query = <= ${var.cpu_utilization_anomaly_threshold_critical} @@ -108,8 +108,8 @@ resource "datadog_monitor" "memory_utilization" { count = var.memory_utilization_enabled ? 1 : 0 name = join("", [local.title_prefix, "ECS Service memory utilization - {{servicename.name}} - {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.memory_utilization_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -123,7 +123,7 @@ resource "datadog_monitor" "memory_utilization" { query = <= ${var.memory_utilization_threshold_critical} END diff --git a/aws/ecs-service/variables.tf b/aws/ecs-service/variables.tf index ba8fd6e..0c7baef 100644 --- a/aws/ecs-service/variables.tf +++ b/aws/ecs-service/variables.tf @@ -17,7 +17,7 @@ variable "base_tags" { # ECS service running tasks ######################################## variable "running_tasks_enabled" { - default = false + default = true description = "Enable running tasks monitor" type = bool } @@ -35,7 +35,7 @@ variable "running_tasks_no_data_window" { } variable "running_tasks_threshold_critical" { - default = 0.25 + default = 0.50 description = "Critical threshold (percentage)" type = number } @@ -46,11 +46,17 @@ variable "running_tasks_threshold_warning" { type = number } +variable "running_tasks_use_message" { + description = "Whether to use the query alert base message for running tasks monitor" + type = bool + default = true +} + ######################################## # Service CPU Utilization ######################################## variable "cpu_utilization_enabled" { - default = false + default = true description = "Enable Fargate task CPU utilization monitor" type = bool } @@ -79,6 +85,12 @@ variable "cpu_utilization_threshold_warning" { type = number } +variable "cpu_utilization_use_message" { + description = "Whether to use the query alert base message for CPU utilization monitor" + type = bool + default = false +} + ######################################## # CPU Utilization (anomaly detection) ######################################## @@ -131,7 +143,7 @@ variable "cpu_utilization_anomaly_trigger_window" { } variable "cpu_utilization_anomaly_threshold_critical" { - default = null + default = 0.75 description = "Critical threshold (percent)" type = number } @@ -142,6 +154,13 @@ variable 
"cpu_utilization_anomaly_threshold_warning" { type = number } + +variable "cpu_utilization_anomaly_use_message" { + description = "Whether to use the query alert base message for CPU utilization anomaly monitor" + type = bool + default = false +} + ######################################## # Service Memory Reservation ######################################## @@ -173,3 +192,9 @@ variable "memory_utilization_threshold_warning" { description = "Warning threshold (percent)" type = number } + +variable "memory_utilization_use_message" { + description = "Whether to use the query alert base message for memory utilization monitor" + type = bool + default = false +} diff --git a/aws/elasticache/README.md b/aws/elasticache/README.md index 55933f8..67890f6 100644 --- a/aws/elasticache/README.md +++ b/aws/elasticache/README.md @@ -24,7 +24,7 @@ Configures the following for ElastiCache clusters based on tag matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -62,42 +62,51 @@ No modules. 
| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no | | [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable CPU utilization monitor | `bool` | `false` | no | | [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | | [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment 
the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | | [evictions\_enabled](#input\_evictions\_enabled) | Enable eviction rate monitor | `bool` | `false` | no | | [evictions\_evaluation\_window](#input\_evictions\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [evictions\_no\_data\_window](#input\_evictions\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [evictions\_threshold\_critical](#input\_evictions\_threshold\_critical) | Critical threshold (count) | `number` | `null` | no | | [evictions\_threshold\_warning](#input\_evictions\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no | +| [evictions\_use\_message](#input\_evictions\_use\_message) | Whether to use the query alert base message for evictions monitor | `bool` | `false` | no | | [hit\_rate\_anomaly\_deviations](#input\_hit\_rate\_anomaly\_deviations) | Standard deviations | `number` | `2` | no | | [hit\_rate\_anomaly\_enabled](#input\_hit\_rate\_anomaly\_enabled) | Enable cache hit rate anomaly monitor | `bool` | `false` | no | | [hit\_rate\_anomaly\_evaluation\_window](#input\_hit\_rate\_anomaly\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | | [hit\_rate\_anomaly\_no\_data\_window](#input\_hit\_rate\_anomaly\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [hit\_rate\_anomaly\_seasonality](#input\_hit\_rate\_anomaly\_seasonality) | Seasonaility (hourly, daily, weekly) | `string` | `"daily"` | no | | 
[hit\_rate\_anomaly\_threshold\_critical](#input\_hit\_rate\_anomaly\_threshold\_critical) | Critical threshold (percentage) | `number` | `null` | no | +| [hit\_rate\_anomaly\_use\_message](#input\_hit\_rate\_anomaly\_use\_message) | Whether to use the query alert base message for hit rate anomaly monitor | `bool` | `false` | no | | [hit\_rate\_enabled](#input\_hit\_rate\_enabled) | Enable cache hit rate monitor | `bool` | `false` | no | | [hit\_rate\_evaluation\_window](#input\_hit\_rate\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [hit\_rate\_no\_data\_window](#input\_hit\_rate\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [hit\_rate\_threshold\_critical](#input\_hit\_rate\_threshold\_critical) | Critical threshold (percentage) | `number` | `null` | no | | [hit\_rate\_threshold\_warning](#input\_hit\_rate\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no | +| [hit\_rate\_use\_message](#input\_hit\_rate\_use\_message) | Whether to use the query alert base message for hit rate monitor | `bool` | `false` | no | | [max\_connections\_enabled](#input\_max\_connections\_enabled) | Enable max connections monitor | `bool` | `false` | no | | [max\_connections\_evaluation\_window](#input\_max\_connections\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [max\_connections\_no\_data\_window](#input\_max\_connections\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [max\_connections\_threshold\_critical](#input\_max\_connections\_threshold\_critical) | Critical threshold (connections) | `number` | `64000` | no | | [max\_connections\_threshold\_warning](#input\_max\_connections\_threshold\_warning) | Warning threshold (connections) | 
`number` | `60000` | no | +| [max\_connections\_use\_message](#input\_max\_connections\_use\_message) | Whether to use the query alert base message for max connections monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | 
`list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | @@ -108,6 +117,7 @@ No modules. | [swap\_usage\_no\_data\_window](#input\_swap\_usage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [swap\_usage\_threshold\_critical](#input\_swap\_usage\_threshold\_critical) | Critical threshold (bytes) | `number` | `52428800` | no | | [swap\_usage\_threshold\_warning](#input\_swap\_usage\_threshold\_warning) | Warning threshold (bytes) | `number` | `null` | no | +| [swap\_usage\_use\_message](#input\_swap\_usage\_use\_message) | Whether to use the query alert base message for swap usage monitor | `bool` | `false` | no | | [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no | | [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no | diff --git a/aws/elasticache/main.tf b/aws/elasticache/main.tf index 3f7c8a5..2ad69b1 100644 --- a/aws/elasticache/main.tf +++ b/aws/elasticache/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -12,8 +12,8 @@ resource "datadog_monitor" "cpu_utilization" { count = var.cpu_utilization_enabled ? 
1 : 0 name = join("", [local.title_prefix, "Elasticache CPU Utilization - {{cacheclusterid.name}} - {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,7 +27,7 @@ resource "datadog_monitor" "cpu_utilization" { query = <= ${var.cpu_utilization_threshold_critical} END @@ -41,8 +41,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { count = var.cpu_utilization_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "Elasticache CPU utilization anomalous activity - {{cacheclusterid.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -56,7 +56,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { query = <= ${var.cpu_utilization_anomaly_threshold_critical} @@ -71,8 +71,8 @@ resource "datadog_monitor" "evictions" { count = var.evictions_enabled ? 1 : 0 name = join("", [local.title_prefix, "Elasticache evictions - {{cacheclusterid.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.evictions_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -86,7 +86,7 @@ resource "datadog_monitor" "evictions" { query = <= ${var.evictions_threshold_critical} END @@ -100,8 +100,8 @@ resource "datadog_monitor" "hit_rate" { count = var.hit_rate_enabled ? 
1 : 0 name = join("", [local.title_prefix, "Elasticache cache hit rate - {{cacheclusterid.name}} - {{value}}% ", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.hit_rate_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -115,7 +115,7 @@ resource "datadog_monitor" "hit_rate" { query = <= ${var.hit_rate_threshold_critical} END @@ -129,8 +129,8 @@ resource "datadog_monitor" "hit_rate_anomaly" { count = var.hit_rate_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "Elasticache cache hit rate anomalous activity - {{cacheclusterid.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.hit_rate_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -144,7 +144,7 @@ resource "datadog_monitor" "hit_rate_anomaly" { query = <= ${var.hit_rate_anomaly_threshold_critical} @@ -159,8 +159,8 @@ resource "datadog_monitor" "max_connections" { count = var.max_connections_enabled ? 1 : 0 name = join("", [local.title_prefix, "Elasticache max connections reached - {{cacheclusterid.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.max_connections_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -174,7 +174,7 @@ resource "datadog_monitor" "max_connections" { query = <= ${var.max_connections_threshold_critical} END @@ -188,8 +188,8 @@ resource "datadog_monitor" "swap_usage" { count = var.swap_usage_enabled ? 
1 : 0 name = join("", [local.title_prefix, "Elasticache swap usage - {{cacheclusterid.name}} - {{value}}MB", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.swap_usage_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -203,7 +203,7 @@ resource "datadog_monitor" "swap_usage" { query = < [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -45,12 +45,14 @@ No modules. | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
[
"resource:elasticsearch"
]
| no | -| [cluster\_health\_red\_enabled](#input\_cluster\_health\_red\_enabled) | Enable cluster health\_red monitor | `bool` | `false` | no | +| [cluster\_health\_red\_enabled](#input\_cluster\_health\_red\_enabled) | Enable cluster health\_red monitor | `bool` | `true` | no | | [cluster\_health\_red\_evaluation\_window](#input\_cluster\_health\_red\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cluster\_health\_red\_no\_data\_window](#input\_cluster\_health\_red\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [cluster\_health\_yellow\_enabled](#input\_cluster\_health\_yellow\_enabled) | Enable cluster health monitor | `bool` | `false` | no | +| [cluster\_health\_red\_use\_message](#input\_cluster\_health\_red\_use\_message) | Whether to use the query alert base message for cluster health red monitor | `bool` | `true` | no | +| [cluster\_health\_yellow\_enabled](#input\_cluster\_health\_yellow\_enabled) | Enable cluster health monitor | `bool` | `true` | no | | [cluster\_health\_yellow\_evaluation\_window](#input\_cluster\_health\_yellow\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cluster\_health\_yellow\_no\_data\_window](#input\_cluster\_health\_yellow\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [cluster\_health\_yellow\_use\_message](#input\_cluster\_health\_yellow\_use\_message) | Whether to use the query alert base message for cluster health yellow monitor | `bool` | `false` | no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [cpu\_utilization\_anomaly\_deviations](#input\_cpu\_utilization\_anomaly\_deviations) | Standard deviations | `number` | `4` | 
no | | [cpu\_utilization\_anomaly\_enabled](#input\_cpu\_utilization\_anomaly\_enabled) | Enable CPU utilization anomaly monitor | `bool` | `false` | no | @@ -62,26 +64,32 @@ No modules. | [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | | [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no | | [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable CPU utilization monitor | `bool` | `false` | no | | [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `0.9` | no | | [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `0.8` | no | +| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in 
message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | -| [free\_storage\_enabled](#input\_free\_storage\_enabled) | Enable free storage monitor | `bool` | `false` | no | +| [free\_storage\_enabled](#input\_free\_storage\_enabled) | Enable free storage monitor | `bool` | `true` | no | | [free\_storage\_evaluation\_window](#input\_free\_storage\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [free\_storage\_no\_data\_window](#input\_free\_storage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [free\_storage\_threshold\_critical](#input\_free\_storage\_threshold\_critical) | Critical threshold (GB) | `number` | `null` | no | -| [free\_storage\_threshold\_warning](#input\_free\_storage\_threshold\_warning) | Warning threshold (GB) | `number` | `null` | no | +| [free\_storage\_threshold\_critical](#input\_free\_storage\_threshold\_critical) | Critical threshold for used disk space (%) | `number` | `90` | no | +| [free\_storage\_threshold\_warning](#input\_free\_storage\_threshold\_warning) | Warning threshold for used disk space (%) | `number` | `80` | no | +| [free\_storage\_use\_message](#input\_free\_storage\_use\_message) | Whether to use the query alert base message for free storage monitor | `bool` | `true` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about 
an alert | `number` | `0` | no | diff --git a/aws/elasticsearch/main.tf b/aws/elasticsearch/main.tf index 632e503..479754c 100644 --- a/aws/elasticsearch/main.tf +++ b/aws/elasticsearch/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -12,8 +12,8 @@ resource "datadog_monitor" "cluster_health_red" { count = var.cluster_health_red_enabled ? 1 : 0 name = join("", [local.title_prefix, "ElasticSearch cluster health red - {{name.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cluster_health_red_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,7 +27,7 @@ resource "datadog_monitor" "cluster_health_red" { query = <= 1 END @@ -40,8 +40,8 @@ resource "datadog_monitor" "cluster_health_yellow" { count = var.cluster_health_yellow_enabled ? 1 : 0 name = join("", [local.title_prefix, "ElasticSearch cluster health yellow - {{name.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cluster_health_yellow_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -55,7 +55,7 @@ resource "datadog_monitor" "cluster_health_yellow" { query = <= 1 END @@ -68,8 +68,8 @@ resource "datadog_monitor" "cpu_utilization" { count = var.cpu_utilization_enabled ? 
1 : 0 name = join("", [local.title_prefix, "ElasticSearch CPU Utilization - {{name.name}} - {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -83,7 +83,7 @@ resource "datadog_monitor" "cpu_utilization" { query = <= ${var.cpu_utilization_threshold_critical} END @@ -97,8 +97,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { count = var.cpu_utilization_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "ElasticSearch CPU utilization anomalous activity - {{name.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -133,8 +133,8 @@ resource "datadog_monitor" "free_storage" { count = var.free_storage_enabled ? 1 : 0 name = join("", [local.title_prefix, "ElasticSearch cluster storage - {{name.name}} - {{value}}% used", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.free_storage_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -148,9 +148,9 @@ resource "datadog_monitor" "free_storage" { query = < ${var.free_storage_threshold_critical} EOQ diff --git a/aws/elasticsearch/variables.tf b/aws/elasticsearch/variables.tf index 971cdd4..d251705 100644 --- a/aws/elasticsearch/variables.tf +++ b/aws/elasticsearch/variables.tf @@ -17,7 +17,7 @@ variable "base_tags" { # ElasticSearch cluster health (red) ######################################## variable "cluster_health_red_enabled" { - default = false + default = true description = "Enable cluster health_red monitor" type = bool } @@ -34,11 +34,17 @@ variable "cluster_health_red_no_data_window" { type = number } +variable "cluster_health_red_use_message" { + description = "Whether to use the query alert base message for cluster health red monitor" + type = bool + default = true +} + ####################################### # ElasticSearch cluster health (yellow) ######################################## variable "cluster_health_yellow_enabled" { - default = false + default = true description = "Enable cluster health monitor" type = bool } @@ -55,11 +61,17 @@ variable "cluster_health_yellow_no_data_window" { type = number } +variable "cluster_health_yellow_use_message" { + description = "Whether to use the query alert base message for cluster health yellow monitor" + type = bool + default = false +} + ######################################## # Node CPU Utilization ######################################## variable "cpu_utilization_enabled" { - default = false + default = true description = "Enable CPU utilization monitor" type = bool } @@ -88,6 +100,12 @@ variable "cpu_utilization_threshold_warning" { type = number } +variable "cpu_utilization_use_message" { + description = "Whether to use the query alert base message for CPU utilization monitor" + type = bool + default = false +} + ######################################## # 
CPU Utilization (anomaly detection) ######################################## @@ -151,6 +169,12 @@ variable "cpu_utilization_anomaly_threshold_warning" { type = number } +variable "cpu_utilization_anomaly_use_message" { + description = "Whether to use the query alert base message for CPU utilization anomaly monitor" + type = bool + default = false +} + ######################################## # ElasticSearch cluster free storage ######################################## @@ -173,13 +197,19 @@ variable "free_storage_evaluation_window" { } variable "free_storage_threshold_critical" { - default = null - description = "Critical threshold (GB)" + default = 90 + description = "Critical threshold for used disk space (%)" type = number } variable "free_storage_threshold_warning" { - default = null - description = "Warning threshold (GB)" + default = 80 + description = "Warning threshold for used disk space (%)" type = number } + +variable "free_storage_use_message" { + description = "Whether to use the query alert base message for free storage monitor" + type = bool + default = true +} diff --git a/aws/elb/README.md b/aws/elb/README.md index 9063d12..a0edca2 100644 --- a/aws/elb/README.md +++ b/aws/elb/README.md @@ -20,7 +20,7 @@ Configures the following for Classic ELBs based on tag matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -30,8 +30,8 @@ No modules. 
| Name | Type | |------|------| +| [datadog_monitor.http_5xx_backend_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | | [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | | [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | | [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | @@ -43,37 +43,45 @@ No modules. | [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | -| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
<pre>[<br>  "resource:alb"<br>]</pre>
| no | +| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
<pre>[<br>  "resource:lb"<br>]</pre>
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | +| [http\_5xx\_backend\_responses\_enabled](#input\_http\_5xx\_backend\_responses\_enabled) | Enable HTTP 5xx response monitor (backend) | `bool` | `false` | no | +| [http\_5xx\_backend\_responses\_evaluation\_window](#input\_http\_5xx\_backend\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [http\_5xx\_backend\_responses\_no\_data\_window](#input\_http\_5xx\_backend\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [http\_5xx\_backend\_responses\_threshold\_critical](#input\_http\_5xx\_backend\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | +| [http\_5xx\_backend\_responses\_threshold\_warning](#input\_http\_5xx\_backend\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [http\_5xx\_backend\_responses\_use\_message](#input\_http\_5xx\_backend\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx backend responses monitor | `bool` | `false` | no | | [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no 
| | [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | | [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no | -| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message for HTTP 5xx responses monitor | `bool` | `false` | no | | [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | | [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation 
window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | | [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | +| [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message for latency monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | | [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no | | [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | | [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no | +| [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Warning threshold (percentage) | `number` | `0` | 
no | +| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no | +| [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message for no healthy instances monitor | `bool` | `true` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an 
alert | `number` | `0` | no | diff --git a/aws/elb/main.tf b/aws/elb/main.tf index 182c7e2..dfce887 100644 --- a/aws/elb/main.tf +++ b/aws/elb/main.tf @@ -4,16 +4,16 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } resource "datadog_monitor" "http_5xx_responses" { count = var.http_5xx_responses_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ELB 5xx Responses - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ELB 5xx Responses - {{loadbalancername.name}}", local.title_suffix]) + include_tags = false + message = var.http_5xx_responses_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" { query = < ${var.http_5xx_responses_threshold_critical} END @@ -41,9 +41,9 @@ END resource "datadog_monitor" "http_5xx_backend_responses" { count = var.http_5xx_backend_responses_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ELB Backend 5xx Responses - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ELB Backend 5xx Responses - {{loadbalancername.name}}", local.title_suffix]) + include_tags = false + message = var.http_5xx_backend_responses_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_backend_responses" { query = < ${var.http_5xx_backend_responses_threshold_critical} END @@ -72,9 +72,9 @@ END resource "datadog_monitor" "latency" { count = var.latency_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ELB backend latency - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ELB backend latency - {{loadbalancername.name}}", local.title_suffix]) + include_tags = false + message = var.latency_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" { query = < ${var.latency_threshold_critical} END @@ -101,9 +101,9 @@ END resource "datadog_monitor" "no_healthy_instances" { count = var.no_healthy_instances_enabled ? 1 : 0 - name = join("", [local.title_prefix, "ALB healthy instances - {{host.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + name = join("", [local.title_prefix, "ALB healthy instances - {{loadbalancername.name}}", local.title_suffix]) + include_tags = false + message = var.no_healthy_instances_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" { query = < [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -33,10 +33,13 @@ No modules. 
| Name | Type | |------|------| -| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.cold_starts](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.error_rate](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.iterator_age](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.iterator_age_forecast](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.out_of_memory](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.throttle_rate](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.timeouts](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | ## Inputs @@ -46,44 +49,68 @@ No modules. 
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | -| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
<pre>[<br>  "resource:alb"<br>]</pre>
| no | +| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
<pre>[<br>  "resource:lambda"<br>]</pre>
| no | +| [cold\_starts\_enabled](#input\_cold\_starts\_enabled) | Enable cold starts monitor (requires enhanced metrics) | `bool` | `false` | no | +| [cold\_starts\_evaluation\_window](#input\_cold\_starts\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_4h"` | no | +| [cold\_starts\_no\_data\_window](#input\_cold\_starts\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no | +| [cold\_starts\_threshold\_critical](#input\_cold\_starts\_threshold\_critical) | Critical threshold (count) | `number` | `null` | no | +| [cold\_starts\_threshold\_warning](#input\_cold\_starts\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no | +| [cold\_starts\_use\_message](#input\_cold\_starts\_use\_message) | Whether to use the query alert base message for cold starts monitor | `bool` | `false` | no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | +| [error\_rate\_enabled](#input\_error\_rate\_enabled) | Enable Lambda error rate monitor | `bool` | `true` | no | +| [error\_rate\_evaluation\_window](#input\_error\_rate\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [error\_rate\_no\_data\_window](#input\_error\_rate\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [error\_rate\_threshold\_critical](#input\_error\_rate\_threshold\_critical) | Critical 
threshold (percentage, 0-100) | `number` | `75` | no | +| [error\_rate\_threshold\_warning](#input\_error\_rate\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [error\_rate\_use\_message](#input\_error\_rate\_use\_message) | Whether to use the query alert base message for error rate monitor | `bool` | `true` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | -| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | -| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no | -| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | 
-| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | -| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | -| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | +| [iterator\_age\_enabled](#input\_iterator\_age\_enabled) | Enable iterator age monitor | `bool` | `false` | no | +| [iterator\_age\_evaluation\_window](#input\_iterator\_age\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [iterator\_age\_forecast\_enabled](#input\_iterator\_age\_forecast\_enabled) | Enable iterator age monitor | `bool` | `false` | no | +| [iterator\_age\_forecast\_evaluation\_window](#input\_iterator\_age\_forecast\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1d"` | no | +| [iterator\_age\_forecast\_no\_data\_window](#input\_iterator\_age\_forecast\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no | +| 
[iterator\_age\_forecast\_use\_message](#input\_iterator\_age\_forecast\_use\_message) | Whether to use the query alert base message for iterator age forecast monitor | `bool` | `false` | no | +| [iterator\_age\_no\_data\_window](#input\_iterator\_age\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no | +| [iterator\_age\_threshold\_critical](#input\_iterator\_age\_threshold\_critical) | Critical threshold (milliseconds) | `number` | `86400000` | no | +| [iterator\_age\_threshold\_warning](#input\_iterator\_age\_threshold\_warning) | Warning threshold (milliseconds) | `number` | `null` | no | +| [iterator\_age\_use\_message](#input\_iterator\_age\_use\_message) | Whether to use the query alert base message for iterator age monitor | `bool` | `false` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | -| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no | -| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod 
alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [out\_of\_memory\_enabled](#input\_out\_of\_memory\_enabled) | Enable out of memory monitor (requires enhanced metrics) | `bool` | `true` | no | +| [out\_of\_memory\_evaluation\_window](#input\_out\_of\_memory\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_4h"` | no | +| [out\_of\_memory\_no\_data\_window](#input\_out\_of\_memory\_no\_data\_window) | No data threshold (in minutes, null to disable) | `number` | `null` | no | +| [out\_of\_memory\_threshold\_critical](#input\_out\_of\_memory\_threshold\_critical) | Critical threshold (count) | `number` | `5` | no | +| [out\_of\_memory\_threshold\_warning](#input\_out\_of\_memory\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no | +| [out\_of\_memory\_use\_message](#input\_out\_of\_memory\_use\_message) | Whether to use the query alert base message for out of memory monitor | `bool` | `false` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | | [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no | | [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no | +| [throttle\_rate\_enabled](#input\_throttle\_rate\_enabled) | Enable Lambda 
throttle rate monitor | `bool` | `true` | no | +| [throttle\_rate\_evaluation\_window](#input\_throttle\_rate\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [throttle\_rate\_no\_data\_window](#input\_throttle\_rate\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [throttle\_rate\_threshold\_critical](#input\_throttle\_rate\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | +| [throttle\_rate\_threshold\_warning](#input\_throttle\_rate\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | +| [throttle\_rate\_use\_message](#input\_throttle\_rate\_use\_message) | Whether to use the query alert base message for throttle rate monitor | `bool` | `false` | no | | [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no | +| [timeouts\_enabled](#input\_timeouts\_enabled) | Enable timeout count monitor | `bool` | `true` | no | +| [timeouts\_evaluation\_window](#input\_timeouts\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [timeouts\_no\_data\_window](#input\_timeouts\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [timeouts\_threshold\_critical](#input\_timeouts\_threshold\_critical) | Critical threshold (count) | `number` | `75` | no | +| [timeouts\_threshold\_warning](#input\_timeouts\_threshold\_warning) | Warning threshold (count) | `number` | `25` | no | +| [timeouts\_use\_message](#input\_timeouts\_use\_message) | Whether to use the query alert base message for timeouts monitor | `bool` | `false` | no | | [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no 
| | [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parenthesis | `string` | `null` | no | | [warn\_priority](#input\_warn\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | diff --git a/aws/lambda/main.tf b/aws/lambda/main.tf index 1eb0d13..e37a8f4 100644 --- a/aws/lambda/main.tf +++ b/aws/lambda/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" cold_start_query_filter = local.query_filter == "{*}" ? "{cold_start:true}" : replace(local.query_filter, "{", "{cold_star:true,") @@ -14,8 +14,8 @@ resource "datadog_monitor" "error_rate" { count = var.error_rate_enabled ? 1 : 0 name = join("", [local.title_prefix, "Lambda error rate - {{functionname.name}} - {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.error_rate_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -29,8 +29,8 @@ resource "datadog_monitor" "error_rate" { query = < ${var.error_rate_threshold_critical} END @@ -44,8 +44,8 @@ resource "datadog_monitor" "timeouts" { count = var.timeouts_enabled ? 1 : 0 name = join("", [local.title_prefix, "Lambda timeouts - {{functionname.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.timeouts_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -59,8 +59,8 @@ resource "datadog_monitor" "timeouts" { query = < ${var.timeouts_threshold_critical} END @@ -74,8 +74,8 @@ resource "datadog_monitor" "cold_starts" { count = var.cold_starts_enabled ? 1 : 0 name = join("", [local.title_prefix, "Lambda cold starts - {{functionname.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cold_starts_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -89,8 +89,8 @@ resource "datadog_monitor" "cold_starts" { query = < ${var.cold_starts_threshold_critical} END @@ -104,8 +104,8 @@ resource "datadog_monitor" "out_of_memory" { count = var.out_of_memory_enabled ? 1 : 0 name = join("", [local.title_prefix, "Lambda out of memory - {{functionname.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.out_of_memory_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -134,8 +134,8 @@ resource "datadog_monitor" "iterator_age" { count = var.iterator_age_enabled ? 1 : 0 name = join("", [local.title_prefix, "Lambda iterator age - {{functionname.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.iterator_age_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -149,7 +149,7 @@ resource "datadog_monitor" "iterator_age" { query = < ${var.iterator_age_threshold_critical} END @@ -163,8 +163,8 @@ resource "datadog_monitor" "iterator_age_forecast" { count = var.iterator_age_forecast_enabled ? 
1 : 0 name = join("", [local.title_prefix, "Lambda stream data loss forecasted - {{functionname.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.iterator_age_forecast_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -191,8 +191,8 @@ resource "datadog_monitor" "throttle_rate" { count = var.throttle_rate_enabled ? 1 : 0 name = join("", [local.title_prefix, "Lambda throttle rate - {{functionname.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.throttle_rate_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" diff --git a/aws/lambda/variables.tf b/aws/lambda/variables.tf index 8aa64cc..4332d90 100644 --- a/aws/lambda/variables.tf +++ b/aws/lambda/variables.tf @@ -17,7 +17,7 @@ variable "base_tags" { # Lambda error rate ######################################## variable "error_rate_enabled" { - default = false + default = true description = "Enable Lambda error rate monitor" type = bool } @@ -46,11 +46,17 @@ variable "error_rate_threshold_warning" { type = number } +variable "error_rate_use_message" { + description = "Whether to use the query alert base message for error rate monitor" + type = bool + default = true +} + ######################################## # Lambda timeouts ######################################## variable "timeouts_enabled" { - default = false + default = true description = "Enable timeout count monitor" type = bool } @@ -79,6 +85,12 @@ variable "timeouts_threshold_warning" { type = number } +variable "timeouts_use_message" { + description = "Whether to use the query alert base message for timeouts monitor" + type = bool + default = false +} + ######################################## # Cold start monitor 
######################################## @@ -112,11 +124,17 @@ variable "cold_starts_threshold_warning" { type = number } +variable "cold_starts_use_message" { + description = "Whether to use the query alert base message for cold starts monitor" + type = bool + default = false +} + ######################################## # OOM monitor ######################################## variable "out_of_memory_enabled" { - default = false + default = true description = "Enable out of memory monitor (requires enhanced metrics)" type = bool } @@ -134,7 +152,7 @@ variable "out_of_memory_no_data_window" { } variable "out_of_memory_threshold_critical" { - default = null + default = 5 description = "Critical threshold (count)" type = number } @@ -145,6 +163,12 @@ variable "out_of_memory_threshold_warning" { type = number } +variable "out_of_memory_use_message" { + description = "Whether to use the query alert base message for out of memory monitor" + type = bool + default = false +} + ######################################## # Iterator Age monitor ######################################## @@ -178,6 +202,12 @@ variable "iterator_age_threshold_warning" { type = number } +variable "iterator_age_use_message" { + description = "Whether to use the query alert base message for iterator age monitor" + type = bool + default = false +} + ######################################## # Iterator Age forecast data loss ######################################## @@ -199,11 +229,17 @@ variable "iterator_age_forecast_no_data_window" { type = number } +variable "iterator_age_forecast_use_message" { + description = "Whether to use the query alert base message for iterator age forecast monitor" + type = bool + default = false +} + ######################################## # Lambda throttle rate ######################################## variable "throttle_rate_enabled" { - default = false + default = true description = "Enable Lambda throttle rate monitor" type = bool } @@ -231,3 +267,9 @@ variable 
"throttle_rate_threshold_warning" { description = "Warning threshold (percentage, 0-100)" type = number } + +variable "throttle_rate_use_message" { + description = "Whether to use the query alert base message for throttle rate monitor" + type = bool + default = false +} diff --git a/aws/rds/README.md b/aws/rds/README.md index cc05203..130995c 100644 --- a/aws/rds/README.md +++ b/aws/rds/README.md @@ -21,7 +21,7 @@ Configures the following for RDS databases based on tag matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.37.0 | ## Modules @@ -31,10 +31,10 @@ No modules. | Name | Type | |------|------| -| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.connection_count_anomaly](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.cpu_utilization](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.cpu_utilization_anomaly](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.used_storage](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | ## Inputs @@ -44,37 +44,49 @@ No modules. 
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | -| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
`["resource:alb"]`
| no | +| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
`["resource:rds"]`
| no | +| [connection\_count\_anomaly\_deviations](#input\_connection\_count\_anomaly\_deviations) | Standard deviations | `number` | `3` | no | +| [connection\_count\_anomaly\_enabled](#input\_connection\_count\_anomaly\_enabled) | Enable CPU utilization anomaly monitor | `bool` | `true` | no | +| [connection\_count\_anomaly\_evaluation\_window](#input\_connection\_count\_anomaly\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [connection\_count\_anomaly\_no\_data\_window](#input\_connection\_count\_anomaly\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [connection\_count\_anomaly\_recovery\_window](#input\_connection\_count\_anomaly\_recovery\_window) | Recovery window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | +| [connection\_count\_anomaly\_rollup](#input\_connection\_count\_anomaly\_rollup) | Rollup interval (must be sized based on evaluation window/span and seasonaility) | `number` | `60` | no | +| [connection\_count\_anomaly\_seasonality](#input\_connection\_count\_anomaly\_seasonality) | Seasonaility (hourly, daily, weekly) | `string` | `"weekly"` | no | +| [connection\_count\_anomaly\_threshold\_critical](#input\_connection\_count\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `0.75` | no | +| [connection\_count\_anomaly\_threshold\_warning](#input\_connection\_count\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | +| [connection\_count\_anomaly\_trigger\_window](#input\_connection\_count\_anomaly\_trigger\_window) | Trigger window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [connection\_count\_anomaly\_use\_message](#input\_connection\_count\_anomaly\_use\_message) | 
Whether to use the query alert base message for connection count anomaly monitor | `bool` | `true` | no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | +| [cpu\_utilization\_anomaly\_deviations](#input\_cpu\_utilization\_anomaly\_deviations) | Standard deviations | `number` | `4` | no | +| [cpu\_utilization\_anomaly\_enabled](#input\_cpu\_utilization\_anomaly\_enabled) | Enable CPU utilization anomaly monitor | `bool` | `false` | no | +| [cpu\_utilization\_anomaly\_evaluation\_window](#input\_cpu\_utilization\_anomaly\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [cpu\_utilization\_anomaly\_no\_data\_window](#input\_cpu\_utilization\_anomaly\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [cpu\_utilization\_anomaly\_recovery\_window](#input\_cpu\_utilization\_anomaly\_recovery\_window) | Recovery window for anomaly monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | +| [cpu\_utilization\_anomaly\_rollup](#input\_cpu\_utilization\_anomaly\_rollup) | Rollup interval (must be sized based on evaluation window/span and seasonaility) | `number` | `60` | no | +| [cpu\_utilization\_anomaly\_seasonality](#input\_cpu\_utilization\_anomaly\_seasonality) | Seasonaility (hourly, daily, weekly) | `string` | `"weekly"` | no | +| [cpu\_utilization\_anomaly\_threshold\_critical](#input\_cpu\_utilization\_anomaly\_threshold\_critical) | Critical threshold (percent) | `number` | `null` | no | +| [cpu\_utilization\_anomaly\_threshold\_warning](#input\_cpu\_utilization\_anomaly\_threshold\_warning) | Warning threshold (percent) | `number` | `null` | no | +| [cpu\_utilization\_anomaly\_trigger\_window](#input\_cpu\_utilization\_anomaly\_trigger\_window) | Trigger window for anomaly 
monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_1h"` | no | +| [cpu\_utilization\_anomaly\_use\_message](#input\_cpu\_utilization\_anomaly\_use\_message) | Whether to use the query alert base message for CPU utilization anomaly monitor | `bool` | `false` | no | +| [cpu\_utilization\_enabled](#input\_cpu\_utilization\_enabled) | Enable CPU utilization monitor | `bool` | `true` | no | +| [cpu\_utilization\_evaluation\_window](#input\_cpu\_utilization\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [cpu\_utilization\_no\_data\_window](#input\_cpu\_utilization\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [cpu\_utilization\_threshold\_critical](#input\_cpu\_utilization\_threshold\_critical) | Critical threshold (percent) | `number` | `90` | no | +| [cpu\_utilization\_threshold\_warning](#input\_cpu\_utilization\_threshold\_warning) | Warning threshold (percent) | `number` | `80` | no | +| [cpu\_utilization\_use\_message](#input\_cpu\_utilization\_use\_message) | Whether to use the query alert base message for CPU utilization monitor | `bool` | `false` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | -| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | -| 
[http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no | -| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | -| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) 
| No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | -| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | -| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no | -| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | 
`list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | @@ -84,6 +96,12 @@ No modules. 
| [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no | | [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no | | [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parenthesis | `string` | `null` | no | +| [used\_storage\_enabled](#input\_used\_storage\_enabled) | Enable used storage monitor | `bool` | `true` | no | +| [used\_storage\_evaluation\_window](#input\_used\_storage\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no | +| [used\_storage\_no\_data\_window](#input\_used\_storage\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [used\_storage\_threshold\_critical](#input\_used\_storage\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `90` | no | +| [used\_storage\_threshold\_warning](#input\_used\_storage\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `80` | no | +| [used\_storage\_use\_message](#input\_used\_storage\_use\_message) | Whether to use the query alert base message for used storage monitor | `bool` | `true` | no | | [warn\_priority](#input\_warn\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | ## Outputs diff --git a/aws/rds/main.tf b/aws/rds/main.tf index bbb3292..c64956c 100644 --- a/aws/rds/main.tf +++ b/aws/rds/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? 
"" : " (${var.title_suffix})" } @@ -12,8 +12,8 @@ resource "datadog_monitor" "connection_count_anomaly" { count = var.connection_count_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "RDS connection count anomalous activity - {{dbinstanceidentifier.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.connection_count_anomaly_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,7 +27,7 @@ resource "datadog_monitor" "connection_count_anomaly" { query = <= ${var.connection_count_anomaly_threshold_critical} @@ -48,8 +48,8 @@ resource "datadog_monitor" "cpu_utilization" { count = var.cpu_utilization_enabled ? 1 : 0 name = join("", [local.title_prefix, "RDS CPU Utilization - {{dbinstanceidentifier.name}} - {{value}}%", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -63,7 +63,7 @@ resource "datadog_monitor" "cpu_utilization" { query = <= ${var.cpu_utilization_threshold_critical} END @@ -77,8 +77,8 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { count = var.cpu_utilization_anomaly_enabled ? 1 : 0 name = join("", [local.title_prefix, "RDS CPU utilization anomalous activity - {{dbinstanceidentifier.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.cpu_utilization_anomaly_use_message ? 
local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -92,7 +92,7 @@ resource "datadog_monitor" "cpu_utilization_anomaly" { query = <= ${var.cpu_utilization_anomaly_threshold_critical} @@ -113,8 +113,8 @@ resource "datadog_monitor" "used_storage" { count = var.used_storage_enabled ? 1 : 0 name = join("", [local.title_prefix, "RDS instance storage - {{dbinstanceidentifier.name}} - {{value}}% used", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.used_storage_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -129,8 +129,8 @@ resource "datadog_monitor" "used_storage" { query = <= ${var.used_storage_threshold_critical} END diff --git a/aws/rds/variables.tf b/aws/rds/variables.tf index 6e74aa4..64f2191 100644 --- a/aws/rds/variables.tf +++ b/aws/rds/variables.tf @@ -17,7 +17,7 @@ variable "base_tags" { # Connection Rate (anomaly detection) ######################################## variable "connection_count_anomaly_enabled" { - default = false + default = true description = "Enable CPU utilization anomaly monitor" type = bool } @@ -65,7 +65,7 @@ variable "connection_count_anomaly_trigger_window" { } variable "connection_count_anomaly_threshold_critical" { - default = null + default = 0.75 description = "Critical threshold (percent)" type = number } @@ -76,11 +76,17 @@ variable "connection_count_anomaly_threshold_warning" { type = number } +variable "connection_count_anomaly_use_message" { + description = "Whether to use the query alert base message for connection count anomaly monitor" + type = bool + default = true +} + ######################################## # Node CPU Utilization ######################################## variable "cpu_utilization_enabled" { - default = false + default = true description = "Enable CPU 
utilization monitor" type = bool } @@ -109,6 +115,12 @@ variable "cpu_utilization_threshold_warning" { type = number } +variable "cpu_utilization_use_message" { + description = "Whether to use the query alert base message for CPU utilization monitor" + type = bool + default = false +} + ######################################## # CPU Utilization (anomaly detection) ######################################## @@ -172,6 +184,12 @@ variable "cpu_utilization_anomaly_threshold_warning" { type = number } +variable "cpu_utilization_anomaly_use_message" { + description = "Whether to use the query alert base message for CPU utilization anomaly monitor" + type = bool + default = false +} + ######################################## # ElasticSearch cluster used storage ######################################## @@ -204,3 +222,9 @@ variable "used_storage_threshold_warning" { description = "Warning threshold (percentage, 0-100)" type = number } + +variable "used_storage_use_message" { + description = "Whether to use the query alert base message for used storage monitor" + type = bool + default = true +} diff --git a/aws/sqs/.terraform.lock.hcl b/aws/sqs/.terraform.lock.hcl index 5fa8913..f4429ee 100644 --- a/aws/sqs/.terraform.lock.hcl +++ b/aws/sqs/.terraform.lock.hcl @@ -5,6 +5,7 @@ provider "registry.terraform.io/datadog/datadog" { version = "3.44.0" constraints = ">= 3.37.0" hashes = [ + "h1:gapxzCRcnTGm4HLO1zuoelGC15+0LEYceGNWGh69JLE=", "h1:neJ/si/8CotiW8ulfjU6dFmb1bpzbTjhfHLTlCvdynw=", "zh:12119fe0cafbe7e05c32d4101a804d479ae756e19512c789c67cb3c51420ac98", "zh:35267ecc27de00e449893df9a37481f38b8fe24d14fe94198cd68966f1aa586f", @@ -27,6 +28,7 @@ provider "registry.terraform.io/hashicorp/null" { version = "3.2.2" constraints = ">= 3.1.0" hashes = [ + "h1:IMVAUHKoydFrlPrl9OzasDnw/8ntZFerCC9iXw1rXQY=", "h1:vWAsYRd7MjYr3adj8BVKRohVfHpWQdvkIwUQ2Jf5FVM=", "zh:3248aae6a2198f3ec8394218d05bd5e42be59f43a3a7c0b71c66ec0df08b69e7", 
"zh:32b1aaa1c3013d33c245493f4a65465eab9436b454d250102729321a44c8ab9a", diff --git a/aws/sqs/README.md b/aws/sqs/README.md index 78b8d6e..2d27fa4 100644 --- a/aws/sqs/README.md +++ b/aws/sqs/README.md @@ -18,7 +18,7 @@ Configures the following for Lambda functions based on tag matches: | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.44.0 | ## Modules @@ -28,10 +28,8 @@ No modules. | Name | Type | |------|------| -| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.oldest_message](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.queue_depth](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | ## Inputs @@ -41,39 +39,35 @@ No modules. 
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | -| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
`["resource:alb"]`
| no | +| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
`["resource:queue"]`
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | -| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | -| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no | -| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| 
[http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | -| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | -| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | -| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no | -| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod 
alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [oldest\_message\_enabled](#input\_oldest\_message\_enabled) | Enable oldest queued message monitor | `bool` | `false` | no | +| [oldest\_message\_evaluation\_window](#input\_oldest\_message\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [oldest\_message\_no\_data\_window](#input\_oldest\_message\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [oldest\_message\_threshold\_critical](#input\_oldest\_message\_threshold\_critical) | Critical threshold (seconds) | `number` | `75` | no | +| [oldest\_message\_threshold\_warning](#input\_oldest\_message\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | +| [oldest\_message\_use\_message](#input\_oldest\_message\_use\_message) | Whether to use the query alert base message for oldest message monitor | `bool` | `false` | no | +| [queue\_depth\_enabled](#input\_queue\_depth\_enabled) | Enable queue depth count monitor | `bool` | `false` | no | +| [queue\_depth\_evaluation\_window](#input\_queue\_depth\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [queue\_depth\_no\_data\_window](#input\_queue\_depth\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | +| [queue\_depth\_threshold\_critical](#input\_queue\_depth\_threshold\_critical) | Critical threshold 
(count) | `number` | `null` | no | +| [queue\_depth\_threshold\_warning](#input\_queue\_depth\_threshold\_warning) | Warning threshold (count) | `number` | `null` | no | +| [queue\_depth\_use\_message](#input\_queue\_depth\_use\_message) | Whether to use the query alert base message for queue depth monitor | `bool` | `false` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | | [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no | | [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no | diff --git a/aws/sqs/main.tf b/aws/sqs/main.tf index edbfc91..6c98447 100644 --- a/aws/sqs/main.tf +++ b/aws/sqs/main.tf @@ -4,7 +4,7 @@ locals { monitor_warn_default_priority = null monitor_nodata_default_priority = null - title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}" + title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]" title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})" } @@ -12,8 +12,8 @@ resource "datadog_monitor" "oldest_message" { count = var.oldest_message_enabled ? 1 : 0 name = join("", [local.title_prefix, "Oldest queued message - {{queuename.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.oldest_message_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,7 +27,7 @@ resource "datadog_monitor" "oldest_message" { query = < ${var.oldest_message_threshold_critical} END @@ -41,8 +41,8 @@ resource "datadog_monitor" "queue_depth" { count = var.queue_depth_enabled ? 
1 : 0 name = join("", [local.title_prefix, "Queue depth - {{queuename.name}}", local.title_suffix]) - include_tags = true - message = local.query_alert_base_message + include_tags = false + message = var.queue_depth_use_message ? local.query_alert_base_message : "" tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -56,7 +56,7 @@ resource "datadog_monitor" "queue_depth" { query = < ${var.queue_depth_threshold_critical} END diff --git a/aws/sqs/variables.tf b/aws/sqs/variables.tf index 0a4b1c5..4bb5de0 100644 --- a/aws/sqs/variables.tf +++ b/aws/sqs/variables.tf @@ -46,6 +46,12 @@ variable "oldest_message_threshold_warning" { type = number } +variable "oldest_message_use_message" { + description = "Whether to use the query alert base message for oldest message monitor" + type = bool + default = false +} + ######################################## # Lambda queue_depth ######################################## @@ -78,3 +84,9 @@ variable "queue_depth_threshold_warning" { description = "Warning threshold (count)" type = number } + +variable "queue_depth_use_message" { + description = "Whether to use the query alert base message for queue depth monitor" + type = bool + default = false +} diff --git a/aws/vpn/.terraform.lock.hcl b/aws/vpn/.terraform.lock.hcl index 5fa8913..f4429ee 100644 --- a/aws/vpn/.terraform.lock.hcl +++ b/aws/vpn/.terraform.lock.hcl @@ -5,6 +5,7 @@ provider "registry.terraform.io/datadog/datadog" { version = "3.44.0" constraints = ">= 3.37.0" hashes = [ + "h1:gapxzCRcnTGm4HLO1zuoelGC15+0LEYceGNWGh69JLE=", "h1:neJ/si/8CotiW8ulfjU6dFmb1bpzbTjhfHLTlCvdynw=", "zh:12119fe0cafbe7e05c32d4101a804d479ae756e19512c789c67cb3c51420ac98", "zh:35267ecc27de00e449893df9a37481f38b8fe24d14fe94198cd68966f1aa586f", @@ -27,6 +28,7 @@ provider "registry.terraform.io/hashicorp/null" { version = "3.2.2" constraints = ">= 3.1.0" hashes = [ + "h1:IMVAUHKoydFrlPrl9OzasDnw/8ntZFerCC9iXw1rXQY=", 
"h1:vWAsYRd7MjYr3adj8BVKRohVfHpWQdvkIwUQ2Jf5FVM=", "zh:3248aae6a2198f3ec8394218d05bd5e42be59f43a3a7c0b71c66ec0df08b69e7", "zh:32b1aaa1c3013d33c245493f4a65465eab9436b454d250102729321a44c8ab9a", diff --git a/aws/vpn/README.md b/aws/vpn/README.md index 06a3bb5..662a44a 100644 --- a/aws/vpn/README.md +++ b/aws/vpn/README.md @@ -15,7 +15,7 @@ Configures up/down monitoring for VPN tunnels | Name | Version | |------|---------| -| [datadog](#provider\_datadog) | >= 3.37 | +| [datadog](#provider\_datadog) | 3.44.0 | ## Modules @@ -25,10 +25,7 @@ No modules. | Name | Type | |------|------| -| [datadog_monitor.http_5xx_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.http_5xx_tg_responses](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.latency](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | -| [datadog_monitor.no_healthy_instances](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | +| [datadog_monitor.tunnel_state](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource | ## Inputs @@ -38,37 +35,21 @@ No modules. 
| [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | | [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no | | [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | -| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
`["resource:alb"]`
| no | +| [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` |
`["resource:vpn"]`
| no | | [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no | | [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no | -| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes | +| [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no | | [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no | -| [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no | -| [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no | -| [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| 
[http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no | -| [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no | -| [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no | -| [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no | -| [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no | | [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no | | [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. 
Specify in key:value format | `list(string)` | `[]` | no | | [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no | -| [no\_healthy\_instances\_enabled](#input\_no\_healthy\_instances\_enabled) | Enable no healthy instances monitor | `bool` | `true` | no | -| [no\_healthy\_instances\_evaluation\_window](#input\_no\_healthy\_instances\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | -| [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | -| [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage, 0 to disable) | `number` | `0` | no | | [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes | | [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no | | [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | +| [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod 
alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no | | [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no | @@ -78,6 +59,9 @@ No modules. | [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no | | [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no | | [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parenthesis | `string` | `null` | no | +| [tunnel\_state\_enabled](#input\_tunnel\_state\_enabled) | Enable VPN tunnel state monitor | `bool` | `false` | no | +| [tunnel\_state\_evaluation\_window](#input\_tunnel\_state\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no | +| [tunnel\_state\_no\_data\_window](#input\_tunnel\_state\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no | | [warn\_priority](#input\_warn\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no | ## Outputs diff --git a/aws/vpn/main.tf b/aws/vpn/main.tf index 304e91b..bd4df6a 100644 --- a/aws/vpn/main.tf +++ b/aws/vpn/main.tf @@ -12,7 +12,7 @@ resource "datadog_monitor" "tunnel_state" { count = var.tunnel_state_enabled ? 
1 : 0 name = join("", [local.title_prefix, "VPN tunnel state - {{host.name}}", local.title_suffix]) - include_tags = true + include_tags = false message = local.query_alert_base_message tags = concat(local.common_tags, var.base_tags, var.additional_tags) type = "query alert" @@ -27,7 +27,7 @@ resource "datadog_monitor" "tunnel_state" { query = <