Commit b2c497c

Merge pull request #18 from rhythmictech/ENG-4479
make the monitor queries grouping variable
2 parents 25e5f26 + a38fa1c commit b2c497c

19 files changed, 102 additions and 62 deletions

common/common.tf

Lines changed: 9 additions & 0 deletions
@@ -179,6 +179,12 @@ variable "notify_prod_override" {
 type = list(string)
 }

+variable "group_by" {
+default = ["name", "aws_account", "env", "datadog_managed"]
+description = "List of tags to group by"
+type = list(string)
+}
+
 locals {

 # tag related locals
@@ -336,4 +342,7 @@ ${local.alert_context}
 {{#is_alert}} ${local.notify_on_alert} {{/is_alert}}
 {{#is_recovery}} ${local.notify_on_recovery} {{/is_recovery}}
 END
+
+service_group_by = join(",", formatlist("\"%s\"", var.group_by))
+query_group_by = join(",", var.group_by)
 }
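
For reviewers who want to see what the two new locals evaluate to, here is a minimal standalone sketch. It copies the variable and the two expressions from the hunk above; the `output` blocks and their names are added purely for illustration and are not part of the commit. The comments show the values expected with the default `group_by`.

```hcl
# Standalone sketch: place in an empty directory, then run
# `terraform init` and `terraform apply` to print the outputs.

variable "group_by" {
  default     = ["name", "aws_account", "env", "datadog_managed"]
  description = "List of tags to group by"
  type        = list(string)
}

locals {
  # Each tag wrapped in double quotes, then comma-joined.
  # This is the form interpolated into the service check .by(...) clause.
  service_group_by = join(",", formatlist("\"%s\"", var.group_by))

  # Bare comma-joined tags, with no quoting.
  query_group_by = join(",", var.group_by)
}

# Illustrative outputs (hypothetical names, not in the commit).
output "service_group_by" {
  # "name","aws_account","env","datadog_managed"
  value = local.service_group_by
}

output "query_group_by" {
  # name,aws_account,env,datadog_managed
  value = local.query_group_by
}
```

With the default list, `service_group_by` reproduces the string that was previously hard-coded in the service check queries, so the default behavior should be unchanged. `query_group_by` is not referenced in the files shown here; it is presumably consumed by other files changed in this commit.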

host/agent/README.md

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -35,22 +35,27 @@ No modules.
3535
| <a name="input_alert_critical_priority"></a> [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
3636
| <a name="input_alert_message"></a> [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
3737
| <a name="input_alert_nodata_priority"></a> [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
38-
| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br> "resource:apigateway"<br>]</pre> | no |
38+
| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br/> "resource:ec2"<br/>]</pre> | no |
3939
| <a name="input_cost_center"></a> [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
4040
| <a name="input_dashboard_link"></a> [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
41-
| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
41+
| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
4242
| <a name="input_evaluation_delay"></a> [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
43+
| <a name="input_group_by"></a> [group\_by](#input\_group\_by) | List of tags to group by | `list(string)` | <pre>[<br/> "name",<br/> "aws_account",<br/> "env",<br/> "datadog_managed"<br/>]</pre> | no |
4344
| <a name="input_host_unreachable_enabled"></a> [host\_unreachable\_enabled](#input\_host\_unreachable\_enabled) | Flag to enable Host unreachable monitor | `bool` | `true` | no |
45+
| <a name="input_host_unreachable_use_message"></a> [host\_unreachable\_use\_message](#input\_host\_unreachable\_use\_message) | Flag to enable Host unreachable alerting | `bool` | `true` | no |
4446
| <a name="input_monitor_exclude_tags"></a> [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
4547
| <a name="input_monitor_include_tags"></a> [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
4648
| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
4749
| <a name="input_notify_alert_override"></a> [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
50+
| <a name="input_notify_crit_override"></a> [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
4851
| <a name="input_notify_default"></a> [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
4952
| <a name="input_notify_no_data"></a> [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
5053
| <a name="input_notify_nodata_override"></a> [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
54+
| <a name="input_notify_nonprod_override"></a> [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
55+
| <a name="input_notify_prod_override"></a> [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
5156
| <a name="input_notify_recovery_override"></a> [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
5257
| <a name="input_notify_warn_override"></a> [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
53-
| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
58+
| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `60` | no |
5459
| <a name="input_runbook_link"></a> [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
5560
| <a name="input_service"></a> [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
5661
| <a name="input_team"></a> [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |

host/agent/main.tf

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -11,22 +11,22 @@ locals {
1111
resource "datadog_monitor" "host_unreachable" {
1212
count = var.host_unreachable_enabled ? 1 : 0
1313

14-
name = join("", [local.title_prefix, "Datadog Agent Status - {{name.name}}", local.title_suffix])
14+
name = join("", [local.title_prefix, "Datadog Agent Status - {{name.name}}", local.title_suffix])
1515
include_tags = false
1616
message = var.host_unreachable_use_message ? local.query_alert_base_message : ""
17-
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
18-
type = "service check"
17+
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
18+
type = "service check"
1919

2020
evaluation_delay = var.evaluation_delay
2121
new_group_delay = var.new_group_delay
22-
no_data_timeframe = "5"
22+
no_data_timeframe = "5"
2323
notify_no_data = true
2424
renotify_interval = var.renotify_interval
2525
require_full_window = true
2626
timeout_h = var.timeout_h
2727

2828
query = <<EOQ
29-
"datadog.agent.up"${local.service_filter}.by("name","aws_account","env","datadog_managed").last(2).count_by_status()
29+
"datadog.agent.up"${local.service_filter}.by(${local.service_group_by}).last(2).count_by_status()
3030
EOQ
3131

3232
monitor_thresholds {

host/clock/README.md

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -35,24 +35,29 @@ No modules.
3535
| <a name="input_alert_critical_priority"></a> [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
3636
| <a name="input_alert_message"></a> [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
3737
| <a name="input_alert_nodata_priority"></a> [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
38-
| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br> "resource:apigateway"<br>]</pre> | no |
38+
| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br/> "resource:ec2"<br/>]</pre> | no |
3939
| <a name="input_cost_center"></a> [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
4040
| <a name="input_dashboard_link"></a> [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
41-
| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
41+
| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
4242
| <a name="input_evaluation_delay"></a> [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
43+
| <a name="input_group_by"></a> [group\_by](#input\_group\_by) | List of tags to group by | `list(string)` | <pre>[<br/> "name",<br/> "aws_account",<br/> "env",<br/> "datadog_managed"<br/>]</pre> | no |
4344
| <a name="input_monitor_exclude_tags"></a> [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
4445
| <a name="input_monitor_include_tags"></a> [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
4546
| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
4647
| <a name="input_notify_alert_override"></a> [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
48+
| <a name="input_notify_crit_override"></a> [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
4749
| <a name="input_notify_default"></a> [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
4850
| <a name="input_notify_no_data"></a> [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
4951
| <a name="input_notify_nodata_override"></a> [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
52+
| <a name="input_notify_nonprod_override"></a> [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
53+
| <a name="input_notify_prod_override"></a> [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
5054
| <a name="input_notify_recovery_override"></a> [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
5155
| <a name="input_notify_warn_override"></a> [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
52-
| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
56+
| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `60` | no |
5357
| <a name="input_runbook_link"></a> [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
5458
| <a name="input_service"></a> [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
5559
| <a name="input_system_clock_enabled"></a> [system\_clock\_enabled](#input\_system\_clock\_enabled) | Flag to enable Host unreachable monitor | `bool` | `true` | no |
60+
| <a name="input_system_clock_use_message"></a> [system\_clock\_use\_message](#input\_system\_clock\_use\_message) | Flag to enable Host unreachable alerting | `bool` | `false` | no |
5661
| <a name="input_team"></a> [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
5762
| <a name="input_timeout_h"></a> [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
5863
| <a name="input_title_prefix"></a> [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |

host/clock/main.tf

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -11,11 +11,11 @@ locals {
1111
resource "datadog_monitor" "system_clock" {
1212
count = var.system_clock_enabled ? 1 : 0
1313

14-
name = join("", [local.title_prefix, "System Clock - {{name.name}}", local.title_suffix])
14+
name = join("", [local.title_prefix, "System Clock - {{name.name}}", local.title_suffix])
1515
include_tags = false
1616
message = var.system_clock_use_message ? local.query_alert_base_message : ""
17-
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
18-
type = "service check"
17+
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
18+
type = "service check"
1919

2020
evaluation_delay = var.evaluation_delay
2121
new_group_delay = var.new_group_delay
@@ -25,7 +25,7 @@ resource "datadog_monitor" "system_clock" {
2525
timeout_h = var.timeout_h
2626

2727
query = <<EOQ
28-
"ntp.in_sync"${local.service_filter}.by("name","aws_account","env","datadog_managed").last(6).count_by_status()
28+
"ntp.in_sync"${local.service_filter}.by(${local.service_group_by}).last(6).count_by_status()
2929
EOQ
3030

3131
monitor_thresholds {
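
As a usage note, a caller could override the grouping roughly as sketched below. The module label, the `source` path, and the notification handle are placeholders assumed for illustration; only `group_by` and `notify_default` are taken from the README tables above, and any other required inputs not visible in this diff would still need to be set.

```hcl
# Hypothetical caller configuration; adjust source to wherever the
# repository is vendored or referenced in your setup.
module "host_clock_monitors" {
  source = "./host/clock" # assumed relative path, not confirmed by the commit

  # Required input per the README table; the handle is a placeholder.
  notify_default = ["@example-oncall"]

  # Override the default grouping. "name" is kept in the list because
  # the monitor title in main.tf references {{name.name}}.
  group_by = ["name", "env"]
}
```

With this override the service check query would be rendered with `.by("name","env")` in place of the four default tags. Dropping `name` from the list would likely leave the `{{name.name}}` title placeholder empty, so it is probably worth keeping unless the titles are changed as well.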
