Commit 1dbb8a8

Merge pull request #15 from rhythmictech/ENG-4060
Added systemd unit monitor
2 parents 101399d + b67838e commit 1dbb8a8

6 files changed, with 146 additions and 1 deletion

host/systemd/README.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+<!-- BEGIN_TF_DOCS -->
+## Requirements
+
+| Name | Version |
+|------|---------|
+| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | ~> 1.5 |
+| <a name="requirement_datadog"></a> [datadog](#requirement\_datadog) | >= 3.37 |
+| <a name="requirement_null"></a> [null](#requirement\_null) | >= 3.1.0 |
+
+## Providers
+
+| Name | Version |
+|------|---------|
+| <a name="provider_datadog"></a> [datadog](#provider\_datadog) | >= 3.37 |
+
+## Modules
+
+No modules.
+
+## Resources
+
+| Name | Type |
+|------|------|
+| [datadog_monitor.systemd_unit](https://registry.terraform.io/providers/datadog/datadog/latest/docs/resources/monitor) | resource |
+
+## Inputs
+
+| Name | Description | Type | Default | Required |
+|------|-------------|------|---------|:--------:|
+| <a name="input_additional_tags"></a> [additional\_tags](#input\_additional\_tags) | Additional tags to apply to all monitors | `list(string)` | `[]` | no |
+| <a name="input_alert_critical_priority"></a> [alert\_critical\_priority](#input\_alert\_critical\_priority) | Priority for alerts within critical threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
+| <a name="input_alert_message"></a> [alert\_message](#input\_alert\_message) | Message to prepend to alert notifications | `string` | `"Alert"` | no |
+| <a name="input_alert_nodata_priority"></a> [alert\_nodata\_priority](#input\_alert\_nodata\_priority) | Priority for alerts with no data (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
+| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags to apply to all monitors | `list(string)` | `[]` | no |
+| <a name="input_cost_center"></a> [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
+| <a name="input_dashboard_link"></a> [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
+| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
+| <a name="input_evaluation_delay"></a> [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [Datadog Docs](https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions)) | `number` | `900` | no |
+| <a name="input_monitor_exclude_tags"></a> [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
+| <a name="input_monitor_include_tags"></a> [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
+| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
+| <a name="input_notify_alert_override"></a> [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_crit_override"></a> [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_default"></a> [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
+| <a name="input_notify_no_data"></a> [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
+| <a name="input_notify_nodata_override"></a> [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_nonprod_override"></a> [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_prod_override"></a> [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_recovery_override"></a> [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_warn_override"></a> [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `60` | no |
+| <a name="input_runbook_link"></a> [runbook\_link](#input\_runbook\_link) | Runbook link to include in message | `string` | `null` | no |
+| <a name="input_service"></a> [service](#input\_service) | Service associated with the monitored resource (leave blank to omit tag) | `string` | `null` | no |
+| <a name="input_systemd_unit_alert_enabled"></a> [systemd\_unit\_alert\_enabled](#input\_systemd\_unit\_alert\_enabled) | Enable or disable the Systemd service alert monitor | `bool` | `true` | no |
+| <a name="input_systemd_unit_alert_threshold_critical"></a> [systemd\_unit\_alert\_threshold\_critical](#input\_systemd\_unit\_alert\_threshold\_critical) | Critical threshold for the Systemd service alert (count of services not running/failed) | `number` | `2` | no |
+| <a name="input_systemd_unit_alert_threshold_warning"></a> [systemd\_unit\_alert\_threshold\_warning](#input\_systemd\_unit\_alert\_threshold\_warning) | Warning threshold for the Systemd service alert (count of services not running/failed) | `number` | `1` | no |
+| <a name="input_systemd_unit_alert_use_message"></a> [systemd\_unit\_alert\_use\_message](#input\_systemd\_unit\_alert\_use\_message) | Whether to use the base message for the Systemd service alert | `bool` | `true` | no |
+| <a name="input_systemd_units_filter"></a> [systemd\_units\_filter](#input\_systemd\_units\_filter) | List of specific systemd units (services) to monitor. If empty, monitors all. | `list(string)` | `[]` | no |
+| <a name="input_team"></a> [team](#input\_team) | Team supporting the monitored resource (leave blank to omit tag) | `string` | `null` | no |
+| <a name="input_timeout_h"></a> [timeout\_h](#input\_timeout\_h) | Auto-resolve alert in specified hours if condition no longer matches | `number` | `0` | no |
+| <a name="input_title_prefix"></a> [title\_prefix](#input\_title\_prefix) | Prefix all alerts with specified value in brackets | `string` | `null` | no |
+| <a name="input_title_suffix"></a> [title\_suffix](#input\_title\_suffix) | Suffix all alerts with specified value in parentheses | `string` | `null` | no |
+| <a name="input_warn_priority"></a> [warn\_priority](#input\_warn\_priority) | Priority for alerts within warning threshold (P1-P5, uses monitor defaults if not specified) | `string` | `null` | no |
+
+## Outputs
+
+No outputs.
+<!-- END_TF_DOCS -->
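
For orientation, the inputs documented in this README could be supplied from a calling configuration roughly as sketched here. This is not part of the commit: the module label, the relative `source` path, the notification handle, the tag values, and the unit names (including their exact expected format) are all illustrative assumptions. Only the input names and their defaults are taken from the table.

```hcl
# Hypothetical caller of the host/systemd module (a sketch, not from the commit).
module "systemd_unit_monitor" {
  # Assumed path; adjust to wherever this repository is vendored or referenced.
  source = "./host/systemd"

  # The only required input per the README. The handle is a placeholder.
  notify_default = ["@example-oncall"]

  # Optional: restrict monitoring to specific units. An empty list monitors all.
  # The unit name format shown here is an assumption.
  systemd_units_filter = ["nginx.service", "sshd.service"]

  # These match the documented defaults and are spelled out only for clarity.
  systemd_unit_alert_threshold_warning  = 1
  systemd_unit_alert_threshold_critical = 2

  # Optional tagging inputs; leaving them null omits the corresponding tag.
  env     = "prod"
  service = "web"
}
```

Because `count` in `main.tf` is driven by `systemd_unit_alert_enabled`, setting that input to `false` in the same block would create no monitor at all.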

host/systemd/common.tf

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../../common/common.tf

host/systemd/main.tf

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+locals {
+  monitor_alert_default_priority  = null
+  monitor_warn_default_priority   = null
+  monitor_nodata_default_priority = null
+
+  title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
+  title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
+}
+
+resource "datadog_monitor" "systemd_unit" {
+  count = var.systemd_unit_alert_enabled ? 1 : 0
+
+  name    = join("", [local.title_prefix, "Systemd Unit Status - {{host.name}}", local.title_suffix])
+  type    = "service check"
+  message = var.systemd_unit_alert_use_message ? local.query_alert_base_message : ""
+  tags    = concat(local.common_tags, var.base_tags, var.additional_tags)
+
+  evaluation_delay    = var.evaluation_delay
+  notify_no_data      = false
+  notify_audit        = false
+  renotify_interval   = 60
+  timeout_h           = var.timeout_h
+  include_tags        = false
+  require_full_window = false
+
+  query = <<EOT
+"systemd.unit.state"${local.service_filter}.by("host","unit").last(3).count_by_status()
+EOT
+
+  monitor_thresholds {
+    critical = var.systemd_unit_alert_threshold_critical
+    warning  = var.systemd_unit_alert_threshold_warning
+  }
+}

host/systemd/variables.tf

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+variable "systemd_unit_alert_enabled" {
+  description = "Enable or disable the Systemd service alert monitor"
+  type        = bool
+  default     = true
+}
+
+variable "systemd_unit_alert_use_message" {
+  description = "Whether to use the base message for the Systemd service alert"
+  type        = bool
+  default     = true
+}
+
+variable "systemd_unit_alert_threshold_critical" {
+  description = "Critical threshold for the Systemd service alert (count of services not running/failed)"
+  type        = number
+  default     = 2
+}
+
+variable "systemd_unit_alert_threshold_warning" {
+  description = "Warning threshold for the Systemd service alert (count of services not running/failed)"
+  type        = number
+  default     = 1
+}
+
+variable "systemd_units_filter" {
+  description = "List of specific systemd units (services) to monitor. If empty, monitors all."
+  type        = list(string)
+  default     = []
+}
+
+variable "base_tags" {
+  description = "Base tags to apply to all monitors"
+  type        = list(string)
+  default     = []
+}
+
+variable "additional_tags" {
+  description = "Additional tags to apply to all monitors"
+  type        = list(string)
+  default     = []
+}

host/systemd/versions.tf

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../../common/versions.tf

host/windows/main.tf

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ resource "datadog_monitor" "windows_service" {
 
   evaluation_delay = var.evaluation_delay
   notify_no_data = false
-  renotify_interval = 0
+  renotify_interval = 60
   notify_audit = false
   timeout_h = var.timeout_h
   include_tags = false
