Commit e240605

Add optional prometheus/grafana monitoring stack (#74)
1 parent 7dc09f7 commit e240605

File tree

8 files changed

+248
-83
lines changed

.vault-pass.sh

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 # This script retrieves the password for the current git repository from a Bitwarden Vault.
 # If bw is not available, it should error out
 set -e
-bw get password "$(git remote get-url origin | awk '{split($0, a, "/"); print a[length(a)]}')"
+bw get password "$(git remote get-url origin | awk '{split($0, a, "/"); print a[length(a)]}')" || { read -sp "Password or Bitwarden CLI not found. Enter vault password: " password; echo "$password"; }
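
The changed line looks up a Bitwarden item named after the last path segment of the `origin` remote URL, and falls back to an interactive prompt if `bw` fails. The name-extraction step can be tried in isolation with the sketch below. The function name `vault_item_name` and the example URL are hypothetical and not part of the commit; the `awk` expression is an equivalent that uses the return value of `split()` rather than `length(a)`.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): print the last "/"-separated
# segment of a remote URL, which the script uses as the Bitwarden item name.
vault_item_name() {
    # split() returns the number of fields, so a[n] is the final segment.
    printf '%s\n' "$1" | awk '{n = split($0, a, "/"); print a[n]}'
}

# Example with a made-up remote URL:
vault_item_name "https://github.com/example-org/my-deployment"
```

Note that a remote URL ending in `.git` yields a name that also ends in `.git`, so the Bitwarden item would need to be named accordingly for the lookup to succeed.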

README.md

Lines changed: 42 additions & 0 deletions
@@ -132,6 +132,9 @@ ungrouped:
132132
borg_encryption_passphrase: <the passphrase for the borg encryption>
133133
borg_remote_path: <the command to run borg on the repository (e.g., borg1 vs borg2)>
134134
borg_repository: <the path to the borg repository, either local or remote>
135+
prometheus_remote_write_url: <your_prometheus_instance_url, e.g., https://grafana.datalab.industries/prometheus/api/v1/write>
136+
prometheus_user: <your_prometheus_username>
137+
prometheus_password: <your_prometheus_password>
135138
```
136139
137140
where `<hostname>` and the various settings should be configured with your chosen
@@ -259,6 +262,15 @@ version to use the same name using:
259262
git remote set-url origin <my-git-repo-url>
260263
```
261264

265+
You can also simply edit the `.vault-pass.sh` script to return your vault
266+
password in another way if you prefer.
267+
268+
You may also need to set the script to be executable with
269+
270+
```shell
271+
chmod u+x .vault-pass.sh
272+
```
273+
262274
#### Backups
263275

264276
##### Native backups
@@ -297,6 +309,36 @@ to set up any extra settings (e.g., proxies, host checking), or an `.ssh/known_h
297309
> for your encrypted Borg backups, feel free to reach out to us as we may have
298310
> enough of an overhead to be a secondary backup host for you.
299311
312+
#### Server monitoring
313+
314+
One basic option for uptime monitoring is to use a free, GitHub Actions-based
315+
service like [Upptime](https://github.com/upptime/upptime).
316+
For example, this is used for simple services in the central *datalab* organisation at [datalab-org/datalab-org-status](https://github.com/datalab-org/datalab-org-status).
317+
318+
For more advanced monitoring, the Ansible playbooks contain a role tagged as
319+
`monitoring`, which will install and configure metrics harvesters using
320+
[Prometheus](https://prometheus.io/) (with [Node Exporter](https://github.com/prometheus/node_exporter) and
321+
[cAdvisor](https://github.com/google/cadvisor)) to monitor the host system and
322+
containers.
323+
324+
To make use of this monitoring, you will need your own [Grafana instance](https://grafana.com/oss/grafana) (also running Prometheus as a harvester of the remote metrics) to visualise the metrics.
325+
326+
Alternatively, you can use a hosted Grafana service such as [Grafana Cloud](https://grafana.com/products/cloud/), or request to use our central
327+
*datalab* Grafana instance by reaching out to us on Slack or over email.
328+
329+
This integration can be enabled by adding the following variables to your inventory:
330+
331+
```yaml
332+
prometheus_remote_write_url: <your_prometheus_instance_url, e.g., https://grafana.datalab.industries/prometheus/api/v1/write>
333+
prometheus_user: <your_prometheus_username>
334+
prometheus_password: <your_prometheus_password>
335+
```
336+
and then running the playbook with the `monitoring` tag:
337+
338+
```shell
339+
make monitoring
340+
```
341+
300342
### Cloud provisioning
301343

302344
These instructions will use OpenTofu, an open source fork of Terraform.

ansible/inventory.yml

Lines changed: 4 additions & 0 deletions
@@ -4,10 +4,14 @@ ungrouped:
44
<hostname>:
55
ansible_become_method: sudo
66
ansible_user: <remote_username>
7+
datalab_prefix: <desired_datalab_prefix (used for monitoring labels)>
78
api_url: <desired_datalab_api_url>
89
app_url: <desired_datalab_app_url>
910
mount_data_disk: <disk device file location, e.g., /dev/sda, /dev/sdb or otherwise>
1011
data_disk_type: <the fstype of the data disk, defaults to 'xfs'>
1112
borg_encryption_passphrase: <the passphrase for the borg encryption>
1213
borg_remote_path: <the command to run borg on the repository (e.g., borg1 vs borg2)>
1314
borg_repository: <the path to the borg repository, either local or remote>
15+
prometheus_remote_write_url: <your_prometheus_instance_url, e.g., https://grafana.datalab.industries/prometheus/api/v1/write>
16+
prometheus_user: <the username to access the prometheus server>
17+
prometheus_password: <the password to access the prometheus server>
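
The monitoring role in this commit only acts when `prometheus_remote_write_url`, `prometheus_user` and `prometheus_password` are all defined; otherwise its tasks are skipped. A minimal preflight sketch for spotting a missing variable before running the playbook is shown below. The function name `missing_monitoring_vars` is hypothetical, and the check assumes the variable names appear as plain-text keys in the inventory file (it does not parse YAML or look inside vault-encrypted files).

```shell
#!/bin/bash
# Hypothetical preflight helper (not part of the commit): print the name of
# each monitoring variable that does not appear as a key in the given file.
missing_monitoring_vars() {
    inventory_file="$1"
    for var in prometheus_remote_write_url prometheus_user prometheus_password; do
        # Match "<optional indentation><name>:" at the start of a line.
        if ! grep -Eq "^[[:space:]]*${var}:" "$inventory_file"; then
            printf '%s\n' "$var"
        fi
    done
}
```

Calling `missing_monitoring_vars ansible/inventory.yml` prints nothing when all three keys are present.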

ansible/playbook.yml

Lines changed: 3 additions & 0 deletions
@@ -29,6 +29,9 @@
2929
- role: borg
3030
name: Configure borg(matic) and remote backups
3131
tags: [borg]
32+
- role: monitoring
33+
name: Install and configure prometheus monitoring stack
34+
tags: [monitoring]
3235

3336
tasks:
3437
- name: Keep all packages up-to-date

ansible/roles/borg/tasks/main.yml

Lines changed: 84 additions & 81 deletions
@@ -1,94 +1,97 @@
11
---
2-
- name: Synchronize borgmatic files to remote
3-
ansible.posix.synchronize:
4-
src: "{{ role_path }}/files/"
5-
dest: "{{ ansible_user_home_dir }}/borgmatic"
6-
7-
- name: Check whether ssh config exists
2+
- name: Check whether ssh cert for borg exists
83
ansible.builtin.stat:
94
path: "{{ playbook_dir }}/vaults/borg/.ssh/id_ed25519"
105
register: ssh_config
116
delegate_to: localhost
127

13-
- name: Set fact for whether borg ssh config exists
14-
ansible.builtin.set_fact:
15-
ssh_config_defined: "{{ ssh_config.stat.exists }}"
8+
- name: Configure borgmatic and borg backups
9+
when:
10+
- ssh_config.stat.exists
11+
- borg_repository is defined
12+
- borg_encryption_passphrase is defined
13+
- borg_remote_path is defined
14+
block:
15+
- name: Synchronize borgmatic files to remote
16+
ansible.posix.synchronize:
17+
src: "{{ role_path }}/files/"
18+
dest: "{{ ansible_user_home_dir }}/borgmatic"
1619

17-
- name: Sync local ssh config vault remote
18-
when: ssh_config_defined
19-
become: true
20-
ansible.builtin.copy:
21-
src: "{{ playbook_dir }}/vaults/borg/.ssh/"
22-
dest: "{{ ansible_user_home_dir }}/borgmatic/.ssh"
23-
mode: "0700"
24-
owner: root
25-
group: root
20+
- name: Sync local ssh config vault remote
21+
when: ssh_config.stat.exists
22+
become: true
23+
ansible.builtin.copy:
24+
src: "{{ playbook_dir }}/vaults/borg/.ssh/"
25+
dest: "{{ ansible_user_home_dir }}/borgmatic/.ssh"
26+
mode: "0700"
27+
owner: root
28+
group: root
2629

27-
- name: Render Borgmatic configuration
28-
become: true
29-
ansible.builtin.template:
30-
src: config.yaml.j2
31-
dest: "{{ ansible_user_home_dir }}/borgmatic/config.yaml"
32-
mode: "0600"
33-
owner: "{{ ansible_ssh_user }}"
30+
- name: Render Borgmatic configuration
31+
become: true
32+
ansible.builtin.template:
33+
src: config.yaml.j2
34+
dest: "{{ ansible_user_home_dir }}/borgmatic/config.yaml"
35+
mode: "0600"
36+
owner: "{{ ansible_ssh_user }}"
3437

35-
vars:
36-
borg_exclude_patterns:
37-
- /data/backups
38-
borg_exclude_from: []
39-
borg_install_method: package
40-
borg_user: "{{ ansible_user }}"
41-
borg_source_directories:
42-
- /data
43-
borgmatic_hooks:
44-
before_backup:
45-
- echo "`date` - Starting backup."
46-
mongodb_databases:
47-
- name: all
48-
hostname: datalab-database-1
49-
port: 27017
50-
borgmatic_timer: cron
51-
borg_retention_policy:
52-
keep_daily: 30
53-
keep_weekly: 0
54-
keep_monthly: 12
55-
keep_yearly: 4
56-
borg_one_file_system: true
57-
borgmatic_store_atime: true
58-
borgmatic_store_ctime: true
59-
borg_encryption_passcommand: false
60-
borg_remote_rate_limit: 0
61-
borg_ssh_command: ssh
62-
borg_lock_wait_time: 5
38+
vars:
39+
borg_exclude_patterns:
40+
- /data/backups
41+
borg_exclude_from: []
42+
borg_install_method: package
43+
borg_user: "{{ ansible_user }}"
44+
borg_source_directories:
45+
- /data
46+
borgmatic_hooks:
47+
before_backup:
48+
- echo "`date` - Starting backup."
49+
mongodb_databases:
50+
- name: all
51+
hostname: datalab-database-1
52+
port: 27017
53+
borgmatic_timer: cron
54+
borg_retention_policy:
55+
keep_daily: 30
56+
keep_weekly: 0
57+
keep_monthly: 12
58+
keep_yearly: 4
59+
borg_one_file_system: true
60+
borgmatic_store_atime: true
61+
borgmatic_store_ctime: true
62+
borg_encryption_passcommand: false
63+
borg_remote_rate_limit: 0
64+
borg_ssh_command: ssh
65+
borg_lock_wait_time: 5
6366

64-
- name: Build borgmatic image
65-
become: true
66-
community.docker.docker_image:
67-
name: datalab-borgmatic
68-
source: build
69-
state: present
70-
force_source: true
71-
build:
72-
path: "{{ ansible_user_home_dir }}/borgmatic"
67+
- name: Build borgmatic image
68+
become: true
69+
community.docker.docker_image:
70+
name: datalab-borgmatic
71+
source: build
72+
state: present
73+
force_source: true
74+
build:
75+
path: "{{ ansible_user_home_dir }}/borgmatic"
7376

74-
- name: Create borg repository if it does not exist
75-
ansible.builtin.shell:
76-
cmd: docker run --rm --network datalab_backend -v {{ ansible_user_home_dir }}/borgmatic/.ssh:/root/.ssh -v /data:/data datalab-borgmatic borgmatic init --encryption=repokey -c /etc/borgmatic/config.yaml
77-
executable: /bin/bash
78-
register: new_borg_repository
79-
changed_when: '"Repository already exists" not in new_borg_repository.stdout'
80-
failed_when: new_borg_repository.rc != 0
77+
- name: Create borg repository if it does not exist
78+
ansible.builtin.shell:
79+
cmd: docker run --rm --network datalab_backend -v {{ ansible_user_home_dir }}/borgmatic/.ssh:/root/.ssh -v /data:/data datalab-borgmatic borgmatic init --encryption=repokey -c /etc/borgmatic/config.yaml
80+
executable: /bin/bash
81+
register: new_borg_repository
82+
changed_when: '"Repository already exists" not in new_borg_repository.stdout'
83+
failed_when: new_borg_repository.rc != 0
8184

82-
- name: Perform first backup if borg repo was just created
83-
ansible.builtin.shell:
84-
cmd: docker run --rm --network datalab_backend -v {{ ansible_user_home_dir }}/borgmatic/.ssh:/root/.ssh -v /data:/data datalab-borgmatic # noqa: no-changed-when
85-
executable: /bin/bash
86-
when: new_borg_repository.changed # noqa: no-handler
85+
- name: Perform first backup if borg repo was just created
86+
ansible.builtin.shell:
87+
cmd: docker run --rm --network datalab_backend -v {{ ansible_user_home_dir }}/borgmatic/.ssh:/root/.ssh -v /data:/data datalab-borgmatic # noqa: no-changed-when
88+
executable: /bin/bash
89+
when: new_borg_repository.changed # noqa: no-handler
8790

88-
- name: Add Cron job for borgmatic
89-
ansible.builtin.cron:
90-
name: borgmatic
91-
hour: "2"
92-
minute: "{{ range(0, 59) | random(seed=inventory_hostname) }}"
93-
user: "{{ ansible_user }}"
94-
job: docker run --rm --network datalab_backend -v {{ ansible_user_home_dir }}/borgmatic/.ssh:/root/.ssh -v /data:/data datalab-borgmatic # noqa: line-length
91+
- name: Add Cron job for borgmatic
92+
ansible.builtin.cron:
93+
name: borgmatic
94+
hour: "2"
95+
minute: "{{ range(0, 59) | random(seed=inventory_hostname) }}"
96+
user: "{{ ansible_user }}"
97+
job: docker run --rm --network datalab_backend -v {{ ansible_user_home_dir }}/borgmatic/.ssh:/root/.ssh -v /data:/data datalab-borgmatic # noqa: line-length
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
---
2+
- name: Deploy monitoring stack when Prometheus remote-write settings are defined
3+
when:
4+
- prometheus_remote_write_url is defined
5+
- prometheus_user is defined
6+
- prometheus_password is defined
7+
block:
8+
- name: Launch node_exporter container
9+
community.docker.docker_container:
10+
name: node-exporter
11+
image: prom/node-exporter:v1.9.1
12+
network_mode: host
13+
pid_mode: host
14+
state: started
15+
restart_policy: always
16+
command:
17+
- --path.rootfs=/host
18+
- --path.procfs=/host/proc
19+
- --path.sysfs=/host/sys
20+
- --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
21+
ports:
22+
- 9100:9100
23+
volumes:
24+
# Mount the host's root filesystem, /proc and /sys read-only under /host
25+
# so that node_exporter reports the host's metrics rather than the container's
26+
- /:/host:ro,rslave
27+
- /proc:/host/proc:ro
28+
- /sys:/host/sys:ro
29+
- /etc/machine-id:/etc/machine-id:ro
30+
- /etc/timezone:/etc/timezone:ro
31+
healthcheck:
32+
test: [CMD, wget, -q, --spider, http://localhost:9100/metrics]
33+
interval: 30s
34+
timeout: 10s
35+
retries: 3
36+
start_period: 5s
37+
38+
- name: Launch cadvisor container
39+
community.docker.docker_container:
40+
name: cadvisor
41+
image: gcr.io/cadvisor/cadvisor:v0.52.1
42+
network_mode: host
43+
pid_mode: host
44+
state: started
45+
restart_policy: always
46+
ports:
47+
- 8080:8080
48+
volumes:
49+
- /:/rootfs:ro
50+
- /var/run:/var/run:rw
51+
- /sys:/sys:ro
52+
- /var/lib/docker/:/var/lib/docker:ro
53+
54+
- name: Render prometheus config
55+
ansible.builtin.template:
56+
src: prometheus.yml.j2
57+
dest: "{{ ansible_user_home_dir }}/prometheus.yml"
58+
mode: "0644"
59+
register: prometheus_config
60+
61+
- name: Launch prometheus container
62+
community.docker.docker_container:
63+
name: prometheus
64+
image: prom/prometheus:v3.6.0
65+
recreate: "{{ prometheus_config.changed }}"
66+
network_mode: host
67+
state: started
68+
restart_policy: always
69+
command:
70+
- --config.file=/etc/prometheus/prometheus.yml
71+
- --storage.tsdb.path=/prometheus
72+
- --web.console.libraries=/etc/prometheus/console_libraries
73+
- --web.console.templates=/etc/prometheus/consoles
74+
- --web.enable-lifecycle
75+
ports:
76+
- 9090:9090
77+
volumes:
78+
- "prometheus_data:/prometheus"
79+
- "{{ ansible_user_home_dir }}/prometheus.yml:/etc/prometheus/prometheus.yml:ro"
80+
healthcheck:
81+
test: [CMD, wget, -q, --spider, http://localhost:9090/-/healthy]
82+
interval: 30s
83+
timeout: 10s
84+
retries: 3
85+
start_period: 5s
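
Once the role has run, the three containers can be probed from the host itself. The sketch below is one way to do that, assuming `curl` is installed on the host. The node_exporter and Prometheus URLs are the ones used by the healthchecks above; the cAdvisor `/healthz` path is an assumption, since the task above defines no healthcheck for that container. Both function names are hypothetical.

```shell
#!/bin/bash
# Print the local endpoints of the three monitoring containers, one per line.
# The cAdvisor /healthz path is an assumption; the other two come from the
# healthchecks defined in the role.
monitoring_endpoints() {
    printf '%s\n' \
        "http://localhost:9100/metrics" \
        "http://localhost:8080/healthz" \
        "http://localhost:9090/-/healthy"
}

# Probe each endpoint and report the result; returns non-zero if any fail.
check_monitoring_endpoints() {
    status=0
    for url in $(monitoring_endpoints); do
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "ok   $url"
        else
            echo "FAIL $url"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_monitoring_endpoints` on the target host prints one line per endpoint.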
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
global:
2+
scrape_interval: 1m
3+
scrape_timeout: 45s
4+
5+
external_labels:
6+
instance: {{ app_url }}
7+
environment: production
8+
app: datalab
9+
datalab: {{ datalab_prefix | default(app_url) }}
10+
11+
scrape_configs:
12+
- job_name: 'node-exporter'
13+
static_configs:
14+
- targets: ['localhost:9100']
15+
labels:
16+
service: system
17+
18+
- job_name: 'cadvisor'
19+
static_configs:
20+
- targets: ['localhost:8080']
21+
labels:
22+
service: docker
23+
24+
remote_write:
25+
- url: {{ prometheus_remote_write_url }}
26+
basic_auth:
27+
username: {{ prometheus_user }}
28+
password: {{ prometheus_password }}
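
Because the template interpolates the URL and credentials as unquoted YAML scalars, a value containing characters that are special in YAML (for example a password beginning with `#`) could render an invalid or misread configuration. One way to catch this is to validate the rendered file with `promtool`, which ships in the `prom/prometheus` image. The sketch below assumes Docker is available and that the config path is absolute; the function name `check_prometheus_config` and the `DRY_RUN` variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): validate a rendered
# prometheus.yml using promtool from the same image the role deploys.
check_prometheus_config() {
    config_path="$1"                      # absolute path to the rendered file
    image="${2:-prom/prometheus:v3.6.0}"  # image tag taken from the role
    set -- docker run --rm --entrypoint promtool \
        -v "${config_path}:/etc/prometheus/prometheus.yml:ro" \
        "$image" check config /etc/prometheus/prometheus.yml
    # With DRY_RUN=1 the command is printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, `check_prometheus_config "$HOME/prometheus.yml"` checks the file the role writes to the remote user's home directory, when run as that user on the target host.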

sync-ansible-upstream.sh

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 #!/bin/bash
 set -e -u -o pipefail
 commit=$(cd src/datalab-ansible-terraform && git describe --tags)
-rsync --exclude vaults --exclude inventory.yml -avr src/datalab-ansible-terraform/ansible .
+rsync --exclude vaults --exclude inventory.yml -avr src/datalab-ansible-terraform/sync-ansible-upstream.sh src/datalab-ansible-terraform/Makefile src/datalab-ansible-terraform/.vault-pass.sh src/datalab-ansible-terraform/README.md src/datalab-ansible-terraform/requirements.txt src/datalab-ansible-terraform/ansible .
 git add -p ansible
 git add src/datalab-ansible-terraform
 git add $(git ls-files ansible --others --exclude-standard)
