Commit 329197a

Add Nova Compute Ironic failover procedure

Document:

- Moving from multiple instances to a single instance
- How to re-deploy the service

Config changes:

- Prompt users to set a static nova-compute-ironic 'host' name.

1 parent 74a4ac7 commit 329197a

4 files changed: +231, -3 lines

doc/source/operations/index.rst

Lines changed: 4 additions & 3 deletions

@@ -7,9 +7,10 @@ This guide is for operators of the StackHPC Kayobe configuration project.
 .. toctree::
    :maxdepth: 1

-   upgrading
-   rabbitmq
-   octavia
    hotfix-playbook
+   nova-compute-ironic
+   octavia
+   rabbitmq
    secret-rotation
    tempest
+   upgrading

doc/source/operations/nova-compute-ironic.rst

Lines changed: 211 additions & 0 deletions

@@ -0,0 +1,211 @@

===================
Nova Compute Ironic
===================

This section describes the deployment of the OpenStack Nova Compute
Ironic service. The Nova Compute Ironic service is used to integrate
OpenStack Ironic into Nova as a 'hypervisor' driver. The end users of Nova
can then deploy and manage baremetal hardware in a similar way to VMs.

High Availability (HA)
======================

The OpenStack Nova Compute service is designed to be installed once on every
hypervisor in an OpenStack deployment. In this configuration, it makes little
sense to run additional service instances. Even if you wanted to, it's not
supported by design. This pattern breaks down with the Ironic baremetal
service, which must run on the OpenStack control plane. It is not feasible
to have a 1:1 mapping of Nova Compute Ironic services to baremetal nodes.

The obvious HA solution is to run multiple instances of Nova Compute Ironic
on the control plane, so that if one fails, the others can take over. However,
due to assumptions long baked into the Nova source code, this is not trivial.
The HA feature provided by the Nova Compute Ironic service has proven to be
unstable, and the direction upstream is to switch to an active/passive
solution [1].

However, challenges still exist with the active/passive solution. Since the
Nova Compute Ironic HA feature is 'always on', one must ensure that only a
single instance (per Ironic conductor group) is ever running. It is not
possible to simply put multiple service instances behind HAProxy and use the
active/passive mode.

Such problems are commonly solved with a technology such as Pacemaker, or in
the modern world, with a container orchestration engine such as Kubernetes.
Kolla Ansible provides neither, because in general it doesn't need to. Its
goal is simplicity.

The interim solution is therefore to run a single Nova Compute Ironic
service. If the service goes down, remedial action must be taken before
Ironic nodes can be managed. In many environments the loss of the Ironic
API for short periods is acceptable, provided that the service can be easily
resurrected. The purpose of this document is to facilitate that.

TODO: Add caveats about new sharding mode (not covered here).

Optimal configuration of Nova Compute Ironic
============================================

Determine the current configuration for the site. How many Nova Compute
Ironic instances are running on the control plane?

.. code-block:: console

   $ openstack compute service list
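
The output, trimmed here to the ``nova-compute`` rows, should look something
like the following (IDs and timestamps are illustrative only):

.. code-block:: console

   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
   | ID                                   | Binary       | Host               | Zone | Status  | State | Updated At                 |
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
   | 8b3f2d6a-0a70-4f3c-8f29-1d2f0e9c6b01 | nova-compute | controller1-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   | 5e9d1c2b-7e4f-4a6e-9a8d-3c5b7f0a2e02 | nova-compute | controller2-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   | 1a7c4e9f-2b8d-4c3a-b6e5-9f0d8a7c5e03 | nova-compute | controller3-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+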

Typically you will see either three or one. By default, each host will be
marked with a postfix, e.g. ``controller1-ironic``. If you find more than
one, you will need to remove some instances by completing the following
section.

Moving from multiple Nova Compute Ironic instances to a single instance
------------------------------------------------------------------------

1. Decide where the single instance should run. Typically, this will be
   one of the three control plane hosts. Once you have chosen, set
   the following variable in ``etc/kayobe/nova.yml``. Here we have
   picked ``controller1``.

   .. code-block:: yaml

      kolla_nova_compute_ironic_host: controller1

2. Ensure that you have organised a maintenance window, during which
   there will be no Ironic operations. You will be breaking the Ironic
   API.

3. Perform a database backup.

   .. code-block:: console

      $ kayobe overcloud database backup -vvv

   Check the output of the command, and locate the backup files.

4. Identify the baremetal nodes associated with the Nova Compute Ironic
   instances that will be removed. You don't need to do anything with these
   specifically; they are just for reference later. For example:

   .. code-block:: console

      $ openstack baremetal node list --long -c "Instance Info" | grep controller3-ironic | wc -l
      61
      $ openstack baremetal node list --long -c "Instance Info" | grep controller2-ironic | wc -l
      35
      $ openstack baremetal node list --long -c "Instance Info" | grep controller1-ironic | wc -l
      55

5. Disable the redundant Nova Compute Ironic services:

   .. code-block:: console

      $ openstack compute service set controller3-ironic nova-compute --disable
      $ openstack compute service set controller2-ironic nova-compute --disable
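
   You can verify that only one service remains enabled. A quick check,
   trimmed to the relevant columns (output illustrative):

   .. code-block:: console

      $ openstack compute service list --service nova-compute -c Host -c Status
      +--------------------+----------+
      | Host               | Status   |
      +--------------------+----------+
      | controller1-ironic | enabled  |
      | controller2-ironic | disabled |
      | controller3-ironic | disabled |
      +--------------------+----------+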

6. Delete the redundant Nova Compute Ironic services. You will need the service
   ID. For example:

   .. code-block:: console

      $ ID=$(openstack compute service list | grep controller3-ironic | awk '{print $2}')
      $ openstack compute service delete --os-compute-api-version 2.53 $ID

   In older releases, you may hit a bug where the service can't be deleted if it
   is not managing any instances. In this case, just move on and leave the service
   disabled. E.g.

   .. code-block:: console

      $ openstack compute service delete --os-compute-api-version 2.53 c993b57e-f60c-4652-8328-5fb0e17c99c0
      Failed to delete compute service with ID 'c993b57e-f60c-4652-8328-5fb0e17c99c0': HttpException: 500: Server Error for url:
      https://acme.pl-2.internal.hpc.is:8774/v2.1/os-services/c993b57e-f60c-4652-8328-5fb0e17c99c0, Unexpected API Error.
      Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.

7. Remove the Docker containers for the redundant Nova Compute Ironic services:

   .. code-block:: console

      $ ssh controller2 sudo docker rm -f nova_compute_ironic
      $ ssh controller3 sudo docker rm -f nova_compute_ironic

8. Ensure that all Ironic nodes are using the single remaining Nova Compute
   Ironic instance. Baremetal nodes in use by compute instances will not
   fail over to the remaining Nova Compute Ironic service by themselves, so
   the records must be updated manually. Here, the active service is running
   on ``controller1``:

   .. code-block:: console

      $ ssh controller1
      $ sudo docker exec -it mariadb mysql -u nova -p$(sudo grep 'mysql+pymysql://nova:' /etc/kolla/nova-api/nova.conf | awk -F'[:,@]' '{print $3}')
      MariaDB [(none)]> use nova;

   Proceed with caution. It is good practice to update one record first:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic' limit 1;
      Query OK, 1 row affected (0.002 sec)
      Rows matched: 1  Changed: 1  Warnings: 0

   At this stage you should go back to step 4 and check that the numbers have
   changed as expected. When you are happy, update remaining records for all
   services which have been removed:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic';
      Query OK, 59 rows affected (0.009 sec)
      Rows matched: 59  Changed: 59  Warnings: 0
      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller2-ironic';
      Query OK, 35 rows affected (0.003 sec)
      Rows matched: 35  Changed: 35  Warnings: 0

9. Repeat step 4. Verify that all Ironic nodes are using the single remaining
   Nova Compute Ironic instance.
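
   Following the illustrative figures above, the counts for the removed
   services should now be zero:

   .. code-block:: console

      $ openstack baremetal node list --long -c "Instance Info" | grep controller3-ironic | wc -l
      0
      $ openstack baremetal node list --long -c "Instance Info" | grep controller2-ironic | wc -l
      0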

Making it easy to re-deploy Nova Compute Ironic
-----------------------------------------------

In the previous section we saw that at any given time, a baremetal node is
associated with a single Nova Compute Ironic instance. At this stage, assuming
that you have diligently followed the instructions, you are in the situation
where all Ironic baremetal nodes are managed by a single Nova Compute Ironic
instance. If this service goes down, you will not be able to manage *any*
baremetal nodes.

By default, the single remaining Nova Compute Ironic instance will be named
after the host on which it is deployed. The host name is passed to the Nova
Compute Ironic instance via the ``[DEFAULT]`` section of the ``nova.conf``
file, using the ``host`` field.
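
You can inspect the name that the running instance is using. A quick check,
assuming the conventional Kolla Ansible config path for the service:

.. code-block:: console

   $ ssh controller1 sudo grep '^host' /etc/kolla/nova-compute-ironic/nova.conf
   host = controller1-ironic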

If you wish to re-deploy this instance, for example because the original host
was permanently mangled in the World Server Throwing Championship [2], you
must ensure that the new instance has the same name as the old one. Simply
setting ``kolla_nova_compute_ironic_host`` to another controller and
re-deploying the service is not enough; the new instance will be named after
the new host.

To work around this, you should set the ``host`` field in ``nova.conf`` to a
constant, such that the new Nova Compute Ironic instance comes up with the
same name as the one it replaces.

For example, if the original instance resides on ``controller1``, then set the
following in ``etc/kayobe/nova.yml``:

.. code-block:: yaml

   kolla_nova_compute_ironic_static_host_name: controller1-ironic

Note that an ``-ironic`` postfix is added to the host name. This comes from
a convention in Kolla Ansible. It is worth making this change ahead of time,
even if you don't need to re-deploy the service immediately.
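
With the static name in place, a re-deployment can then follow a sequence
like the one below. This is a sketch: it assumes ``controller1`` has failed,
that ``controller2`` is the replacement host, and that limiting the run with
``--kolla-tags nova`` is acceptable in your environment.

.. code-block:: console

   $ # In etc/kayobe/nova.yml:
   $ #   kolla_nova_compute_ironic_host: controller2
   $ #   kolla_nova_compute_ironic_static_host_name: controller1-ironic
   $ kayobe overcloud service deploy --kolla-tags nova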

It is also possible to use an arbitrary ``host`` name, but you will need
to edit the database again. That is an optional exercise left for the reader.
See [1] for further details.
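
Should you choose that route, the database edit follows the same pattern as
step 8 above. A sketch, with ``nova-compute-ironic-static`` as a hypothetical
arbitrary name:

.. code-block:: console

   MariaDB [nova]> update instances set host='nova-compute-ironic-static' where deleted=0 and host='controller1-ironic';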

TODO: Investigate KA bug with assumption about host field.

[1] https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/ironic-shards.html#migrate-from-peer-list-to-shard-key

[2] https://www.cloudfest.com/world-server-throwing-championship

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@

{% if kolla_enable_ironic|bool and kolla_nova_compute_ironic_host is not none %}
[DEFAULT]
host = {{ kolla_nova_compute_ironic_static_host_name | mandatory('You must set a static host name to help with service failover. See the operations documentation, Ironic section.') }}
{% endif %}
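
With Ironic enabled and the variables set as in the documentation above, this
template renders to a ``nova.conf`` fragment along these lines (illustrative):

.. code-block:: ini

   [DEFAULT]
   host = controller1-ironic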

Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@

---
fixes:
  - |
    Adds basic support and a document explaining how to migrate to a single
    nova-compute-ironic instance, and how to re-deploy the instance to another
    machine in the event of failure. See the operations / nova-compute-ironic
    doc for further details.
upgrade:
  - |
    Ensure that your deployment has only one nova-compute-ironic service running
    per conductor group. See the operations / nova-compute-ironic doc for further
    details.
