|
===================
Nova Compute Ironic
===================

This section describes the deployment of the OpenStack Nova Compute
Ironic service. The Nova Compute Ironic service is used to integrate
OpenStack Ironic into Nova as a 'hypervisor' driver. The end users of Nova
can then deploy and manage baremetal hardware in a similar way to VMs.

High Availability (HA)
======================

The OpenStack Nova Compute service is designed to be installed once on every
hypervisor in an OpenStack deployment. In this configuration, it makes little
sense to run additional service instances. Even if you wanted to, it's not
supported by design. This pattern breaks down with the Ironic baremetal
service, which must run on the OpenStack control plane. It is not feasible
to have a 1:1 mapping of Nova Compute Ironic services to baremetal nodes.

The obvious HA solution is to run multiple instances of Nova Compute Ironic
on the control plane, so that if one fails, the others can take over. However,
due to assumptions long baked into the Nova source code, this is not trivial.
The HA feature provided by the Nova Compute Ironic service has proven to be
unstable, and the direction upstream is to switch to an active/passive
solution [1].

However, challenges still exist with the active/passive solution. Since the
Nova Compute Ironic HA feature is 'always on', one must ensure that only a
single instance (per Ironic conductor group) is ever running. It is not
possible to simply put multiple service instances behind HAProxy and use the
active/passive mode.

Such problems are commonly solved with a technology such as Pacemaker, or in
the modern world, with a container orchestration engine such as Kubernetes.
Kolla Ansible provides neither, because in general it doesn't need to. Its
goal is simplicity.

The interim solution is therefore to run a single Nova Compute Ironic
service. If the service goes down, remedial action must be taken before
Ironic nodes can be managed. In many environments the loss of the Ironic
API for short periods is acceptable, provided that it can be easily
resurrected. The purpose of this document is to facilitate that.

TODO: Add caveats about new sharding mode (not covered here).

Optimal configuration of Nova Compute Ironic
============================================

Determine the current configuration for the site. How many Nova Compute
Ironic instances are running on the control plane?

.. code-block:: console

   $ openstack compute service list

Typically you will see either three or one. By default the host will be
marked with a suffix, e.g. ``controller1-ironic``. If you find more than
one, you will need to remove some instances. You must complete the
following section.

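For illustration, on a control plane still running three instances the output
might look something like this (the hostnames, IDs and timestamps shown here
are hypothetical):

.. code-block:: console

   $ openstack compute service list --service nova-compute
   +----+--------------+--------------------+------+---------+-------+----------------------------+
   | ID | Binary       | Host               | Zone | Status  | State | Updated At                 |
   +----+--------------+--------------------+------+---------+-------+----------------------------+
   |  7 | nova-compute | controller1-ironic | nova | enabled | up    | 2024-01-01T12:00:00.000000 |
   |  8 | nova-compute | controller2-ironic | nova | enabled | up    | 2024-01-01T12:00:00.000000 |
   |  9 | nova-compute | controller3-ironic | nova | enabled | up    | 2024-01-01T12:00:00.000000 |
   +----+--------------+--------------------+------+---------+-------+----------------------------+
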
Moving from multiple Nova Compute Ironic instances to a single instance
------------------------------------------------------------------------

1. Decide where the single instance should run. Typically, this will be
   one of the three control plane hosts. Once you have chosen, set
   the following variable in ``etc/kayobe/nova.yml``. Here we have
   picked ``controller1``.

   .. code-block:: yaml

      kolla_nova_compute_ironic_host: controller1

2. Ensure that you have organised a maintenance window, during which
   there will be no Ironic operations. You will be breaking the Ironic
   API.

3. Perform a database backup.

   .. code-block:: console

      $ kayobe overcloud database backup -vvv

   Check the output of the command, and locate the backup files.

4. Identify baremetal nodes associated with Nova Compute Ironic instances
   that will be removed. You don't need to do anything with these
   specifically; they are just for reference later. For example:

   .. code-block:: console

      $ openstack baremetal node list --long -c "Instance Info" | grep controller3-ironic | wc -l
      61
      $ openstack baremetal node list --long -c "Instance Info" | grep controller2-ironic | wc -l
      35
      $ openstack baremetal node list --long -c "Instance Info" | grep controller1-ironic | wc -l
      55

5. Disable the redundant Nova Compute Ironic services:

   .. code-block:: console

      $ openstack compute service set controller3-ironic nova-compute --disable
      $ openstack compute service set controller2-ironic nova-compute --disable

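   You can confirm that both services now show as disabled (``-c`` simply
   restricts the columns displayed):

   .. code-block:: console

      $ openstack compute service list --service nova-compute -c Host -c Status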
|
6. Delete the redundant Nova Compute Ironic services. You will need the service
   ID. For example:

   .. code-block:: console

      $ ID=$(openstack compute service list | grep controller3-ironic | awk '{print $2}')
      $ openstack compute service delete --os-compute-api-version 2.53 $ID

   In older releases, you may hit a bug where the service can't be deleted if it
   is not managing any instances. In this case, just move on and leave the service
   disabled. For example:

   .. code-block:: console

      $ openstack compute service delete --os-compute-api-version 2.53 c993b57e-f60c-4652-8328-5fb0e17c99c0
      Failed to delete compute service with ID 'c993b57e-f60c-4652-8328-5fb0e17c99c0': HttpException: 500: Server Error for url:
      https://acme.pl-2.internal.hpc.is:8774/v2.1/os-services/c993b57e-f60c-4652-8328-5fb0e17c99c0, Unexpected API Error.
      Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.

7. Remove the Docker containers for the redundant Nova Compute Ironic services:

   .. code-block:: console

      $ ssh controller2 sudo docker rm -f nova_compute_ironic
      $ ssh controller3 sudo docker rm -f nova_compute_ironic

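   To confirm, the following should no longer list a ``nova_compute_ironic``
   container on either host:

   .. code-block:: console

      $ ssh controller2 sudo docker ps -a --filter name=nova_compute_ironic
      $ ssh controller3 sudo docker ps -a --filter name=nova_compute_ironic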
|
8. Ensure that all Ironic nodes are using the single remaining Nova Compute
   Ironic instance. Note that baremetal nodes in use by compute instances
   will not fail over to the remaining Nova Compute Ironic service by
   themselves; the records must be updated in the Nova database. Here, the
   active service is running on ``controller1``:

   .. code-block:: console

      $ ssh controller1
      $ sudo docker exec -it mariadb mysql -u nova -p$(sudo grep 'mysql+pymysql://nova:' /etc/kolla/nova-api/nova.conf | awk -F'[:,@]' '{print $3}')
      MariaDB [(none)]> use nova;

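   As a sanity check, you can first count the active instances per service
   host before making any changes (a minimal illustrative query against the
   standard Nova schema):

   .. code-block:: console

      MariaDB [nova]> select host, count(*) from instances where deleted=0 group by host;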
|
   Proceed with caution. It is good practice to update one record first:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic' limit 1;
      Query OK, 1 row affected (0.002 sec)
      Rows matched: 1  Changed: 1  Warnings: 0

   At this stage you should go back to step 4 and check that the numbers have
   changed as expected. When you are happy, update remaining records for all
   services which have been removed:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic';
      Query OK, 59 rows affected (0.009 sec)
      Rows matched: 59  Changed: 59  Warnings: 0
      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller2-ironic';
      Query OK, 35 rows affected (0.003 sec)
      Rows matched: 35  Changed: 35  Warnings: 0

9. Repeat step 4. Verify that all Ironic nodes are using the single remaining
   Nova Compute Ironic instance.

Making it easy to re-deploy Nova Compute Ironic
-----------------------------------------------

In the previous section we saw that at any given time, a baremetal node is
associated with a single Nova Compute Ironic instance. At this stage, assuming
that you have diligently followed the instructions, you are in the situation
where all Ironic baremetal nodes are managed by a single Nova Compute Ironic
instance. If this service goes down, you will not be able to manage *any*
baremetal nodes.

By default, the single remaining Nova Compute Ironic instance will be named
after the host on which it is deployed. The host name is passed to the Nova
Compute Ironic instance via the ``host`` field in the ``[DEFAULT]`` section
of the ``nova.conf`` file.

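For example, on the host running the service, the rendered configuration might
contain something like the following (the config path shown assumes a standard
Kolla Ansible layout):

.. code-block:: console

   $ ssh controller1 sudo grep '^host =' /etc/kolla/nova-compute-ironic/nova.conf
   host = controller1-ironic
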
If you wish to re-deploy this instance, for example because the original host
was permanently mangled in the World Server Throwing Championship [2], you
must ensure that the new instance has the same name as the old one. Simply
setting ``kolla_nova_compute_ironic_host`` to another controller and
re-deploying the service is not enough; the new instance will be named after
the new host.

To work around this you should set the ``host`` field in ``nova.conf`` to a
constant, such that the new Nova Compute Ironic instance comes up with the
same name as the one it replaces.

For example, if the original instance resides on ``controller1``, then set the
following in ``etc/kayobe/nova.yml``:

.. code-block:: yaml

   kolla_nova_compute_ironic_static_host_name: controller1-ironic

Note that an ``-ironic`` suffix is added to the hostname. This comes from
a convention in Kolla Ansible. It is worth making this change ahead of time,
even if you don't need to immediately re-deploy the service.

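A change like this can then be rolled out with a targeted reconfigure. The
exact invocation may vary between Kayobe releases, but something like the
following should work:

.. code-block:: console

   $ kayobe overcloud service reconfigure --kolla-tags nova
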
It is also possible to use an arbitrary ``host`` name, but you will need
to edit the database again. That is an optional exercise left for the reader.
See [1] for further details.

TODO: Investigate KA bug with assumption about host field.

[1] https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/ironic-shards.html#migrate-from-peer-list-to-shard-key

[2] https://www.cloudfest.com/world-server-throwing-championship