
Commit e8c9e7d

ddhodge and aishwarya24 authored

[doc][yba][DOC-521][2024.2] Auto master failover (yugabyte#24875)

* auto master failover
* auto master failover
* review comments
* review comment
* Update docs/content/v2024.2/yugabyte-platform/manage-deployments/remove-nodes.md
* review comments

Co-authored-by: Aishwarya Chakravarthy <achakravarthy@yugabyte.com>

1 parent d2a53ed commit e8c9e7d

File tree

2 files changed: +132 -9 lines changed

docs/content/preview/yugabyte-platform/manage-deployments/remove-nodes.md

Lines changed: 61 additions & 0 deletions
@@ -16,6 +16,67 @@ menu:
type: docs
---

## Automatic YB-Master failover

{{<tags/feature/ea>}} To avoid under-replication, YugabyteDB Anywhere can automatically detect a YB-Master server that is not responding to the master leader, or that is lagging in WAL operations, and fail over to another available node in the same availability zone.

Note that automatic failover works only for a single unhealthy master server.

### Prerequisites

- Automatic YB-Master failover is {{<tags/feature/ea>}}. To enable the feature for a universe, set the **Auto Master Failover** universe runtime configuration option (config key `yb.auto_master_failover.enabled`) to true, as shown in the sketch after this list. Refer to [Manage runtime configuration settings](../../administer-yugabyte-platform/manage-runtime-config/).
- The universe has the following characteristics:

  - runs v2.20.3.0, v2.21.0.0, or later
  - is on-premises or on a cloud provider (Kubernetes is not supported)
  - has a replication factor of 3 or more
  - does not have dedicated masters

- A replacement node (running a TServer) is available in the same availability zone. (Read replica nodes are not valid for failover.)
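
For example, you can set the option programmatically using the YugabyteDB Anywhere REST API. The following is a minimal sketch with placeholder values for the YBA address, API token, and customer and universe UUIDs; the runtime-config endpoint path and plain-text payload are assumptions to verify against the API reference for your YBA version:

```python
# Minimal sketch: enable auto master failover for one universe via the
# YBA REST API. Endpoint path and payload conventions are assumptions;
# verify against the YBA API reference for your version.
import requests

YBA_URL = "https://yba.example.com"   # placeholder YBA address
API_TOKEN = "<api-token>"             # placeholder API token
CUSTOMER_UUID = "<customer-uuid>"     # placeholder customer UUID
UNIVERSE_UUID = "<universe-uuid>"     # the universe is the config scope

response = requests.put(
    f"{YBA_URL}/api/v1/customers/{CUSTOMER_UUID}"
    f"/runtime_config/{UNIVERSE_UUID}/key/yb.auto_master_failover.enabled",
    headers={"X-AUTH-YW-API-TOKEN": API_TOKEN, "Content-Type": "text/plain"},
    data="true",  # runtime config values are sent as raw strings
)
response.raise_for_status()
```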
36+
37+
### How it works
38+
39+
Automatic master failover works as follows:
40+
41+
1. When active, by default YugabyteDB Anywhere checks universe masters every minute to see if they are healthy.
42+
43+
You can customize this interval using the universe runtime configuration option `yb.auto_master_failover.detect_interval`.

1. YugabyteDB Anywhere declares a master failed or potentially failed when any of the following conditions is met:

    - The master heartbeat delay is greater than the threshold.
    - The maximum tablet follower lag is greater than the threshold.

1. When YugabyteDB Anywhere detects an unhealthy master in a universe, it displays a message on the universe **Overview** indicating a potential master failure and the estimated time remaining until automatic failover.

    The warning is displayed when a master lags more than the threshold defined by the universe runtime configuration option `yb.auto_master_failover.master_follower_lag_soft_threshold`.

    You can configure the time to failover using the universe runtime configuration option `yb.auto_master_failover.master_follower_lag_hard_threshold`.
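
    To make the two thresholds concrete, the following pure-Python sketch maps a master's follower lag to the states described above. The numeric values are made-up illustrations, not YBA defaults:

    ```python
    # Illustration only: how the soft and hard lag thresholds interact.
    # The threshold values are made up; actual defaults are set by YBA.
    def master_lag_state(lag_seconds: float,
                         soft_s: float = 600.0,
                         hard_s: float = 900.0) -> str:
        """Classify a master's follower lag against the two thresholds."""
        if lag_seconds <= soft_s:
            return "healthy"   # no warning shown
        if lag_seconds <= hard_s:
            return "warning"   # Overview shows a potential failure and countdown
        return "failover"      # hard threshold exceeded; failover can trigger

    print(master_lag_state(720.0))  # between the thresholds -> "warning"
    ```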

1. During this time, you can investigate and potentially fix the problem. Navigate to the universe **Nodes** tab to check the status of the nodes. You may need to replace or eliminate unresponsive nodes, or fix a lagging master process. Refer to the following sections.

    If you fix the problem, the warning is dismissed and YugabyteDB Anywhere returns to monitoring the universe.

    If you need more time to investigate or fix the problem manually, you can snooze the failover.

1. Failover is triggered if the time expires and the issue hasn't been fixed.

    For a universe to successfully fail over masters, the following must be true:

    - The universe is not paused.
    - The universe is not locked (that is, no other locking operation is running).
    - All nodes are live; that is, there aren't any stopped, removed, or decommissioned nodes.

    Note that master failover may not fix all the issues with the universe. Be sure to address other failed or unavailable nodes and any other issues to bring your universe back to a healthy state.

    If a failover task fails, it is retried automatically. The retry limit for failover tasks is set by the universe runtime configuration option `yb.auto_master_failover.max_task_retries`.
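
    The preconditions above boil down to a simple check. A pure-logic sketch, with field and state names invented for illustration (YBA's internal model differs):

    ```python
    # Pure-logic illustration of the failover preconditions listed above.
    # Field and state names are invented for the example.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        state: str  # e.g. "Live", "Stopped", "Removed", "Decommissioned"

    @dataclass
    class Universe:
        paused: bool
        locked: bool        # True while another locking operation is running
        nodes: List[Node] = field(default_factory=list)

    def can_auto_failover(u: Universe) -> bool:
        return (not u.paused
                and not u.locked
                and all(n.state == "Live" for n in u.nodes))

    # One stopped node is enough to block automatic failover.
    u = Universe(False, False, [Node("Live"), Node("Stopped"), Node("Live")])
    print(can_auto_failover(u))  # -> False
    ```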

1. After starting a new master on a different node in the same availability zone as the failed master, YugabyteDB Anywhere waits for you to recover any failed VMs, including the failed master VM, so that it can update the master address configuration on those VMs. Follow the steps in [Replace a live or unreachable node](#replace-a-live-or-unreachable-node).

    You can set the delay for the post-failover master address synchronization using the universe runtime configuration option `yb.auto_master_failover.sync_master_addrs_task_delay`. The reference start time is the time at which YugabyteDB Anywhere finds that all processes are running fine on all the VMs.

    Post failover, there is no retry limit for this synchronization, as it is a critical operation.
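
    The interval, threshold, and delay options from the preceding steps can be set with the same runtime-config call shown in the Prerequisites sketch. Again a sketch only: the endpoint, the example values, and the duration-string format are all assumptions to confirm for your YBA version:

    ```python
    # Sketch: tune auto master failover options for one universe, reusing the
    # assumed runtime-config endpoint from the earlier example. The example
    # values and duration-string format are assumptions -- verify them.
    import requests

    YBA_URL = "https://yba.example.com"
    API_TOKEN = "<api-token>"
    CUSTOMER_UUID = "<customer-uuid>"
    UNIVERSE_UUID = "<universe-uuid>"

    TUNABLES = {
        "yb.auto_master_failover.detect_interval": "1 minute",
        "yb.auto_master_failover.master_follower_lag_soft_threshold": "10 minutes",
        "yb.auto_master_failover.master_follower_lag_hard_threshold": "15 minutes",
        "yb.auto_master_failover.sync_master_addrs_task_delay": "5 minutes",
        "yb.auto_master_failover.max_task_retries": "3",
    }

    for key, value in TUNABLES.items():
        resp = requests.put(
            f"{YBA_URL}/api/v1/customers/{CUSTOMER_UUID}"
            f"/runtime_config/{UNIVERSE_UUID}/key/{key}",
            headers={"X-AUTH-YW-API-TOKEN": API_TOKEN,
                     "Content-Type": "text/plain"},
            data=value,
        )
        resp.raise_for_status()
    ```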

## Replace a live or unreachable node

To replace a live node for extended maintenance or replace an unhealthy node, do the following:

docs/content/v2024.2/yugabyte-platform/manage-deployments/remove-nodes.md

Lines changed: 71 additions & 9 deletions
@@ -12,6 +12,67 @@ menu:
type: docs
---


(The same "Automatic YB-Master failover" section shown above for the preview file is added to this file.)

## Replace a live or unreachable node

To replace a live node for extended maintenance or replace an unhealthy node, do the following:
@@ -21,19 +82,20 @@ To replace a live node for extended maintenance or replace an unhealthy node, do

![Replace Node Actions](/images/ee/replace-node.png)

-1. Click OK to confirm.
+1. Click **OK** to confirm.

-YugabyteDB Anywhere (YBA) starts the node replacement process, and you can view the progress on the **Tasks** tab. As part of the node replacement process, all data (tablets) on the existing node will be moved to other nodes to ensure that the desired replication factor is maintained throughout the operation.
+YugabyteDB Anywhere starts the node replacement process, and you can view the progress on the **Tasks** tab. As part of the node replacement process, all data (tablets) on the existing node will be moved to other nodes to ensure that the desired replication factor is maintained throughout the operation.

-For cloud providers (AWS, Azure, or GCP), YBA returns the existing node back to the provider and provisions a new replacement node from the cloud provider. For on-premises universes, the existing node is returned to the [on-premises provider node pool](../../configure-yugabyte-platform/on-premises-nodes/) and a new replacement node is selected from the free pool.
+For cloud providers (AWS, Azure, or GCP), YugabyteDB Anywhere returns the existing node back to the provider and provisions a new replacement node from the cloud provider. For on-premises universes, the existing node is returned to the [on-premises provider node pool](../../configure-yugabyte-platform/on-premises-nodes/) and a new replacement node is selected from the free pool.

-For on-premises universes, clean up of existing data directories and running processes may fail if the node is unhealthy. In such cases, YBA sets the state to Decommissioned. This prevents the node from being added to a new universe.
+For on-premises universes, clean up of existing data directories and running processes may fail if the node is unhealthy. In such cases, YugabyteDB Anywhere sets the state to Decommissioned. This prevents the node from being added to a new universe.

### Check on-premises node state

On-premises nodes have three states: In use, Free, and Decommissioned, as described in the following illustration.

![Decommissioned node workflow](/images/ee/on-prem-replace-workflow.png)

To check the state of an on-premises node, navigate to **Integrations > Infrastructure > On-Premises Datacenters**, select the associated on-premises configuration, and click **Instances**.

### Recommission a decommissioned on-premises node
@@ -44,13 +106,13 @@ Perform the following steps to recommission a node:

1. Navigate to **Integrations > Infrastructure > On-Premises Datacenters**, select the associated on-premises configuration, and click **Instances**.

-1. Under Instances, for the decommissioned node, click **Actions > Recommission Node**. YBA will now re-attempt to clean up existing data directories and processes on this node.
+1. Under Instances, for the decommissioned node, click **Actions > Recommission Node**. YugabyteDB Anywhere will re-attempt to clean up existing data directories and processes on this node.

    ![Recommission Node](/images/ee/recommission-node.png)

1. Click OK to confirm.

-YugabyteDB Anywhere (YBA) starts the node recommissioning process, and you can view the progress on the **Tasks** tab.
+YugabyteDB Anywhere starts the node recommissioning process, and you can view the progress on the **Tasks** tab.

## Eliminate an unresponsive node
@@ -134,11 +196,11 @@ A typical universe has an RF of 3 or 5. At the end of the [node removal](#remove

1. Click **Actions > Start Master** corresponding to the node, as per the following illustration.

    This action is only available if there are additional nodes in the same availability zone and these nodes do not have a running Master process.

    ![Start master](/images/yp/start-master.png)

    When you execute the start Master action, YugabyteDB Anywhere performs the following:

    1. Configures the Master on the subject node.
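
If you drive node operations programmatically, the Start Master action can likely also be invoked through the YBA node-actions API. A hypothetical sketch: the endpoint path and the START_MASTER action name are assumptions to verify against the YBA API reference:

```python
# Hypothetical sketch: trigger the "Start Master" node action via the YBA
# REST API. The endpoint path and action name are assumptions; verify them
# against the YBA API reference for your version before use.
import requests

YBA_URL = "https://yba.example.com"
API_TOKEN = "<api-token>"
CUSTOMER_UUID = "<customer-uuid>"
UNIVERSE_UUID = "<universe-uuid>"
NODE_NAME = "yb-15-n3"  # placeholder node name

response = requests.put(
    f"{YBA_URL}/api/v1/customers/{CUSTOMER_UUID}"
    f"/universes/{UNIVERSE_UUID}/nodes/{NODE_NAME}",
    headers={"X-AUTH-YW-API-TOKEN": API_TOKEN},
    json={"nodeAction": "START_MASTER"},  # assumed action name
)
response.raise_for_status()
```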

@@ -202,7 +264,7 @@ In some cases, depending on the node's status, YugabyteDB Anywhere allows you to

1. Find a node with a Decommissioned status and click its corresponding **Actions > Add Node**, as per the following illustration:

    ![Add Node Actions](/images/ee/node-actions-add-node.png)

For Infrastructure as a Service (IaaS) such as AWS and GCP, YugabyteDB Anywhere will spawn a new node with the existing node's instance type in the existing region and zone of that node. When the process completes, the node will have the Master and TServer processes running, along with data that is load-balanced onto this node. The node's name will be reused and the status will be shown as Live.