[doc][yba][DOC-503][2024.2] Updates to YBA xCluster (yugabyte#24847)

ddhodge · aishwarya24 · web-flow · commit b426a72bfc53 · 2024-11-19T21:20:31.000-05:00
* updates to YBA xCluster

* misc edits

* format

* review comments

* Apply suggestions from code review

Co-authored-by: Aishwarya Chakravarthy <achakravarthy@yugabyte.com>

* review comments

* review comment

* copy to preview

* fix stable

---------

Co-authored-by: Aishwarya Chakravarthy <achakravarthy@yugabyte.com>
diff --git a/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/_index.md b/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/_index.md
@@ -66,11 +66,7 @@ Blog: [Using YugabyteDB xCluster DR for PostgreSQL Disaster Recovery in Azure](h
 
 - Currently, replication of DDL (SQL-level changes such as creating or dropping tables or indexes) is not supported. To make these changes requires first performing the DDL operation (for example, creating a table), and then adding the new object to replication in YugabyteDB Anywhere. In addition, xCluster does not support truncate operations. Refer to [Manage tables and indexes](./disaster-recovery-tables/).
 
-- DR setup (and other operations that require making a full copy from DR primary to DR replica, such as adding tables with data to replication, resuming replication after an extended network outage, and so on) may fail with the error `database "<database_name>" is being accessed by other users`.
-
-    This happens because the operation relies on a backup and restore of the database, and the restore will fail if there are any open connections to the DR replica.
-
-    To fix this, close any open SQL connections to the DR replica, delete the DR configuration, and perform the operation again.
+- If a database operation requires a full copy, any application sessions on the database on the DR target will be interrupted while the database is dropped and recreated. Your application should either retry connections or redirect reads to the DR primary.
 
 - Setting up DR between a universe upgraded to v2.20.x and a new v2.20.x universe is not supported. This is due to a limitation of xCluster deployments and packed rows. See [Packed row limitations](../../../architecture/docdb/packed-rows/#limitations).
 
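Note on the new full-copy limitation above: the added text says applications should "retry connections" while the database on the DR target is dropped and recreated. A minimal sketch of what that retry could look like on the client side follows. It is not taken from the YugabyteDB docs or any driver; `connect_with_retry`, its parameters, and the backoff values are all hypothetical, and the `connect` callable stands in for whatever your driver uses to open a session.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, OSError),
) -> T:
    """Call `connect` until it succeeds, backing off exponentially.

    All names and defaults here are illustrative assumptions, not
    values recommended by YugabyteDB.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            # Give up after the final attempt and surface the error.
            if attempt == attempts - 1:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ... capped.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

With a real driver you would pass that driver's connection-error type in `retry_on`; which exception a dropped session raises depends on the driver, so check its documentation rather than relying on the defaults shown here.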
diff --git a/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-failover.md b/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-failover.md
@@ -22,7 +22,7 @@ If the DR primary is terminated for some reason, do the following:
 
 1. Stop the application traffic to ensure no more updates are attempted.
 
-1. Navigate to your DR primary universe and select **xCluster Disaster Recovery**.
+1. Navigate to your DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 
 1. Note the **Potential data loss on failover** to understand the extent of possible data loss as a result of the outage, and determine if the extent of data loss is acceptable for your situation.
 
@@ -54,7 +54,7 @@ In both cases, repairing DR involves making a full copy of the databases through
 
 To repair DR, do the following:
 
-1. Navigate to your (new) DR primary universe and select **xCluster Disaster Recovery**.
+1. Navigate to your (new) DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 
 1. Click **Repair DR** to display the **Repair DR** dialog.
 
diff --git a/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-switchover.md b/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-switchover.md
@@ -20,7 +20,9 @@ Switchover can be used by enterprises when performing regular business continuit
 
 First, confirm there is no excessive lag between the DR primary and replica. You can [monitor lag](../disaster-recovery-setup/#monitor-replication) on the **xCluster Disaster Recovery** tab.
 
-If the DR config has any tables that don't have a replication status of Operational, switchover will be unsuccessful. In that case, you can do one of the following:
+While the switchover task is in progress, both universes are in read-only mode and reject write operations.
+
+If the DR configuration has any tables that don't have a replication status of Operational, switchover will be unsuccessful. In that case, you can do one of the following:
 
 - Perform a full copy from the DR primary to the DR replica.
 - [Unplanned Failover](../disaster-recovery-failover/).
@@ -33,7 +35,7 @@ Use the following steps to perform a planned switchover:
 
 1. Stop the application traffic on the DR primary.
 
-1. Navigate to your DR primary universe and select **xCluster Disaster Recovery**.
+1. Navigate to your DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 
 1. Click **Actions** and choose **Switchover**.
 
diff --git a/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-tables.md b/docs/content/preview/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-tables.md
@@ -52,7 +52,7 @@ Add tables to DR in the following sequence:
 
 1. Create the table on the DR primary (if it doesn't already exist).
 1. Create the table on the DR replica.
-1. Navigate to your DR primary and select **xCluster Disaster Recovery**.
+1. Navigate to your DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 1. Click **Actions** and choose **Select Databases and Tables**.
 1. Select the tables and click **Validate Selection**.
 1. If data needs to be copied, click **Next: Confirm Full Copy**.
@@ -74,7 +74,7 @@ When dropping a table, remove the table from DR before dropping the table in the
 
 Remove tables from DR in the following sequence:
 
-1. Navigate to your DR primary and select **xCluster Disaster Recovery**.
+1. Navigate to your DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 1. Click **Actions** and choose **Select Databases and Tables**.
 1. Deselect the tables and click **Validate Selection**.
 1. Click **Apply Changes**.
@@ -157,5 +157,5 @@ To remove a table partition from DR, follow the same steps as [Remove a table fr
 
 To ensure changes made outside of YugabyteDB Anywhere are reflected in YugabyteDB Anywhere, you need to reconcile the configuration as follows:
 
-1. In YugabyteDB Anywhere, navigate to your DR primary and select **xCluster Disaster Recovery**.
+1. Navigate to your DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 1. Click **Actions > Advanced** and choose **Reconcile Config with Database**.
diff --git a/docs/content/preview/yugabyte-platform/manage-deployments/xcluster-replication/xcluster-replication-setup.md b/docs/content/preview/yugabyte-platform/manage-deployments/xcluster-replication/xcluster-replication-setup.md
@@ -14,6 +14,8 @@ type: docs
 
 ## Prerequisites
 
+To set up or configure xCluster replication, you must be a Super Admin or Admin, or have a role with the Manage xCluster permission. For information on roles, refer to [Manage users](../../../administer-yugabyte-platform/anywhere-rbac/).
+
 Create the source and target universes for replication.
 
 Ensure the universes have the following characteristics:
@@ -215,7 +217,14 @@ To create an alert:
 
 1. Click **Save** when you are done.
 
-When replication is set up, YugabyteDB automatically creates an alert for _YSQL Tables in DR/xCluster Config Inconsistent With Primary/Source_. This alert fires when tables are added or dropped from source's databases under replication, but are not yet added or dropped from the YugabyteDB Anywhere replication configuration.
+When replication is set up, YugabyteDB automatically creates the alert _XCluster Config Tables are in bad state_. This alert fires when:
+
+- there is a table schema mismatch between DR primary and replica
+- tables are added or dropped from either DR primary or replica, but have not been added or dropped from the other.
+
+When you receive an alert, navigate to the replication configuration [Tables tab](#tables) to see the table status.
+
+YugabyteDB Anywhere collects these metrics every 2 minutes, and fires the alert within 10 minutes of the error.
 
 For more information on alerting in YugabyteDB Anywhere, refer to [Alerts](../../../alerts-monitoring/alert/).
 
diff --git a/docs/content/stable/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-switchover.md b/docs/content/stable/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-switchover.md
@@ -20,7 +20,9 @@ Switchover can be used by enterprises when performing regular business continuit
 
 First, confirm there is no excessive lag between the DR primary and replica. You can [monitor lag](../disaster-recovery-setup/#monitor-replication) on the **xCluster Disaster Recovery** tab.
 
-If the DR config has any tables that don't have a replication status of Operational, switchover will be unsuccessful. In that case, you can do one of the following:
+While the switchover task is in progress, both universes are in read-only mode and reject write operations.
+
+If the DR configuration has any tables that don't have a replication status of Operational, switchover will be unsuccessful. In that case, you can do one of the following:
 
 - Perform a full copy from the DR primary to the DR replica.
 - [Unplanned Failover](../disaster-recovery-failover/).
diff --git a/docs/content/v2024.2/yugabyte-platform/alerts-monitoring/alert-policy-templates.md b/docs/content/v2024.2/yugabyte-platform/alerts-monitoring/alert-policy-templates.md
@@ -83,6 +83,15 @@ Last snapshot task for universe `'$universe_name'` failed. To retry, check PITR
 min(ybp_pitr_config_status{universe_uuid = "__universeUuid__"}) {{ query_condition }} 1
 ```
 
+#### xCluster config tables are in bad state
+
+There are issues with table replication. For example, there is a table schema mismatch between primary and replica universes in xCluster replication or xCluster DR; or tables were added or dropped from either primary or replica, but have not been added or dropped from the other; or there was an extended network partition.
+
+```promql
+min(last_over_time(ybp_xcluster_table_status{source_universe_uuid
+= "__universeUuid__"}[5m])) {{ query_condition }} {{ query_threshold }}
+```
+
 ### DB templates
 
 #### DB compaction overload
@@ -659,14 +668,6 @@ YSQLSH connection failure detected for universe `'$universe_name'` on `$value` T
 count by (universe_uuid) (yb_node_ysql_connect{universe_uuid="__universeUuid__"} < 1) {{ query_condition }} {{ query_threshold }}
 ```
 
-#### New YSQL tables added
-
-New YSQL tables are added to the source universe `'$universe_name'` in the database with an existing xCluster configuration, but not added to the xCluster replication.
-
-```promql
-((count by (namespace_name, universe_uuid)(count by(namespace_name, table_id, universe_uuid)(rocksdb_current_version_sst_files_size{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) - count by(namespace_name, universe_uuid)(count by(namespace_name, universe_uuid, table_id)(async_replication_sent_lag_micros{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) {{ query_condition }} {{ query_threshold }}
-```
-
 #### Number of YSQL connections is high
 
 Number of YSQL connections for universe `'$universe_name'` is above `$threshold`. Current value is `$value`.
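Note on the `xCluster config tables are in bad state` template added earlier in this file: the expression contains three placeholders (`__universeUuid__`, `{{ query_condition }}`, and `{{ query_threshold }}`). As a rough illustration of how such a template expands into a concrete PromQL expression, here is a small sketch. The function `render_alert_expr` is hypothetical, and the example condition (`<`), threshold (`1`), and UUID are assumptions chosen only for illustration; the diff does not state the alert's default condition or threshold, and this is not how YugabyteDB Anywhere itself performs the substitution.

```python
# Comparison operators defined by PromQL.
PROMQL_COMPARISONS = {"==", "!=", ">", "<", ">=", "<="}

# Template text from the diff, with the line break collapsed to a space
# (PromQL ignores whitespace between tokens).
TEMPLATE = (
    "min(last_over_time(ybp_xcluster_table_status"
    '{source_universe_uuid = "__universeUuid__"}[5m])) '
    "{{ query_condition }} {{ query_threshold }}"
)


def render_alert_expr(
    template: str, universe_uuid: str, condition: str, threshold: float
) -> str:
    """Fill the three placeholders with concrete values (illustrative only)."""
    if condition not in PROMQL_COMPARISONS:
        raise ValueError(f"unsupported comparison operator: {condition!r}")
    return (
        template.replace("__universeUuid__", universe_uuid)
        .replace("{{ query_condition }}", condition)
        .replace("{{ query_threshold }}", format(threshold, "g"))
    )


# Hypothetical values: the UUID, operator, and threshold are made up.
expr = render_alert_expr(
    TEMPLATE, "00000000-0000-0000-0000-000000000000", "<", 1
)
print(expr)
```

Rendering the expression this way can make it easier to paste into a Prometheus query console when checking why the alert did or did not fire.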
diff --git a/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/_index.md b/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/_index.md
@@ -66,11 +66,7 @@ Blog: [Using YugabyteDB xCluster DR for PostgreSQL Disaster Recovery in Azure](h
 
 - Currently, replication of DDL (SQL-level changes such as creating or dropping tables or indexes) is not supported. To make these changes requires first performing the DDL operation (for example, creating a table), and then adding the new object to replication in YugabyteDB Anywhere. In addition, xCluster does not support truncate operations. Refer to [Manage tables and indexes](./disaster-recovery-tables/).
 
-- DR setup (and other operations that require making a full copy from DR primary to DR replica, such as adding tables with data to replication, resuming replication after an extended network outage, and so on) may fail with the error `database "<database_name>" is being accessed by other users`.
-
-    This happens because the operation relies on a backup and restore of the database, and the restore will fail if there are any open connections to the DR replica.
-
-    To fix this, close any open SQL connections to the DR replica, delete the DR configuration, and perform the operation again.
+- If a database operation requires a full copy, any application sessions on the database on the DR target will be interrupted while the database is dropped and recreated. Your application should either retry connections or redirect reads to the DR primary.
 
 - Setting up DR between a universe upgraded to v2.20.x and a new v2.20.x universe is not supported. This is due to a limitation of xCluster deployments and packed rows. See [Packed row limitations](../../../architecture/docdb/packed-rows/#limitations).
 
diff --git a/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-failover.md b/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-failover.md
@@ -22,7 +22,7 @@ If the DR primary is terminated for some reason, do the following:
 
 1. Stop the application traffic to ensure no more updates are attempted.
 
-1. Navigate to your DR primary universe and select **xCluster Disaster Recovery**.
+1. Navigate to your DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 
 1. Note the **Potential data loss on failover** to understand the extent of possible data loss as a result of the outage, and determine if the extent of data loss is acceptable for your situation.
 
@@ -54,7 +54,7 @@ In both cases, repairing DR involves making a full copy of the databases through
 
 To repair DR, do the following:
 
-1. Navigate to your (new) DR primary universe and select **xCluster Disaster Recovery**.
+1. Navigate to your (new) DR primary universe **xCluster Disaster Recovery** tab and select the replication configuration.
 
 1. Click **Repair DR** to display the **Repair DR** dialog.
 
diff --git a/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-setup.md b/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-setup.md
diff --git a/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-switchover.md b/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-switchover.md
diff --git a/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-tables.md b/docs/content/v2024.2/yugabyte-platform/back-up-restore-universes/disaster-recovery/disaster-recovery-tables.md
diff --git a/docs/content/v2024.2/yugabyte-platform/manage-deployments/xcluster-replication/xcluster-replication-setup.md b/docs/content/v2024.2/yugabyte-platform/manage-deployments/xcluster-replication/xcluster-replication-setup.md