From 3b5131ad1986b81d3da7e6ec26ca7e9e90cabbda Mon Sep 17 00:00:00 2001 From: houfaxin Date: Thu, 6 Nov 2025 10:52:13 +0800 Subject: [PATCH 1/8] Add docs for DDL-embedded Analyze and new system variable Introduces the 'Analyze Embedded in DDL' feature documentation and adds a new page describing its behavior for index creation and reorganization. Updates TOC to include the new doc and documents the 'tidb_stats_update_during_ddl' system variable, which controls this feature. --- TOC.md | 3 +- ddl_embedded_analyze.md | 175 ++++++++++++++++++++++++++++++++++++++++ system-variables.md | 8 ++ 3 files changed, 185 insertions(+), 1 deletion(-) create mode 100644 ddl_embedded_analyze.md diff --git a/TOC.md b/TOC.md index 2065601e434f4..035aecd6d772e 100644 --- a/TOC.md +++ b/TOC.md @@ -1078,7 +1078,8 @@ - [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) - [URI Formats of External Storage Services](/external-storage-uri.md) - [Interaction Test on Online Workloads and `ADD INDEX` Operations](/benchmark/online-workloads-and-add-index-operations.md) -- FAQs + - [Analyze Embedded in DDL](/ddl_embedded_analyze.md) + FAQs - [FAQ Summary](/faq/faq-overview.md) - [TiDB FAQs](/faq/tidb-faq.md) - [SQL FAQs](/faq/sql-faq.md) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md new file mode 100644 index 0000000000000..7ee9bbbb5b9a2 --- /dev/null +++ b/ddl_embedded_analyze.md @@ -0,0 +1,175 @@ +--- +title: Analyze Embedded in DDL +summary: This document describes the Analyze feature embedded in DDL for newly created or reorganized indexes, which ensures that statistics for new indexes are updated promptly. 
+--- + +# Analyze Embedded in DDL Introduced in v8.5.4 and v9.0.0 + +This document describes the Analyze feature embedded in the following two types of DDL: + +- DDL that creates new indexes: [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) +- DDLs that reorganize existing indexes: [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) + +When this feature is enabled, TiDB automatically runs an Analyze (statistics collection) operation before the new or reorganized index becomes visible to users. This prevents inaccurate optimizer estimates and potential plan changes caused by temporarily unavailable statistics after index creation or reorganization. + +## Use scenarios + +In scenarios where DDL operations alternately add or modify indexes, existing stable queries might suffer from estimation bias because the new index lacks statistics, causing the optimizer to choose suboptimal plans. For more information, see [Issue #57948](https://github.com/pingcap/tidb/issues/57948). 
+ +For example: + +```sql +CREATE TABLE t (a INT, b INT); +INSERT INTO t VALUES (1, 1), (2, 2), (3, 3); +INSERT INTO t SELECT * FROM t; -- * N times + +ALTER TABLE t ADD INDEX idx_a (a); + +EXPLAIN SELECT * FROM x WHERE a > 4; +``` + +``` ++-------------------------+-----------+-----------+---------------+--------------------------------+ +| id | estRows | task | access object | operator info | ++-------------------------+-----------+-----------+---------------+--------------------------------+ +| TableReader_8 | 131072.00 | root | | data:Selection_7 | +| └─Selection_7 | 131072.00 | cop[tikv] | | gt(test.x.a, 4) | +| └─TableFullScan_6 | 393216.00 | cop[tikv] | table:x | keep order:false, stats:pseudo | ++-------------------------+-----------+-----------+---------------+--------------------------------+ +3 rows in set (0.002 sec) +``` + +In the preceding plan, because the newly created index has no statistics yet, TiDB can only rely on heuristic rules for path estimation. Unless the index access path requires no table lookup and has a significantly lower cost, the optimizer tends to choose the more stable existing path. In the preceding example, it chooses a full table scan. However, from the data distribution perspective, `x.a > 4` actually returns 0 rows. If the new index `idx_a` were used, the query could quickly locate relevant rows and avoid the full table scan. In this example, because statistics are not promptly collected after the DDL creates the index, the generated plan is not optimal, but the optimizer continues to use the original plan so query performance does not sharply regress. However, according to [Issue #57948](https://github.com/pingcap/tidb/issues/57948), in some cases heuristics might cause an unreasonable comparison between old and new indexes, pruning the index that the original plan relies on and ultimately falling back to a full table scan. 
+ +Starting from v8.5.0, TiDB has improved heuristic comparisons between indexes and behaviors when statistics are missing. Still, in some complex scenarios, embedding Analyze in DDL is the best way to prevent plan changes. You can control whether to run embedded Analyze during index creation or reorganization with the system variable [`tidb_stats_update_during_ddl`](/system-variables.md#tidb_stats_update_during_ddl-new-in-v854-and-v900). The default value is `OFF`. + +## `ADD INDEX` DDL + +When `tidb_stats_update_during_ddl` is `ON`, executing [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) automatically runs an embedded Analyze after the Reorg phase finishes. This Analyze collects statistics for the newly created index before the index becomes visible to users, and then `ADD INDEX` proceeds with its remaining phases. + +Considering that Analyze can take time, TiDB sets a timeout threshold based on the execution time of the first Reorg. If Analyze times out, `ADD INDEX` will stop waiting synchronously for Analyze to finish and will continue the subsequent process so that the index becomes visible earlier to users. This means the index statistics will be updated after Analyze completes asynchronously. 
+ +Example: + +```sql +CREATE TABLE t (a INT, b INT, c INT); +Query OK, 0 rows affected (0.011 sec) + +INSERT INTO t VALUES (1, 1, 1), (2, 2, 2), (3, 3, 3); +Query OK, 3 rows affected (0.003 sec) +Records: 3 Duplicates: 0 Warnings: 0 + +SET @@tidb_stats_update_during_ddl = 1; +Query OK, 0 rows affected (0.001 sec) + +ALTER TABLE t ADD INDEX idx (a, b); +Query OK, 0 rows affected (0.049 sec) +``` + +```sql +EXPLAIN SELECT a FROM t WHERE a > 1; +``` + +``` ++------------------------+---------+-----------+--------------------------+----------------------------------+ +| id | estRows | task | access object | operator info | ++------------------------+---------+-----------+--------------------------+----------------------------------+ +| IndexReader_7 | 4.00 | root | | index:IndexRangeScan_6 | +| └─IndexRangeScan_6 | 4.00 | cop[tikv] | table:t, index:idx(a, b) | range:(1,+inf], keep order:false | ++------------------------+---------+-----------+--------------------------+----------------------------------+ +2 rows in set (0.002 sec) +``` + +```sql +SHOW STATS_HISTOGRAMS WHERE table_name = "t"; +``` + +``` ++---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+-------------+-----------------+----------------+----------------+---------------+ +| Db_name | Table_name | Partition_name | Column_name | Is_index | Update_time | Distinct_count | Null_count | Avg_col_size | Correlation | Load_status | Total_mem_usage | Hist_mem_usage | Topn_mem_usage | Cms_mem_usage | ++---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+-------------+-----------------+----------------+----------------+---------------+ +| test | t | | a | 0 | 2025-10-30 20:17:57 | 3 | 0 | 0.5 | 1 | allLoaded | 155 | 0 | 155 | 0 | +| test | t | | idx | 1 | 2025-10-30 20:17:57 | 3 | 0 | 0 | 0 | allLoaded | 182 | 0 | 182 | 0 | 
++---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+-------------+-----------------+----------------+----------------+---------------+ +2 rows in set (0.013 sec) +``` + +```sql +ADMIN SHOW DDL JOBS 1; +``` + +``` ++--------+---------+--------------------------+---------------+----------------------+-----------+----------+-----------+----------------------------+----------------------------+----------------------------+---------+----------------------------------------+ +| JOB_ID | DB_NAME | TABLE_NAME | JOB_TYPE | SCHEMA_STATE | SCHEMA_ID | TABLE_ID | ROW_COUNT | CREATE_TIME | START_TIME | END_TIME | STATE | COMMENTS | ++--------+---------+--------------------------+---------------+----------------------+-----------+----------+-----------+----------------------------+----------------------------+----------------------------+---------+----------------------------------------+ +| 151 | test | t | add index | write reorganization | 2 | 148 | 6291456 | 2025-10-29 00:14:47.181000 | 2025-10-29 00:14:47.183000 | NULL | running | analyzing, txn-merge, max_node_count=3 | ++--------+---------+--------------------------+---------------+----------------------+-----------+----------+-----------+----------------------------+----------------------------+----------------------------+---------+----------------------------------------+ +1 rows in set (0.001 sec) +``` + +In the `ADD INDEX` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the subsequent `EXPLAIN`, the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by running `SHOW STATS_HISTOGRAMS`). Therefore, the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and Analyze take long, you can check the DDL job status by executing `ADMIN SHOW DDL JOBS`. 
If the `COMMENTS` column contains `analyzing`, it means that the DDL job is collecting statistics. + +## DDL for reorganizing existing indexes + +When `tidb_stats_update_during_ddl` is `ON`, executing [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) or [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) that reorganizes an index will also run an embedded Analyze after the Reorg phase completes. The mechanism is the same as for `ADD INDEX`: + +- Start collecting statistics before the index becomes visible. +- If Analyze times out, [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) will not synchronously wait for Analyze to finish and will continue so the index becomes visible earlier to users. This means that the index statistics will be updated when Analyze finishes asynchronously. + +```sql +CREATE TABLE s (a VARCHAR(10), INDEX idx (a)); +Query OK, 0 rows affected (0.012 sec) + +INSERT INTO s VALUES (1), (2), (3); +Query OK, 3 rows affected (0.003 sec) +Records: 3 Duplicates: 0 Warnings: 0 + +SET @@tidb_stats_update_during_ddl = 1; +Query OK, 0 rows affected (0.001 sec) + +ALTER TABLE s MODIFY COLUMN a INT; +Query OK, 0 rows affected (0.056 sec) + +EXPLAIN SELECT * FROM s WHERE a > 1; +``` + +``` ++------------------------+---------+-----------+-----------------------+----------------------------------+ +| id | estRows | task | access object | operator info | ++------------------------+---------+-----------+-----------------------+----------------------------------+ +| IndexReader_7 | 2.00 | root | | index:IndexRangeScan_6 | +| └─IndexRangeScan_6 | 2.00 | cop[tikv] | table:s, index:idx(a) | range:(1,+inf], keep order:false | ++------------------------+---------+-----------+-----------------------+----------------------------------+ +2 rows in set (0.005 sec) +``` + +```sql +SHOW STATS_HISTOGRAMS WHERE table_name = "s"; +``` + +``` 
++---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+-------------+-----------------+----------------+----------------+---------------+ +| Db_name | Table_name | Partition_name | Column_name | Is_index | Update_time | Distinct_count | Null_count | Avg_col_size | Correlation | Load_status | Total_mem_usage | Hist_mem_usage | Topn_mem_usage | Cms_mem_usage | ++---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+-------------+-----------------+----------------+----------------+---------------+ +| test | s | | a | 0 | 2025-10-30 20:10:18 | 3 | 0 | 2 | 1 | allLoaded | 158 | 0 | 158 | 0 | +| test | s | | a | 0 | 2025-10-30 20:10:18 | 3 | 0 | 1 | 1 | allLoaded | 155 | 0 | 155 | 0 | +| test | s | | idx | 1 | 2025-10-30 20:10:18 | 3 | 0 | 0 | 0 | allLoaded | 158 | 0 | 158 | 0 | +| test | s | | idx | 1 | 2025-10-30 20:10:18 | 3 | 0 | 0 | 0 | allLoaded | 155 | 0 | 155 | 0 | ++---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+-------------+-----------------+----------------+----------------+---------------+ +4 rows in set (0.008 sec) +``` + +```sql +ADMIN SHOW DDL JOBS 1; +``` + +``` ++--------+---------+------------------+---------------+----------------------+-----------+----------+-----------+----------------------------+----------------------------+----------------------------+---------+-----------------------------+ +| JOB_ID | DB_NAME | TABLE_NAME | JOB_TYPE | SCHEMA_STATE | SCHEMA_ID | TABLE_ID | ROW_COUNT | CREATE_TIME | START_TIME | END_TIME | STATE | COMMENTS | 
++--------+---------+------------------+---------------+----------------------+-----------+----------+-----------+----------------------------+----------------------------+----------------------------+---------+-----------------------------+ +| 153 | test | s | modify column | write reorganization | 2 | 148 | 12582912 | 2025-10-29 00:26:49.240000 | 2025-10-29 00:26:49.244000 | NULL | running | analyzing | ++--------+---------+------------------+---------------+----------------------+-----------+----------+-----------+----------------------------+----------------------------+----------------------------+---------+-----------------------------+ +1 rows in set (0.001 sec) +``` + +From the `MODIFY COLUMN` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the following `EXPLAIN` the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by executing `SHOW STATS_HISTOGRAMS`), so the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and Analyze take long, check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it indicates that the DDL job is collecting statistics. diff --git a/system-variables.md b/system-variables.md index 6ad5c710e3ace..40e1df7a28be9 100644 --- a/system-variables.md +++ b/system-variables.md @@ -1624,6 +1624,14 @@ mysql> SELECT job_info FROM mysql.analyze_jobs ORDER BY end_time DESC LIMIT 1; +### tidb_stats_update_during_ddl New in v8.5.4 and v9.0.0 + +- Scope: GLOBAL +- Persists to cluster: Yes +- Applies to hint [SET_VAR](/optimizer-hints.md#set_varvar_namevar_value): No +- Default value: `OFF` +- This variable controls whether to enable DDL-embedded Analyze. 
When enabled, DDL statements that create new indexes ([`ADD INDEX`](/sql-statements/sql-statement-add-index.md) and DDLs that reorganize existing indexes ([`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md)) automatically run statistics collection before the index becomes visible. For more information, see [DDL-Embedded Analyze](/ddl_embedded_analyze.md). + ### tidb_enable_dist_task New in v7.1.0 - Scope: GLOBAL From 16d18a59498cbcae0c6ba1a27e790f95b2e62b90 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Thu, 6 Nov 2025 11:15:55 +0800 Subject: [PATCH 2/8] Apply suggestions from code review Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- TOC.md | 2 +- ddl_embedded_analyze.md | 28 ++++++++++++++-------------- system-variables.md | 2 +- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/TOC.md b/TOC.md index 035aecd6d772e..8e7fb55126249 100644 --- a/TOC.md +++ b/TOC.md @@ -1078,7 +1078,7 @@ - [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) - [URI Formats of External Storage Services](/external-storage-uri.md) - [Interaction Test on Online Workloads and `ADD INDEX` Operations](/benchmark/online-workloads-and-add-index-operations.md) - - [Analyze Embedded in DDL](/ddl_embedded_analyze.md) + - [`ANALYZE` Embedded in DDL Statements](/ddl_embedded_analyze.md) FAQs - [FAQ Summary](/faq/faq-overview.md) - [TiDB FAQs](/faq/tidb-faq.md) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md index 7ee9bbbb5b9a2..029e2fb859f34 100644 --- a/ddl_embedded_analyze.md +++ b/ddl_embedded_analyze.md @@ -1,16 +1,16 @@ --- -title: Analyze Embedded in DDL -summary: This document describes the Analyze feature embedded in DDL for newly created or reorganized indexes, which ensures that statistics for new indexes are updated promptly. 
+title: `ANALYZE` Embedded in DDL +summary: This document describes the `ANALYZE` feature embedded in DDL statements for newly created or reorganized indexes, which ensures that statistics for new indexes are updated promptly. --- -# Analyze Embedded in DDL Introduced in v8.5.4 and v9.0.0 +# `ANALYZE` Embedded in DDL Statements Introduced in v8.5.4 and v9.0.0 -This document describes the Analyze feature embedded in the following two types of DDL: +This document describes the `ANALYZE` feature embedded in the following two types of DDL statements: -- DDL that creates new indexes: [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) -- DDLs that reorganize existing indexes: [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) +- DDL statements that create new indexes: [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) +- DDL statements that reorganize existing indexes: [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) -When this feature is enabled, TiDB automatically runs an Analyze (statistics collection) operation before the new or reorganized index becomes visible to users. This prevents inaccurate optimizer estimates and potential plan changes caused by temporarily unavailable statistics after index creation or reorganization. +When this feature is enabled, TiDB automatically runs an `ANALYZE` (statistics collection) operation before the new or reorganized index becomes visible to users. This prevents inaccurate optimizer estimates and potential plan changes caused by temporarily unavailable statistics after index creation or reorganization. ## Use scenarios @@ -41,13 +41,13 @@ EXPLAIN SELECT * FROM x WHERE a > 4; In the preceding plan, because the newly created index has no statistics yet, TiDB can only rely on heuristic rules for path estimation. 
Unless the index access path requires no table lookup and has a significantly lower cost, the optimizer tends to choose the more stable existing path. In the preceding example, it chooses a full table scan. However, from the data distribution perspective, `x.a > 4` actually returns 0 rows. If the new index `idx_a` were used, the query could quickly locate relevant rows and avoid the full table scan. In this example, because statistics are not promptly collected after the DDL creates the index, the generated plan is not optimal, but the optimizer continues to use the original plan so query performance does not sharply regress. However, according to [Issue #57948](https://github.com/pingcap/tidb/issues/57948), in some cases heuristics might cause an unreasonable comparison between old and new indexes, pruning the index that the original plan relies on and ultimately falling back to a full table scan. -Starting from v8.5.0, TiDB has improved heuristic comparisons between indexes and behaviors when statistics are missing. Still, in some complex scenarios, embedding Analyze in DDL is the best way to prevent plan changes. You can control whether to run embedded Analyze during index creation or reorganization with the system variable [`tidb_stats_update_during_ddl`](/system-variables.md#tidb_stats_update_during_ddl-new-in-v854-and-v900). The default value is `OFF`. +Starting from v8.5.0, TiDB has improved heuristic comparisons between indexes and behaviors when statistics are missing. Still, in some complex scenarios, embedding `ANALYZE` in DDL is the best way to prevent plan changes. You can control whether to run embedded `ANALYZE` during index creation or reorganization with the system variable [`tidb_stats_update_during_ddl`](/system-variables.md#tidb_stats_update_during_ddl-new-in-v854-and-v900). The default value is `OFF`. 
## `ADD INDEX` DDL -When `tidb_stats_update_during_ddl` is `ON`, executing [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) automatically runs an embedded Analyze after the Reorg phase finishes. This Analyze collects statistics for the newly created index before the index becomes visible to users, and then `ADD INDEX` proceeds with its remaining phases. +When `tidb_stats_update_during_ddl` is `ON`, executing [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) automatically runs an embedded `ANALYZE` operation after the Reorg phase finishes. This `ANALYZE` operation collects statistics for the newly created index before the index becomes visible to users, and then `ADD INDEX` proceeds with its remaining phases. -Considering that Analyze can take time, TiDB sets a timeout threshold based on the execution time of the first Reorg. If Analyze times out, `ADD INDEX` will stop waiting synchronously for Analyze to finish and will continue the subsequent process so that the index becomes visible earlier to users. This means the index statistics will be updated after Analyze completes asynchronously. +Considering that `ANALYZE` can take time, TiDB sets a timeout threshold based on the execution time of the first Reorg. If `ANALYZE` times out, `ADD INDEX` will stop waiting synchronously for `ANALYZE` to finish and will continue the subsequent process so that the index becomes visible earlier to users. This means the index statistics will be updated after `ANALYZE` completes asynchronously. Example: @@ -107,14 +107,14 @@ ADMIN SHOW DDL JOBS 1; 1 rows in set (0.001 sec) ``` -In the `ADD INDEX` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the subsequent `EXPLAIN`, the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by running `SHOW STATS_HISTOGRAMS`). Therefore, the optimizer can immediately use those statistics for a range scan. 
If index creation or reorganization and Analyze take long, you can check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it means that the DDL job is collecting statistics. +In the `ADD INDEX` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the subsequent `EXPLAIN`, the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by running `SHOW STATS_HISTOGRAMS`). Therefore, the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and `ANALYZE` take a long time, you can check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it means that the DDL job is collecting statistics. ## DDL for reorganizing existing indexes -When `tidb_stats_update_during_ddl` is `ON`, executing [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) or [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) that reorganizes an index will also run an embedded Analyze after the Reorg phase completes. The mechanism is the same as for `ADD INDEX`: +When `tidb_stats_update_during_ddl` is `ON`, executing [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) or [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) that reorganizes an index will also run an embedded `ANALYZE` operation after the Reorg phase completes. The mechanism is the same as for `ADD INDEX`: - Start collecting statistics before the index becomes visible. -- If Analyze times out, [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) will not synchronously wait for Analyze to finish and will continue so the index becomes visible earlier to users. This means that the index statistics will be updated when Analyze finishes asynchronously. 
+- If `ANALYZE` times out, [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) will not synchronously wait for `ANALYZE` to finish and will continue so the index becomes visible earlier to users. This means that the index statistics will be updated when `ANALYZE` finishes asynchronously. ```sql CREATE TABLE s (a VARCHAR(10), INDEX idx (a)); @@ -172,4 +172,4 @@ ADMIN SHOW DDL JOBS 1; 1 rows in set (0.001 sec) ``` -From the `MODIFY COLUMN` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the following `EXPLAIN` the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by executing `SHOW STATS_HISTOGRAMS`), so the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and Analyze take long, check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it indicates that the DDL job is collecting statistics. +From the `MODIFY COLUMN` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the following `EXPLAIN` the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by executing `SHOW STATS_HISTOGRAMS`), so the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and `ANALYZE` take a long time, check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it indicates that the DDL job is collecting statistics. 
diff --git a/system-variables.md b/system-variables.md index 40e1df7a28be9..5a6511e40e737 100644 --- a/system-variables.md +++ b/system-variables.md @@ -1630,7 +1630,7 @@ mysql> SELECT job_info FROM mysql.analyze_jobs ORDER BY end_time DESC LIMIT 1; - Persists to cluster: Yes - Applies to hint [SET_VAR](/optimizer-hints.md#set_varvar_namevar_value): No - Default value: `OFF` -- This variable controls whether to enable DDL-embedded Analyze. When enabled, DDL statements that create new indexes ([`ADD INDEX`](/sql-statements/sql-statement-add-index.md) and DDLs that reorganize existing indexes ([`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md)) automatically run statistics collection before the index becomes visible. For more information, see [DDL-Embedded Analyze](/ddl_embedded_analyze.md). +- This variable controls whether to enable DDL-embedded `ANALYZE`. When enabled, DDL statements that create new indexes ([`ADD INDEX`](/sql-statements/sql-statement-add-index.md)) and DDL statements that reorganize existing indexes ([`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md)) automatically run statistics collection before the index becomes visible. For more information, see [`ANALYZE` Embedded in DDL Statements](/ddl_embedded_analyze.md). 
### tidb_enable_dist_task New in v7.1.0 From efc49b0823cdac4a8f3799cb748dce9238cb76f1 Mon Sep 17 00:00:00 2001 From: Arenatlx Date: Thu, 6 Nov 2025 15:27:57 +0800 Subject: [PATCH 3/8] Update ddl_embedded_analyze.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- ddl_embedded_analyze.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md index 029e2fb859f34..49d9e7536316a 100644 --- a/ddl_embedded_analyze.md +++ b/ddl_embedded_analyze.md @@ -25,7 +25,7 @@ INSERT INTO t SELECT * FROM t; -- * N times ALTER TABLE t ADD INDEX idx_a (a); -EXPLAIN SELECT * FROM x WHERE a > 4; +EXPLAIN SELECT * FROM t WHERE a > 4; ``` ``` From 9db4f6a1260c8f0b80fca4bc8b90b042aae7eb8d Mon Sep 17 00:00:00 2001 From: Arenatlx Date: Thu, 6 Nov 2025 19:24:25 +0800 Subject: [PATCH 4/8] Update ddl_embedded_analyze.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- ddl_embedded_analyze.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md index 49d9e7536316a..13cbb2e0b1b4b 100644 --- a/ddl_embedded_analyze.md +++ b/ddl_embedded_analyze.md @@ -33,8 +33,8 @@ EXPLAIN SELECT * FROM t WHERE a > 4; | id | estRows | task | access object | operator info | +-------------------------+-----------+-----------+---------------+--------------------------------+ | TableReader_8 | 131072.00 | root | | data:Selection_7 | -| └─Selection_7 | 131072.00 | cop[tikv] | | gt(test.x.a, 4) | -| └─TableFullScan_6 | 393216.00 | cop[tikv] | table:x | keep order:false, stats:pseudo | +| └─Selection_7 | 131072.00 | cop[tikv] | | gt(test.t.a, 4) | +| └─TableFullScan_6 | 393216.00 | cop[tikv] | table:t | keep order:false, stats:pseudo | +-------------------------+-----------+-----------+---------------+--------------------------------+ 3 rows in set (0.002 sec) ``` From 
de4990501a421b09e8d8a8d81fd66ceb310399cd Mon Sep 17 00:00:00 2001 From: Arenatlx Date: Thu, 6 Nov 2025 19:24:38 +0800 Subject: [PATCH 5/8] Update ddl_embedded_analyze.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- ddl_embedded_analyze.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md index 13cbb2e0b1b4b..e92eddf85edf2 100644 --- a/ddl_embedded_analyze.md +++ b/ddl_embedded_analyze.md @@ -39,7 +39,7 @@ EXPLAIN SELECT * FROM t WHERE a > 4; 3 rows in set (0.002 sec) ``` -In the preceding plan, because the newly created index has no statistics yet, TiDB can only rely on heuristic rules for path estimation. Unless the index access path requires no table lookup and has a significantly lower cost, the optimizer tends to choose the more stable existing path. In the preceding example, it chooses a full table scan. However, from the data distribution perspective, `x.a > 4` actually returns 0 rows. If the new index `idx_a` were used, the query could quickly locate relevant rows and avoid the full table scan. In this example, because statistics are not promptly collected after the DDL creates the index, the generated plan is not optimal, but the optimizer continues to use the original plan so query performance does not sharply regress. However, according to [Issue #57948](https://github.com/pingcap/tidb/issues/57948), in some cases heuristics might cause an unreasonable comparison between old and new indexes, pruning the index that the original plan relies on and ultimately falling back to a full table scan. +In the preceding plan, because the newly created index has no statistics yet, TiDB can only rely on heuristic rules for path estimation. Unless the index access path requires no table lookup and has a significantly lower cost, the optimizer tends to choose the more stable existing path. In the preceding example, it chooses a full table scan. 
However, from the data distribution perspective, `t.a > 4` actually returns 0 rows. If the new index `idx_a` were used, the query could quickly locate relevant rows and avoid the full table scan. In this example, because statistics are not promptly collected after the DDL creates the index, the generated plan is not optimal, but the optimizer continues to use the original plan so query performance does not sharply regress. However, according to [Issue #57948](https://github.com/pingcap/tidb/issues/57948), in some cases heuristics might cause an unreasonable comparison between old and new indexes, pruning the index that the original plan relies on and ultimately falling back to a full table scan. Starting from v8.5.0, TiDB has improved heuristic comparisons between indexes and behaviors when statistics are missing. Still, in some complex scenarios, embedding `ANALYZE` in DDL is the best way to prevent plan changes. You can control whether to run embedded `ANALYZE` during index creation or reorganization with the system variable [`tidb_stats_update_during_ddl`](/system-variables.md#tidb_stats_update_during_ddl-new-in-v854-and-v900). The default value is `OFF`. From f119a11fcf67f6092ba2e4e5378ff1bad5805b81 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Thu, 6 Nov 2025 19:26:21 +0800 Subject: [PATCH 6/8] Update ddl_embedded_analyze.md --- ddl_embedded_analyze.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md index e92eddf85edf2..0591c553eb87f 100644 --- a/ddl_embedded_analyze.md +++ b/ddl_embedded_analyze.md @@ -1,5 +1,5 @@ --- -title: `ANALYZE` Embedded in DDL +title: `ANALYZE` Embedded in DDL Statements summary: This document describes the `ANALYZE` feature embedded in DDL statements for newly created or reorganized indexes, which ensures that statistics for new indexes are updated promptly. 
--- From 0cbae9c7b7e50a76da212020a5c8a24b78d64e21 Mon Sep 17 00:00:00 2001 From: houfaxin Date: Thu, 6 Nov 2025 19:35:16 +0800 Subject: [PATCH 7/8] Update TOC-tidb-cloud.md --- TOC-tidb-cloud.md | 1 + 1 file changed, 1 insertion(+) diff --git a/TOC-tidb-cloud.md b/TOC-tidb-cloud.md index 5733a733d5795..edca58a37e1f0 100644 --- a/TOC-tidb-cloud.md +++ b/TOC-tidb-cloud.md @@ -735,6 +735,7 @@ - [Table Filter](/table-filter.md) - [URI Formats of External Storage Services](/external-storage-uri.md) - [DDL Execution Principles and Best Practices](/ddl-introduction.md) + - [`ANALYZE` Embedded in DDL Statements](/ddl_embedded_analyze.md) - [Batch Processing](/batch-processing.md) - [Troubleshoot Inconsistency Between Data and Indexes](/troubleshoot-data-inconsistency-errors.md) - [Notifications](/tidb-cloud/notifications.md) From 2aefccdf5d07424b5cbb115e1051d17f11098492 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Mon, 10 Nov 2025 10:47:08 +0800 Subject: [PATCH 8/8] Apply suggestions from code review Co-authored-by: Grace Cai --- ddl_embedded_analyze.md | 14 ++++++++------ system-variables.md | 2 +- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/ddl_embedded_analyze.md b/ddl_embedded_analyze.md index 0591c553eb87f..9de981b1c2467 100644 --- a/ddl_embedded_analyze.md +++ b/ddl_embedded_analyze.md @@ -12,7 +12,7 @@ This document describes the `ANALYZE` feature embedded in the following two type When this feature is enabled, TiDB automatically runs an `ANALYZE` (statistics collection) operation before the new or reorganized index becomes visible to users. This prevents inaccurate optimizer estimates and potential plan changes caused by temporarily unavailable statistics after index creation or reorganization. 
-## Use scenarios +## Usage scenarios In scenarios where DDL operations alternately add or modify indexes, existing stable queries might suffer from estimation bias because the new index lacks statistics, causing the optimizer to choose suboptimal plans. For more information, see [Issue #57948](https://github.com/pingcap/tidb/issues/57948). @@ -47,9 +47,9 @@ Starting from v8.5.0, TiDB has improved heuristic comparisons between indexes an When `tidb_stats_update_during_ddl` is `ON`, executing [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) automatically runs an embedded `ANALYZE` operation after the Reorg phase finishes. This `ANALYZE` operation collects statistics for the newly created index before the index becomes visible to users, and then `ADD INDEX` proceeds with its remaining phases. -Considering that `ANALYZE` can take time, TiDB sets a timeout threshold based on the execution time of the first Reorg. If `ANALYZE` times out, `ADD INDEX` will stop waiting synchronously for `ANALYZE` to finish and will continue the subsequent process so that the index becomes visible earlier to users. This means the index statistics will be updated after `ANALYZE` completes asynchronously. +Considering that `ANALYZE` can take time, TiDB sets a timeout threshold based on the execution time of the first Reorg. If `ANALYZE` times out, `ADD INDEX` stops waiting synchronously for `ANALYZE` to finish and continues the subsequent process, making the index visible earlier to users. This means the index statistics will be updated after `ANALYZE` completes asynchronously. -Example: +For example: ```sql CREATE TABLE t (a INT, b INT, c INT); @@ -107,14 +107,16 @@ ADMIN SHOW DDL JOBS 1; 1 rows in set (0.001 sec) ``` -In the `ADD INDEX` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the subsequent `EXPLAIN`, the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by running `SHOW STATS_HISTOGRAMS`). 
Therefore, the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and `ANALYZE` take a long time, you can check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it means that the DDL job is collecting statistics. +From the `ADD INDEX` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that after the execution of the `ADD INDEX` DDL statement, the subsequent `EXPLAIN` output shows that statistics for the index `idx` have been automatically collected and loaded into memory (you can verify it by executing `SHOW STATS_HISTOGRAMS`). As a result, the optimizer can immediately use these statistics for range scans. If index creation or reorganization and `ANALYZE` take a long time, you can check the DDL job status by executing `ADMIN SHOW DDL JOBS`. When the `COMMENTS` column in the output contains `analyzing`, it means that the DDL job is collecting statistics. ## DDL for reorganizing existing indexes When `tidb_stats_update_during_ddl` is `ON`, executing [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) or [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) that reorganizes an index will also run an embedded `ANALYZE` operation after the Reorg phase completes. The mechanism is the same as for `ADD INDEX`: - Start collecting statistics before the index becomes visible. -- If `ANALYZE` times out, [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) will not synchronously wait for `ANALYZE` to finish and will continue so the index becomes visible earlier to users. This means that the index statistics will be updated when `ANALYZE` finishes asynchronously. 
+- If `ANALYZE` times out, [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) stop waiting synchronously for `ANALYZE` to finish and continue the subsequent process, making the index visible earlier to users. This means that the index statistics will be updated when `ANALYZE` finishes asynchronously. + +For example: ```sql CREATE TABLE s (a VARCHAR(10), INDEX idx (a)); @@ -172,4 +174,4 @@ ADMIN SHOW DDL JOBS 1; 1 rows in set (0.001 sec) ``` -From the `MODIFY COLUMN` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that in the following `EXPLAIN` the index `idx` has its statistics automatically collected and loaded into memory (you can verify it by executing `SHOW STATS_HISTOGRAMS`), so the optimizer can immediately use those statistics for a range scan. If index creation or reorganization and `ANALYZE` take a long time, check the DDL job status by executing `ADMIN SHOW DDL JOBS`. If the `COMMENTS` column contains `analyzing`, it indicates that the DDL job is collecting statistics. +From the `MODIFY COLUMN` example, when `tidb_stats_update_during_ddl` is `ON`, you can see that after the execution of the `MODIFY COLUMN` DDL statement, the subsequent `EXPLAIN` output shows that statistics for the index `idx` have been automatically collected and loaded into memory (you can verify it by executing `SHOW STATS_HISTOGRAMS`). As a result, the optimizer can immediately use these statistics for range scans. If index creation or reorganization and `ANALYZE` take a long time, you can check the DDL job status by executing `ADMIN SHOW DDL JOBS`. When the `COMMENTS` column in the output contains `analyzing`, it means that the DDL job is collecting statistics. 
diff --git a/system-variables.md b/system-variables.md index 5a6511e40e737..e8ce03fa70656 100644 --- a/system-variables.md +++ b/system-variables.md @@ -1630,7 +1630,7 @@ mysql> SELECT job_info FROM mysql.analyze_jobs ORDER BY end_time DESC LIMIT 1; - Persists to cluster: Yes - Applies to hint [SET_VAR](/optimizer-hints.md#set_varvar_namevar_value): No - Default value: `OFF` -- This variable controls whether to enable DDL-embedded `ANALYZE`. When enabled, DDL statements that create new indexes ([`ADD INDEX`](/sql-statements/sql-statement-add-index.md)) and DDL statements that reorganize existing indexes ([`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md)) automatically run statistics collection before the index becomes visible. For more information, see [`ANALYZE` Embedded in DDL Statements](/ddl_embedded_analyze.md). +- This variable controls whether to enable DDL-embedded `ANALYZE`. When enabled, DDL statements that create new indexes ([`ADD INDEX`](/sql-statements/sql-statement-add-index.md)) or reorganize existing indexes ([`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) and [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md)) automatically collect statistics before the index becomes visible. For more information, see [`ANALYZE` Embedded in DDL Statements](/ddl_embedded_analyze.md). ### tidb_enable_dist_task New in v7.1.0