Commit a0007bf

premkumr and ddhodge authored
Adding page for CBO (yugabyte#24736)
* Adding CBO
* tidyups
* edit
* fixes from review
* Update docs/content/preview/architecture/query-layer/planner-optimizer.md
* added RBO and other details
* minor fixes
* Adding yb_reset_analyze_statistics correctly
* feedback from Mihnea
* feedback from Mihnea
* fix links
* Apply suggestions from code review
* backport
* correction
* minor edits

---------

Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
1 parent 65f27b1 commit a0007bf

19 files changed (+488, -316 lines)

docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md

Lines changed: 14 additions & 2 deletions
@@ -14,9 +14,9 @@ type: docs
 
 ## Synopsis
 
-ANALYZE collects statistics about the contents of tables in the database, and stores the results in the `pg_statistic` system catalog. These statistics help the query planner to determine the most efficient execution plans for queries.
+ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries.
 
-The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution.
+The statistics are also used by the YugabyteDB [cost based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution.
 
 {{< warning title="Run ANALYZE manually" >}}
 Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually.
@@ -51,6 +51,18 @@ Table name to be analyzed; may be schema-qualified. Optional. Omit to analyze al
 
 List of columns to be analyzed. Optional. Omit to analyze all columns of the table.
 
+## Reset statistics
+
+Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. Also, when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant.
+
+The `yb_reset_analyze_statistics()` function is a convenient helper that offers an easy way to clear statistics collected for a specific table or for all tables in a database. Call this function as follows:
+
+```sql
+SELECT yb_reset_analyze_statistics ( table_oid );
+```
+
+If table_oid is NULL, this function resets the statistics for all the tables in the current database that the user can analyze.
+
 ## Examples
 
 ### Analyze a single table
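The ANALYZE statistics described in `cmd_analyze.md` above are what the planner and CBO use to estimate how many rows a predicate matches. The following Python sketch is a simplified illustration, not YugabyteDB or PostgreSQL code: it computes a few per-column statistics named after columns of PostgreSQL's `pg_stats` view (a readable view over `pg_statistic`) and uses them for an equality-selectivity estimate. The functions `analyze_column` and `eq_selectivity` are hypothetical, and the sketch reads every value instead of sampling rows.

```python
from collections import Counter


def analyze_column(values, num_mcv=2, num_buckets=2):
    """Compute toy per-column statistics named after pg_stats columns.

    Simplifications (assumptions, not PostgreSQL's algorithm): every value is
    read instead of a random row sample, n_distinct is an exact count, and
    histogram bounds are taken at evenly spaced positions.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_frac = (total - len(non_null)) / total if total else 0.0
    counts = Counter(non_null)
    mcv_pairs = counts.most_common(num_mcv)
    most_common_vals = [v for v, _ in mcv_pairs]
    most_common_freqs = [c / total for _, c in mcv_pairs]
    # Build the histogram only from values not already covered by the MCV list.
    mcv_set = set(most_common_vals)
    rest = sorted(v for v in non_null if v not in mcv_set)
    histogram_bounds = []
    if len(rest) >= 2:
        last = len(rest) - 1
        histogram_bounds = [rest[i * last // num_buckets] for i in range(num_buckets + 1)]
    return {
        "null_frac": null_frac,
        "n_distinct": len(counts),
        "most_common_vals": most_common_vals,
        "most_common_freqs": most_common_freqs,
        "histogram_bounds": histogram_bounds,
    }


def eq_selectivity(stats, value):
    """Estimate the fraction of rows matching `col = value`."""
    if value in stats["most_common_vals"]:
        return stats["most_common_freqs"][stats["most_common_vals"].index(value)]
    # Spread the remaining non-null, non-MCV fraction evenly over the other distinct values.
    remaining_frac = 1.0 - stats["null_frac"] - sum(stats["most_common_freqs"])
    remaining_distinct = stats["n_distinct"] - len(stats["most_common_vals"])
    return remaining_frac / remaining_distinct if remaining_distinct > 0 else 0.0
```

For example, for the column values `[1, 1, 1, 2, 2, 3, 4, 5, None, None]` the sketch reports a `null_frac` of 0.2 and estimates that `col = 1` matches 0.3 of the rows. Stale statistics feed stale numbers into exactly this kind of estimate, which is why the page recommends re-running ANALYZE.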

docs/content/preview/architecture/docdb/lsm-sst.md

Lines changed: 36 additions & 10 deletions
@@ -12,51 +12,77 @@ menu:
 type: docs
 ---
 
-A log-structured merge-tree (LSM tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads.
+A [log-structured merge-tree (LSM tree)](https://en.wikipedia.org/wiki/Log-structured_merge-tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads.
 
 The core idea behind an LSM tree is to separate the write and read paths, allowing writes to be sequential and buffered in memory making them faster than random writes, while reads can still access data efficiently through a hierarchical structure of sorted files on disk.
 
-An LSM tree has 2 primary components - Memtable and SSTs. Let's look into each of them in detail and understand how they work during writes and reads.
+An LSM tree has 2 primary components - [Memtable](#memtable) and [Sorted String Tables (SSTs)](#sst). Let's look into each of them in detail and understand how they work during writes and reads.
 
 {{<note>}}
 Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses the Raft logs for this purpose. For more details, see [Raft log vs LSM WAL](../performance/#raft-vs-rocksdb-wal-logs).
 {{</note>}}
 
 ## Comparison to B-tree
 
-Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree) based storage system. But YugabyteDB had to chose an LSM based storage to build a highly scalable database for of the following reasons.
+Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree)-based storage system. But Yugabyte chose LSM-based storage to build a highly scalable database for the following reasons:
 
-- Write operations (insert, update, delete) are more expensive in a B-tree. As it involves random writes and in place node splitting and rebalancing. In an LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch.
+- Write operations (insert, update, delete) are more expensive in a B-tree, requiring random writes and in-place node splitting and rebalancing. In LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch.
 - The append-only nature of LSM makes it more efficient for concurrent write operations.
 
 ## Memtable
 
-All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a Memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the Memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that Memtable.
+All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that memtable.
 
-The immutable Memtable is then flushed to disk as an SST (Sorted String Table) file. This process involves writing the key-value pairs from the Memtable to disk in a sorted order, creating an SST file. DocDB maintains one active Memtable, and utmost one immutable Memtable at any point in time. This ensures that write operations can continue to be processed in the active Memtable, when the immutable memtable is being flushed to disk.
+## Flush to SST
+
+The immutable [memtable](#memtable) is then flushed to disk as an [SST (Sorted String Table)](#sst) file. This process involves writing the key-value pairs from the memtable to disk in a sorted order, creating an SST file. DocDB maintains one active memtable, and at most one immutable memtable at any point in time. This ensures that write operations can continue to be processed in the active memtable while the immutable memtable is being flushed to disk.
 
 ## SST
 
-Each SST (Sorted String Table) file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks.
+Each SST file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks.
 
 Each SST file contains a bloom filter, which is a space-efficient data structure that helps quickly determine whether a key might exist in that file or not, avoiding unnecessary disk reads.
 
 {{<note>}}
-Most LSMs organize SSTS into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0).
+Most LSMs organize SSTs into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0).
 {{</note>}}
 
+Three core low-level operations are used to iterate through the data in SST files.
+
+### Seek
+
+The _seek_ operation is used to locate a specific key or position in an SST file or memtable. When performing a seek, the system attempts to jump directly to the position of the specified key. If the exact key is not found, seek positions the iterator at the closest key that is greater than or equal to the specified key, enabling efficient range scans or prefix matching.
+
+### Next
+
+The _next_ operation moves the iterator to the following key in sorted order. It is typically used for sequential reads or scans, where a query iterates over multiple keys, such as retrieving a range of rows. After a seek, a sequence of next operations can scan through keys in ascending order.
+
+### Previous
+
+The _previous_ operation moves the iterator to the preceding key in sorted order. It is useful for reverse scans or for reading records in descending order. This is important for cases where backward traversal is required, such as reverse range queries. For example, after seeking to a key near the end of a range, previous can be used to iterate through keys in descending order, often needed in order-by-descending queries.
+
 ## Write path
 
-When new data is written to the LSM system, it is first inserted into the active Memtable. As the Memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups.
+When new data is written to the LSM system, it is first inserted into the active memtable. As the memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups.
 
 ## Read Path
 
-To read a key, the LSM tree first checks the Memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs.
+To read a key, the LSM tree first checks the memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs.
+
+## Delete path
+
+Rather than immediately removing the key from SSTs, the delete operation marks a key as deleted using a tombstone marker, indicating that the key should be ignored in future reads. The actual deletion happens during [compaction](#compaction), when tombstones are removed along with the data they mark as deleted.
 
 ## Compaction
 
 As data accumulates in SSTs, a process called compaction merges and sorts the SST files with overlapping key ranges producing a new set of SST files. The merge process during compaction helps to organize and sort the data, maintaining a consistent on-disk format and reclaiming space from obsolete data versions.
 
+The [YB-TServer](../../yb-tserver/) manages multiple compaction queues and enforces throttling to avoid compaction storms. Although full compactions can be scheduled, they can also be triggered manually. Full compactions are also triggered automatically if the system detects tombstones and obsolete keys affecting read performance.
+
+{{<lead link="../../yb-tserver/">}}
+To learn more about YB-TServer compaction operations, refer to [YB-TServer](../../yb-tserver/)
+{{</lead>}}
+
 ## Learn more
 
 - [Blog: Background Compactions in YugabyteDB](https://www.yugabyte.com/blog/background-data-compaction/#what-is-a-data-compaction)
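The memtable and flush behavior described in `lsm-sst.md` above (one active memtable, at most one immutable memtable waiting to be flushed) can be modeled in a few lines. This is a minimal sketch under stated assumptions, not DocDB code: the class `LSMWriter` is hypothetical, the size limit counts entries rather than bytes, a plain dict stands in for the sorted memtable, and flushing runs synchronously instead of in the background.

```python
class LSMWriter:
    """Toy LSM write path: one active memtable, at most one immutable one.

    Assumptions for brevity: the size limit counts entries rather than bytes,
    a plain dict stands in for the sorted memtable (keys are sorted at flush
    time), and flush() runs synchronously when called.
    """

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.active = {}         # accepts all new writes
        self.immutable = None    # frozen memtable waiting to be flushed
        self.sst_files = []      # each SST is a sorted list of (key, value)

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.memtable_limit:
            self._rotate()

    def _rotate(self):
        # Only one immutable memtable may exist, so a pending one must be
        # flushed before the active memtable can be frozen.
        if self.immutable is not None:
            self.flush()
        self.immutable = self.active
        self.active = {}

    def flush(self):
        """Write the immutable memtable to a new SST in key order."""
        if self.immutable is None:
            return
        self.sst_files.append(sorted(self.immutable.items()))
        self.immutable = None
```

Writes keep landing in `active` while `immutable` waits for `flush()`; only when a second rotation arrives before the first flush has happened does the writer have to flush synchronously, which mirrors the "at most one immutable memtable" rule.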
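The bloom filter described in the SST section above answers "might this key be in this file?" with no false negatives. The sketch below is a generic bloom filter for illustration only: the class name `BloomFilter`, the bit-array size, the hash count, and the SHA-256 double hashing are all assumptions, not the filter format RocksDB or DocDB actually uses.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: never a false negative, occasionally a false positive.

    The bit-array size, hash count, and SHA-256 double hashing are arbitrary
    illustrative choices, not RocksDB's filter format.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not in this SST", so the file can be skipped.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

A reader would call `might_contain` before opening an SST's data blocks: `False` lets it skip the whole file, while `True` only means the key is probably present and the index block must still be consulted.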
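The SST layout described above pairs sorted data blocks with an index block that maps key ranges to blocks. As a rough sketch (hypothetical `build_sst` and `sst_get` helpers, with compression, filter blocks, and the footer left out), an index that stores the last key of each block lets a lookup binary-search to the single block that could hold a key.

```python
import bisect


def build_sst(sorted_items, block_size=2):
    """Split sorted (key, value) pairs into data blocks plus a sparse index.

    Each index entry records the last key of a block, so a lookup can
    binary-search the index to find the single block that may hold a key.
    Compression, filter blocks, and the footer are omitted.
    """
    blocks = [sorted_items[i:i + block_size] for i in range(0, len(sorted_items), block_size)]
    index = [block[-1][0] for block in blocks]  # last key of each block
    return {"blocks": blocks, "index": index}


def sst_get(sst, key):
    # The first index entry whose last key is >= key identifies the candidate block.
    i = bisect.bisect_left(sst["index"], key)
    if i == len(sst["index"]):
        return None  # key sorts after the last block
    for k, v in sst["blocks"][i]:  # only one data block is read
        if k == key:
            return v
    return None
```

Storing one key per block keeps the index small enough to cache, at the cost of scanning (or, in real implementations, binary-searching) a single data block on each lookup.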
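The _seek_, _next_, and _previous_ operations described in `lsm-sst.md` above can be illustrated with an iterator over a single sorted run. This is a sketch with hypothetical names (`SortedRunIterator`, `reverse_scan`); a real DocDB iterator also merges the memtable with every SST and reads blocks lazily. `reverse_scan` shows the descending pattern from the Previous section: seek near the end of a range, then step backward.

```python
import bisect


class SortedRunIterator:
    """Toy iterator over one sorted run of (key, value) pairs.

    A stand-in for an SST or memtable iterator; real iterators also merge
    several runs and read data blocks lazily.
    """

    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.pos = -1  # not positioned until the first seek

    def valid(self):
        return 0 <= self.pos < len(self.keys)

    def seek(self, target):
        # Land on the first key >= target (past the end if there is none).
        self.pos = bisect.bisect_left(self.keys, target)

    def next(self):
        self.pos += 1

    def prev(self):
        self.pos -= 1

    def entry(self):
        return self.keys[self.pos], self.values[self.pos]


def reverse_scan(it, upper):
    """Yield keys <= upper in descending order using seek, then prev."""
    it.seek(upper)
    if not it.valid() or it.entry()[0] > upper:
        it.prev()  # step back onto the last key <= upper
    while it.valid():
        yield it.entry()[0]
        it.prev()
```

For keys `a, c, e, g`, `seek("b")` lands on `c` (the closest key greater than or equal to the target), and `reverse_scan(it, "f")` yields `e`, `c`, `a`.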
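The read and delete paths above interact through tombstones: a delete is written like any other value, and a read stops at the newest entry it finds for a key. The sketch below assumes sources are checked newest first (active memtable, then immutable memtable, then SSTs) and uses plain dicts in place of SST lookups; the helpers `lsm_get` and `lsm_delete` and the `TOMBSTONE` sentinel are hypothetical, and DocDB's actual MVCC key encoding is not modeled.

```python
TOMBSTONE = object()  # sentinel marking a deleted key


def lsm_delete(key, active_memtable):
    # A delete is just a write of a tombstone; older values stay on disk for now.
    active_memtable[key] = TOMBSTONE


def lsm_get(key, memtables, sst_files):
    """Return the newest value for key, or None if absent or deleted.

    memtables and sst_files are lists of dicts ordered newest first. A real
    SST lookup would consult bloom filters and index blocks; a dict stands
    in for that here.
    """
    for source in list(memtables) + list(sst_files):
        if key in source:
            value = source[key]
            # The newest entry wins; a tombstone hides any older versions.
            return None if value is TOMBSTONE else value
    return None
```

After `lsm_delete("b", active)`, a read of `b` returns `None` even though an older SST still holds the old value; that value is only reclaimed when compaction removes the tombstone together with the data it shadows.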
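Compaction, as described above, merges sorted SST files and reclaims space from obsolete versions and tombstones. The following is a minimal sketch of a k-way merge with `heapq.merge`, under two assumptions worth stating loudly: the input SSTs are ordered newest first, and this is a full compaction that sees every file, so a tombstone can be dropped without an older copy surviving elsewhere. The `compact` function and `TOMBSTONE` sentinel are hypothetical.

```python
import heapq

TOMBSTONE = object()  # sentinel marking a deleted key
_NO_KEY = object()    # "no key seen yet" marker for the merge loop


def compact(sst_files):
    """Merge SSTs (given newest first) into one sorted run.

    Keeps only the newest version of each key and drops tombstones along
    with the older values they shadow. Assumes a full compaction that sees
    every file, so no older copy can survive elsewhere.
    """
    # Tag each entry with its file age (0 = newest) so that, for equal keys,
    # the newest version comes out of the merge first.
    runs = [[(key, age, value) for key, value in sst] for age, sst in enumerate(sst_files)]
    merged = []
    last_key = _NO_KEY
    for key, _age, value in heapq.merge(*runs, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # older version of a key that was already resolved
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged
```

A partial compaction that covers only some files would instead have to keep tombstones, because an older file outside the compaction could still hold a value they are hiding.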

docs/content/preview/architecture/query-layer/_index.md

Lines changed: 8 additions & 15 deletions
@@ -58,32 +58,25 @@ Views are realized during this phase. Whenever a query against a view (that is,
 
 ### Planner
 
-YugabyteDB needs to determine the most efficient way to execute a query and return the results. This process is handled by the query planner/optimizer component.
+The YugabyteDB query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution.
 
 The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data.
 
-If the query involves joining multiple tables, the planner evaluates different techniques to combine the data:
+After determining the optimal plan, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions.
 
-- Nested loop join: Scanning one table for each row in the other table. This can be efficient if one table is small or has a good index.
-- Merge join: Sorting both tables by the join columns and then merging them in parallel. This works well when the tables are already sorted or can be efficiently sorted.
-- Hash join: Building a hash table from one table and then scanning the other table to find matches in the hash table.
-For queries involving more than two tables, the planner considers different sequences of joining the tables to find the most efficient approach.
+The execution plan is then passed to the query executor component, which carries out the plan and returns the final query results.
 
-The planner estimates the cost of each possible execution plan and chooses the one expected to be the fastest, taking into account factors like table sizes, indexes, sorting requirements, and so on.
-
-After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results.
-
-{{<note>}}
-The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements.
-{{</note>}}
+{{<lead link="./planner-optimizer/">}}
+To learn how the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/)
+{{</lead>}}
 
 ### Executor
 
-After the query planner determines the optimal execution plan, the query executor component runs the plan and retrieves the required data. The executor sends appropriate requests to the other YB-TServers that hold the needed data to performs sorts, joins, aggregations, and then evaluates qualifications and finally returns the derived rows.
+After the query planner determines the optimal execution plan, the executor runs the plan and retrieves the required data. The executor sends requests to the other YB-TServers that hold the data needed to perform sorts, joins, and aggregations, then evaluates qualifications, and finally returns the derived rows.
 
 The executor works in a step-by-step fashion, recursively processing the plan from top to bottom. Each node in the plan tree is responsible for fetching or computing rows of data as requested by its parent node.
 
-For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to get rows from them.
+For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to retrieve rows.
 
 A child node may be a "Sort" node, which requests rows from its child, sorts them, and returns the sorted rows. The bottom-most child could be a "Sequential Scan" node that reads rows directly from a table.
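The planner text in `_index.md` above says the planner weighs strategies such as sequential scans against index lookups and picks the cheapest plan. The sketch below shows that selection step with a deliberately made-up cost model: the first three constants echo PostgreSQL's default `seq_page_cost`, `random_page_cost`, and `cpu_tuple_cost` settings, but `INDEX_PROBE_COST`, the formulas, and the `TableStats`/`choose_access_path` names are hypothetical and do not reflect YugabyteDB's distributed cost model.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    row_count: int       # for example, as estimated from ANALYZE statistics
    rows_per_page: int


# Hypothetical cost constants; the first three echo PostgreSQL's default
# seq_page_cost, random_page_cost, and cpu_tuple_cost settings.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_ROW_COST = 0.01
INDEX_PROBE_COST = 2.0


def seq_scan_cost(t):
    # Read every page in order and evaluate the filter on every row.
    pages = t.row_count / t.rows_per_page
    return pages * SEQ_PAGE_COST + t.row_count * CPU_ROW_COST


def index_scan_cost(t, selectivity):
    # Probe the index once, then fetch each matching row with a random read.
    matches = t.row_count * selectivity
    return INDEX_PROBE_COST + matches * (RANDOM_PAGE_COST + CPU_ROW_COST)


def choose_access_path(t, selectivity):
    """Return the cheapest access path and the cost of every candidate."""
    costs = {
        "Sequential Scan": seq_scan_cost(t),
        "Index Scan": index_scan_cost(t, selectivity),
    }
    return min(costs, key=costs.get), costs
```

With 100,000 rows at 100 rows per page, a filter matching 0.1% of rows favors the index scan (about 403 versus 2,000 cost units), while a filter matching half the table favors the sequential scan. The selectivity input is where statistics from ANALYZE enter the decision.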
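The demand-driven execution described above, where each plan node produces rows only when its parent asks for them, maps naturally onto Python generators. This sketch is hypothetical: the functions only mirror the "Sequential Scan", "Sort", and "Merge Join" node names, tables are in-memory lists of dicts, the merge join assumes unique join keys on its right input, and none of the requests the real executor sends to YB-TServers are modeled.

```python
def seq_scan(table):
    """Leaf node: yield rows straight from a table (a list of dicts here)."""
    for row in table:
        yield row


def sort_node(child, key):
    """Pull every row from the child, sort them, then yield in order."""
    yield from sorted(child, key=lambda row: row[key])


def merge_join(left, right, key):
    """Join two inputs already sorted on `key`, pulling rows on demand.

    Assumes join keys are unique on the right input to keep the sketch short.
    """
    right_iter = iter(right)
    r = next(right_iter, None)
    for l in left:
        # Advance the right input until its key catches up with the left row.
        while r is not None and r[key] < l[key]:
            r = next(right_iter, None)
        if r is None:
            return  # right input exhausted: no further matches possible
        if r[key] == l[key]:
            yield {**l, **r}
```

Building the tree `merge_join(sort_node(seq_scan(orders), "cust_id"), sort_node(seq_scan(customers), "cust_id"), "cust_id")` does no work by itself; rows start flowing only when the caller iterates the top node, which in turn pulls from the Sort nodes and, through them, from the scans.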

docs/content/preview/architecture/query-layer/join-strategies.md

Lines changed: 1 addition & 2 deletions
@@ -8,10 +8,9 @@ aliases:
 - /preview/explore/ysql-language-features/join-strategies/
 menu:
 preview:
-name: Join strategies
 identifier: joins-strategies-ysql
 parent: architecture-query-layer
-weight: 100
+weight: 200
 type: docs
 ---
 