forked from yugabyte/yugabyte-db
Alter table rewrite docs #8
Open: iSignal wants to merge 147 commits into master from alter-table-rewrite-docs
Address review suggestions. Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
Summary:
### Motivation
YugabyteDB currently uses a 256KiB YSQL output buffer, compared to PostgreSQL’s default of 8KiB. A larger buffer allows YSQL to retry queries internally in the event of serialization failures. This is crucial because once YSQL sends partial results to the client, it cannot safely retry the query—doing so risks emitting duplicate results such as:
```
id
----
(0 rows)
id
----
1
(1 row)
```
In YSQL, retries are best-effort in REPEATABLE READ but essential in READ COMMITTED, where serialization errors must not be surfaced to the user. PostgreSQL, as a single-node system, is not subject to read restart errors; YSQL, in contrast, relies on retries to avoid surfacing them.
However, the current 256KiB buffer is often insufficient. Large SELECT queries commonly exceed this threshold. These same queries are also more likely to encounter restart errors due to read/write timestamp conflicts. As a result, increasing the output buffer size is a frequent operational change.
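The interplay between buffering and retry safety can be illustrated with a small sketch (plain Python, not YugabyteDB code; all names are illustrative): a query can be retried transparently only while nothing has been flushed from the buffer to the client.

```python
class RetryableError(Exception):
    """Stands in for a serialization failure or read restart error."""

def run_with_retries(execute, buffer_limit, max_retries=3):
    """Retry `execute` transparently, but only while no rows have been
    flushed to the client; once partial results are sent, a retry would
    risk emitting duplicate rows, so the error must surface instead."""
    for attempt in range(max_retries + 1):
        buffered, flushed = [], []
        try:
            for row in execute():
                buffered.append(row)
                if len(buffered) >= buffer_limit:
                    flushed += buffered   # rows reach the client; retry now unsafe
                    buffered = []
        except RetryableError:
            if flushed or attempt == max_retries:
                raise                     # client already saw partial results
            continue                      # buffer never flushed: retry internally
        return flushed + buffered
```

A larger `buffer_limit` raises the threshold before the first flush, which is exactly why more queries stay internally retryable with a 1MiB buffer than with a 256KiB one.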
Raise the default buffer size to 1MiB, a common recommendation, to reduce friction and improve out-of-the-box reliability.
### Impact Analysis
**Q.** Do small queries incur increased memory usage?
No. Although each backend allocates a 1MiB buffer, the OS does not reserve physical memory for it unless a large query requires it. This behavior can be observed with the following script, which tracks proportional set size (PSS):
```
#!/bin/bash
peak_pss=0
while true; do
total_pss=0
for pid in $(ps -eo pid,comm | grep '[p]ostgres' | awk '{print $1}'); do
pss=$(awk '/Pss:/ {total += $2} END {print total+0}' /proc/$pid/smaps 2>/dev/null)
total_pss=$((total_pss + ${pss:-0}))
done
if (( total_pss > peak_pss )); then
peak_pss=$total_pss
fi
echo "Current PSS: ${total_pss} KB, Peak PSS: ${peak_pss} KB"
sleep 1
done
```
Test Setup:
```
CREATE TABLE kv(k INT PRIMARY KEY, v INT);
INSERT INTO kv SELECT i, i FROM GENERATE_SERIES(1, 100000) i;
```
`SELECT * FROM kv LIMIT 1000` → ~131 MiB PSS
`SELECT * FROM kv` → ~132 MiB PSS
This provides evidence that memory usage is incremental: the cost of the 1MiB buffer is not paid unless a query produces a large output.
**Q:** What about large queries?
* With 256KiB buffer: PSS increase ~3MiB
* With 1MiB buffer: PSS increase ~4MiB
The incremental cost is acceptable.
**Q:** How does this affect real-world workloads?
Ran TPC-H (analytical workload) via BenchBase against a replication factor 1 cluster:
* Idle PSS: ~120MiB
* Peak PSS (with and without buffer change): ~220MiB
The buffer size change had minimal impact on peak memory usage; other query-related allocations dominate.
### Caveats
1. Once allocated, buffer memory is not released until the connection closes.
2. First-row latency of large SELECT queries may increase due to buffering. This is an intentional tradeoff to reduce serialization failures.
Jira: DB-11163
Test Plan:
Jenkins
Close: yugabyte#22245
Backport-through: 2024.2
Reviewers: pjain, smishra, #db-approvers
Reviewed By: pjain
Subscribers: svc_phabricator, yql
Differential Revision: https://phorge.dev.yugabyte.com/D43805
Summary:
YbHnsw organizes data into blocks, which are loaded on demand. When new search operations require different blocks, previously loaded ones may need to be unloaded. This diff implements a block cache that manages:
- Dynamic loading of required blocks
- Unloading of inactive blocks when new blocks are needed
This ensures efficient memory usage while maintaining fast access to frequently used data.
Jira: DB-16564
Test Plan:
HnswTest.Cache
HnswTest.ConcurrentCache
Reviewers: arybochkin
Reviewed By: arybochkin
Subscribers: ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D43752
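The load-on-demand/unload-when-full policy described above can be sketched with a small LRU cache (illustrative Python, not the YbHnsw implementation; `load_block` and the eviction policy are assumptions):

```python
from collections import OrderedDict

class BlockCache:
    """Load blocks on demand; evict the least recently used block
    when capacity is reached, keeping hot blocks resident."""
    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block          # callback that reads a block
        self.blocks = OrderedDict()           # block_id -> block data, LRU order

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)     # mark as recently used
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # unload an inactive block
            self.blocks[block_id] = self.load_block(block_id)
        return self.blocks[block_id]
```

For example, with capacity 2, accessing blocks 1, 2, 1, 3 evicts block 2, since it was the least recently used at the time block 3 was loaded.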
…or deadlock
Summary:
The following lock order inversion could happen: CatalogManager::AlterTable acquires the catalog manager lock, then tries to replicate altered table information, which requires the raft replica lock. MasterSnapshotCoordinator::CreateReplicated is invoked while the replica lock is held by the apply thread; it then tries to get tablet info to schedule operations, which requires acquiring the catalog manager lock. This deadlock is auto-resolved via a timeout in alter table, but for that period all heartbeats and other operations that require the catalog manager lock are blocked. Fixed by using a separate thread pool to schedule tablet operations.
Jira: DB-15933
Test Plan:
```
./yb_build.sh fastdebug --gcc11 --cxx-test yb-admin-snapshot-schedule-test --gtest_filter YbAdminSnapshotScheduleTestWithYsqlColocationRestoreParam.PgsqlSequenceVerifyPartialRestore/DBColocated_Clone -n 40 -- -p 8
```
Reviewers: mhaddad
Reviewed By: mhaddad
Subscribers: ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D43681
…guring LDAP
Summary:
Previously, we showed the runtimeKeys to the user when running `yba ldap configure`. Since these keys are internal, we should avoid exposing implementation details to the end user.
This change adds `yba ldap describe`:
```
./yba ldap describe -h
Describe LDAP configuration for YBA
Usage:
yba ldap describe [flags]
Aliases:
describe, show, get
Examples:
yba ldap describe
Flags:
--user-set-only [Optional] Only show the fields that were set by the user explicitly.
-h, --help help for describe
```
Test Plan:
Tested locally
```
./yba ldap configure --ldap-host 10.23.16.4 --ldap-port 636 --ldap-ssl-protocol ldaps --ldap-tls-version "TLSv1_2" --base-dn '"CN=Users,CN=MRS,DC=LDAP,DC=COM"' --dn-prefix '"CN="' --service-account-dn '"CN=service_account,CN=MRS,DC=LDAP,DC=COM"' --service-account-password '"Service@123"' --group-member-attribute 'groupName'
LDAP configuration updated successfully.
LDAP configuration:
Key Value
base-dn CN=Users,CN=MRS,DC=LDAP,DC=COM
search-filter
default-role ReadOnly
ldap-host 10.23.16.4
group-use-role-mapping true
group-search-base
group-use-query false
ldap-port 636
ldaps-enabled true
search-and-bind-enabled false
service-account-dn CN=service_account,CN=MRS,DC=LDAP,DC=COM
group-attribute groupName
start-tls-enabled false
group-search-filter
search-attribute
tls-version TLSv1_2
ldap-enabled true
service-account-password ********
dn-prefix CN=
group-search-scope SUBTREE
```
```
./yba ldap describe
LDAP configuration:
Key Value
base-dn CN=Users,CN=MRS,DC=LDAP,DC=COM
search-filter
default-role ReadOnly
ldap-host 10.23.16.4
group-use-role-mapping true
group-search-base
group-use-query false
ldap-port 636
ldaps-enabled true
search-and-bind-enabled false
service-account-dn CN=service_account,CN=MRS,DC=LDAP,DC=COM
group-attribute groupName
start-tls-enabled false
group-search-filter
search-attribute
tls-version TLSv1_2
ldap-enabled true
service-account-password ********
dn-prefix CN=
group-search-scope SUBTREE
```
Reviewers: dkumar
Reviewed By: dkumar
Differential Revision: https://phorge.dev.yugabyte.com/D43825
Summary:
`yb.orig.get_current_transaction_priority` has been failing on Mac for a long time, due to small differences in output:
```diff
< 0.400000000 (High priority transaction)
---
> 0.400000095 (High priority transaction)
177c177
< 7 | 0.400000000 (High priority transaction)
---
> 7 | 0.400000095 (High priority transaction)
191c191
< 0.400000000 (High priority transaction)
---
> 0.400000095 (High priority transaction)
204,205c204,205
< 7 | 0.400000000 (High priority transaction)
< 8 | 0.400000000 (High priority transaction)
---
> 7 | 0.400000095 (High priority transaction)
> 8 | 0.400000095 (High priority transaction)
```
The solution is to split the output into two fields, one for the number and one for the comment, allowing the number to be presented however we choose. Two decimal places suffice to show the meaning of the test without also testing floating-point representation. Create the test function `yb_get_current_transaction_priority_platform_independent` to do this, and modify the test to use it. The function is slightly complex because it needs to handle the case where `yb_get_current_transaction_priority()` returns just `(Highest Priority Transaction)`.
Jira: DB-16598
Test Plan:
Jenkins: test regex: .*TestPgRegressProc.*
```
./yb_build.sh --java-test TestPgRegressProc
```
Reviewers: kramanathan
Reviewed By: kramanathan
Subscribers: svc_phabricator, yql
Differential Revision: https://phorge.dev.yugabyte.com/D43854
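The described split can be sketched as follows (illustrative Python, not the actual SQL test function; the parsing rules are assumptions based on the output shown above): separate the numeric priority from its comment and round to two decimals, so the assertion no longer depends on floating-point representation.

```python
def split_priority(s):
    """Split 'NUMBER (comment)' output into (rounded number, comment).
    Handles the special case where only '(Highest Priority Transaction)'
    is returned with no leading number."""
    if s.startswith("("):
        return (None, s.strip("()"))
    num, _, comment = s.partition(" ")
    # Two decimal places convey the meaning without testing fp representation.
    return (round(float(num), 2), comment.strip("()"))
```

With this, both `0.400000000` (Linux) and `0.400000095` (Mac) compare equal as `0.4`.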
…_fdw test
Summary:
The postgres_fdw extension transparently manages a YSQL connection to the foreign (remote) server. Queries to foreign tables are received by the extension over a “local” connection, translated into a remote query and sent over the “remote” connection to the foreign server. Consider the following example:
```sql
CREATE TABLE t1 (k INT, v INT);
CREATE FOREIGN TABLE ft1 (a INT OPTIONS (column_name 'k'), b INT OPTIONS (column_name 'v'))
  SERVER loopback OPTIONS (table_name 't1');  -- server name illustrative
```
A query which references columns `ft1.a, ft1.b` is translated to reference `t1.k, t1.v` as follows:
```sql
SELECT a, b FROM ft1; -- local connection
-- is translated into
SELECT k, v FROM t1; -- remote connection
```
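The rename above is driven by the per-column `column_name` options, which can be sketched as a simple mapping (illustrative Python, not the fdw implementation):

```python
def translate_columns(cols, column_options):
    """Map local (foreign-table) column names to remote column names
    using the column_name options; unmapped columns pass through unchanged."""
    return [column_options.get(c, c) for c in cols]
```

For the table above, `{"a": "k", "b": "v"}` sends `SELECT a, b` to the remote server as `SELECT k, v`.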
The postgres_fdw regress test uses a loopback interface to test the foreign server. In other words, the test reuses the local DB node as a foreign server. Therefore, both the local connection and the remote connection point to the same physical objects, despite referencing different logical objects. This is problematic for DDLs as the catalog versions in the local and remote connection can go out of sync, causing `Catalog Version Mismatch` errors. The regress test currently works around this error by sleeping for 1 second after every batch of DDLs which allows for the new catalog version to propagate via tserver <--> master heartbeat.
The DDLs can broadly be put into 3 buckets:
- The DDL only touches the foreign table entry (ft1 in the above example) AND the queries that follow it do not error out. The remote connection will always deal with the local table (t1) and there will be no mismatch.
  - Nothing needs to be done in such scenarios.
- The DDL only touches the foreign table entry AND the queries that follow it produce a warning/error.
  - This can be worked around by waiting for the next tserver <--> master heartbeat. This is necessary to ensure that the remote connection "sees" the new catalog version and uses it for the query.
  - In the absence of this, "Catalog Version Mismatch" errors are produced. This is not a concern in practical scenarios, where postgres_fdw connects distinct clusters.
  - A necessary condition to encounter spurious "Catalog Version Mismatch" errors is that the query does not do any catalog lookups in the planning phase.
- The local table (t1) is altered by the local connection. The remote connection needs to know about this before query execution.
  - This can be worked around by forcing a catalog refresh via a breaking change or by waiting out a tserver <--> master heartbeat interval.
With D43651 / a260932, backends now update the catalog version in shared memory after concluding a DDL.
As a result, other backends can learn about the bump in catalog version without having to obtain it from the tserver.
This makes waiting for a tserver <--> master heartbeat redundant in cases where both the local and remote connection are to the same DB node (i.e., they share memory).
Therefore, this revision now removes all instances of sleeps after DDLs.
The above analysis is provided as a reference for the future, when a variant of the same regress test may connect to a different node rather than use the loopback interface.
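The difference between the heartbeat path and the shared-memory path can be sketched as follows (illustrative Python, not YugabyteDB internals; class and field names are assumptions): once the DDL backend publishes the bumped version in node-local shared state, a sibling backend can read it directly instead of sleeping until the next heartbeat.

```python
class Node:
    """Stands in for the node-local shared memory segment."""
    def __init__(self):
        self.shared_catalog_version = 1

class Backend:
    def __init__(self, node):
        self.node = node
        self.local_version = node.shared_catalog_version

    def run_ddl(self):
        # Conclude a DDL: bump the version and publish it in shared memory.
        self.node.shared_catalog_version += 1
        self.local_version = self.node.shared_catalog_version

    def refresh_if_stale(self):
        # Previously only a tserver <--> master heartbeat propagated the bump
        # (hence the 1-second sleeps); now the shared value is read directly.
        if self.local_version < self.node.shared_catalog_version:
            self.local_version = self.node.shared_catalog_version
```

If the "remote" connection instead lived on a different node (no shared `Node`), the heartbeat wait would still be needed, which is why the analysis above is kept for reference.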
Jira: DB-1239, DB-1301
Test Plan:
Run the following test:
```
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressContribPostgresFdw#schedule'
```
Reviewers: kfranz, myang, #db-approvers, smishra
Reviewed By: myang, #db-approvers, smishra
Subscribers: svc_phabricator, jason, smishra, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D43708
… hint exists at the same level
Summary:
Pruning joins can be unsafe unless we are sure the pruned plan cannot participate in the final plan. In this case, there is a mix of inner and outer joins, no Leading hint is present, and a join needed for higher levels is incorrectly pruned. If there is a Leading hint at a level, then it is safe to prune non-hinted joins.
Jira: DB-16050
Test Plan:
TestPgRegressExtension (including new test), TestPgRegressJoin, TestPgRegressPlanner, TestPgRegressThirdPartyExtensionsPgHintPlan, TestPgRegressPgIndex, TestPgRegressAggregates, TestPgRegressTAQO, TestPgRegressParallel, TestPgRegressPgStatStatements, TestPgRegressPgSelect, TestPgRegressPartitions, TestPgRegressPgStat, TestPgRegressPgMatview, TestPgExplainAnalyzeJoins, TestPgRegressPgDml, TestPgCardinalityEstimation
Reviewers: mihnea, mtakahara
Reviewed By: mtakahara
Subscribers: jason, yql
Differential Revision: https://phorge.dev.yugabyte.com/D43371
…tch nodes
Summary: The operations throw an error since the node no longer exists in the universe. Instead, for these operations, the list of all nodes in the universe will be printed.
Test Plan: Manually test the 2 commands
Reviewers: skurapati
Reviewed By: skurapati
Differential Revision: https://phorge.dev.yugabyte.com/D44074
Summary:
The message property for failed tasks was previously formatted as:
"Failed to execute task {$task_params}: ${actual_error_message}".
While this provided context, it often made the error messages unnecessarily verbose and harder to read.
To improve clarity and user experience, we have updated the implementation to use only the actual error message. This is achieved by replacing the message property with originMessage, which displays only the raw error message returned by the task.
Also, instead of ordering the subTasks by groupType, we now sort them by position and group consecutive tasks that share the same subTaskType.
Also added an Expand All button.
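The regrouping described above can be sketched as follows (illustrative Python with hypothetical field names, not the YBA code): sort subtasks by position, then merge consecutive entries that share a subTaskType.

```python
from itertools import groupby

def group_subtasks(subtasks):
    """Sort subtasks by position, then group runs of consecutive
    subtasks that have the same subTaskType."""
    ordered = sorted(subtasks, key=lambda t: t["position"])
    return [list(g) for _, g in groupby(ordered, key=lambda t: t["subTaskType"])]
```

Note that grouping only merges adjacent runs: two "A" subtasks separated by a "B" would remain in separate groups, matching the "two consecutive tasks" rule above.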
Test Plan:
{F356442}
{F357448}
Tested manually
Reviewers: lsangappa
Reviewed By: lsangappa
Differential Revision: https://phorge.dev.yugabyte.com/D43972
Summary: Add a test to verify that enum oids are preserved after a YSQL major upgrade. The test also verifies that when an enum type is used as a hash partition key, rows remain correctly routed to the same partitions after the upgrade.
Jira: DB-16768
Test Plan:
```
./yb_build.sh release --cxx-test integration-tests_ysql_major_upgrade-test --gtest_filter YsqlMajorUpgradeTest.EnumTypes
```
Reviewers: telgersma
Reviewed By: telgersma
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D43872
Summary: Added a runtime flag to control backups during DDL. The flag yb.backup.enable_backups_during_ddl controls whether backup should run with ysql-dump read time, if supported by the DB. Incremented YBC client-server version to 2.2.0.2-b3.
Includes commits:
- Revert tablespaces default to false - https://github.com/yugabyte/ybc/commit/48b4628d0c65c86fcf1eca29f937eeede11376c3
- Make usage of ysql-dump read conditional on backup extended args params supplied by YBA - https://github.com/yugabyte/ybc/commit/ddfd3375af499f12ab1ebb0c48f1f91c277a463a
Also noticed a bug with revertToPreRoleBehavior where the params would not pass to the subtask BackupTableYbc; fixed.
Test Plan: dev itests, dev UTs
Reviewers: dshubin
Reviewed By: dshubin
Subscribers: mhaddad
Differential Revision: https://phorge.dev.yugabyte.com/D44018
* release notes for voyager 2025.5.2
* docs: Clarify versioning alignment with Yugabyte Voyager release cadence in release notes
* Apply suggestions from code review
* docs: Update release notes for v2025.5.2 to include new features
* update release note
* docs: Add note on automatic schema assessment for PostgreSQL in migration guides
* edit and format, fix, format, edit

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
Summary: Transition TSLocalLockManager lock calls to use the async functionality introduced in https://phorge.dev.yugabyte.com/D42862
Jira: DB-16723
Test Plan:
Jenkins
```
./yb_build.sh --cxx-test object_lock-test
./yb_build.sh --cxx-test ts_local_lock_manager-test
./yb_build.sh --cxx-test pg_object_locks-test
```
Reviewers: amitanand, zdrudi, rthallam, #db-approvers
Reviewed By: amitanand, #db-approvers
Subscribers: svc_phabricator, ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D44043
…SnapshotSimple
Summary:
The test PgHeapSnapshotTest.TestYsqlHeapSnapshotSimple started being flaky since ab7b8ed. That commit turned on incremental catalog cache refresh by default. The test relies on doing full catalog cache refreshes in order to allocate enough memory; with incremental catalog cache refresh, the test no longer runs full refreshes, which is why it became flaky. Commit 61c7270 changed the test to disable incremental catalog cache refresh by setting `--ysql-yb_enable_invalidation_messages=false` to restore the previous behavior. Although the test had been passing, after a recent commit a260932 it became flaky again with the same symptom. After debugging, I found that the fix in 61c7270 did not work as expected because the postmaster process is already started before we turn off `--ysql-yb_enable_invalidation_messages`; the test had been passing by accident. I reworked the fix by implementing `SetUp()`, which ensures the postmaster process starts with the gflag `--ysql-yb_enable_invalidation_messages=false`.
Jira: DB-16328
Test Plan:
(1) ./yb_build.sh release --cxx-test pgwrapper_pg_heap_snapshot-test --gtest_filter PgHeapSnapshotTest.TestYsqlHeapSnapshotSimple --clang19 -n 50
(2) ./yb_build.sh release --cxx-test pgwrapper_pg_heap_snapshot-test
Verify from the test output that only PgHeapSnapshotTest.TestYsqlHeapSnapshotSimple has `--ysql-yb_enable_invalidation_messages=false`; other tests keep the default value of `true`.
```
I0519 23:22:53.945465 1289735 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 0
I0519 23:23:26.073213 1290285 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 1
I0519 23:23:31.624302 1290674 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 1
I0519 23:23:37.160039 1291064 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 1
```
Reviewers: kfranz, sanketh, mihnea
Reviewed By: sanketh
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D44082
* add new and update schema workarounds for voyager
* keep similar wording
* Apply suggestions from code review

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
* Tutorial AI menus
* edit
* menus
* minor edit
* release notes for 2.25.2.0-b359
* edits
* date

Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
Summary:
### Issue
After 9e1c574 / D43672, the following commands fail an assertion check:
```
CREATE DATABASE restored_db;
ALTER DATABASE restored_db OWNER TO yugabyte;
```
The assertion check is in CheckAlterDatabaseDdl:
```
case T_AlterOwnerStmt:
{
	const AlterOwnerStmt *const stmt = castNode(AlterOwnerStmt, parsetree);

	/*
	 * ALTER DATABASE OWNER needs to have global impact, however we
	 * may have a no-op ALTER DATABASE OWNER when the new owner is the
	 * same as the old owner and there is no write made to pg_database
	 * to turn on is_global_ddl. Also in global catalog version mode
	 * is_global_ddl does not apply so it is not turned on either.
	 */
	if (stmt->objectType == OBJECT_DATABASE)
		Assert(ddl_transaction_state.is_global_ddl ||
			   !YBCPgHasWriteOperationsInDdlTxnMode() ||
			   !YBIsDBCatalogVersionMode());
	break;
}
```
The assertion failure happens because YBCPgHasWriteOperationsInDdlTxnMode() is true after commit 9e1c574, which locks the catalog version using SELECT FOR UPDATE. Although SELECT FOR UPDATE is not a write operation, the logic that determines whether there is a write operation also includes lock operations. See DoRunAsync in pg_session.cc:
```
// We can have a DDL event trigger that writes to a user table instead of ysql
// catalog table. The DDL itself may be a no-op (e.g., GRANT a privilege to a
// user that already has that privilege). We do not want to account this case
// as writing to ysql catalog so we can avoid incrementing the catalog version.
has_catalog_write_ops_in_ddl_mode_ =
    has_catalog_write_ops_in_ddl_mode_ ||
    (is_ddl && !IsReadOnly(*op) && is_ysql_catalog_table);
```
IsReadOnly also includes lock operations:
```
bool IsReadOnly(const PgsqlOp& op) {
  return op.is_read() && !IsValidRowMarkType(GetRowMarkType(op));
}
```
However, a SELECT FOR UPDATE by itself is not a write operation for the purposes of YBCPgHasWriteOperationsInDdlTxnMode.
### Fix
Replace !IsReadOnly(*op) with op->is_write().
### Impact
YBCPgHasWriteOperationsInDdlTxnMode() is also called for the early return in YbTrackPgTxnInvalMessagesForAnalyze():
```
/*
 * If there is no write, then there are no inval messages so this commit is
 * equivalent to a no-op.
 */
if (!YBCPgHasWriteOperationsInDdlTxnMode())
	return false;
```
This change also allows this early return optimization in the presence of SELECT FOR UPDATE on the catalog version. SELECT FOR UPDATE by itself does not generate any invalidation messages.
Jira: DB-16767
Test Plan: Jenkins
Reviewers: pjain, myang, smishra
Reviewed By: pjain, myang
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D44093
Summary: Commit b5b495d incorrectly translates some logic, causing system and copartition secondary index scans to lose the single-RPC optimization that embeds a table and index scan together in the same RPC. Fix the logic and add tests to cover the cases.
Jira: DB-16789
Test Plan:
On Almalinux 8:
```
./yb_build.sh fastdebug --gcc11 daemons initdb \
  --cxx-test pgwrapper_pg_libpq-test \
  --gtest_filter PgLibPqTest.Embedded\*
```
Close: yugabyte#27294
Reviewers: sanketh
Reviewed By: sanketh
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D44085
Summary: Implement cGroup via node-agent
Test Plan: manual testing
Reviewers: nsingh
Reviewed By: nsingh
Differential Revision: https://phorge.dev.yugabyte.com/D44096
Summary: dumpRoleChecks can be null
Test Plan: itests
Reviewers: vkumar
Reviewed By: vkumar
Differential Revision: https://phorge.dev.yugabyte.com/D44107
Summary: Skip collection of WARN logs in the support bundle. Reason: WARN logs are already present in the INFO logs, so collecting them separately wastes space in the support bundle.
Test Plan: Manually tested. Ran itests and UTs.
Reviewers: vkumar
Reviewed By: vkumar
Differential Revision: https://phorge.dev.yugabyte.com/D44075
iSignal pushed a commit that referenced this pull request on Jun 6, 2025
Summary:
After commit f85bbca, the vmodule flag is no longer respected by the postgres process. For example:
```
ybd release --cxx-test pgwrapper_pg_analyze-test --gtest_filter PgAnalyzeTest.AnalyzeSamplingColocated --test-args '--vmodule=pg_sample=1' -n 2 -- -p 1 -k
zgrep pg_sample ~/logs/latest_test/1.log
```
shows no vlogs. The reason is that `VLOG(1)` is used early by:
```
#0 0x00007f7e1b48b090 in google::InitVLOG3__(google::SiteFlag*, int*, char const*, int)@plt () from /net/dev-server-timur/share/code/yugabyte-db/build/debug-clang19-dynamic-ninja/lib/libyb_util_shmem.so
#1 0x00007f7e1b47616e in yb::(anonymous namespace)::NegotiatorSharedState::WaitProposal (this=0x7f7e215e8000) at ../../src/yb/util/shmem/reserved_address_segment.cc:108
#2 0x00007f7e1b4781e0 in yb::AddressSegmentNegotiator::Impl::NegotiateChild (fd=45) at ../../src/yb/util/shmem/reserved_address_segment.cc:252
#3 0x00007f7e1b4737ce in yb::AddressSegmentNegotiator::NegotiateChild (fd=45) at ../../src/yb/util/shmem/reserved_address_segment.cc:376
#4 0x00007f7e1b742b7b in yb::tserver::SharedMemoryManager::InitializePostmaster (this=0x7f7e202e9788 <yb::pggate::PgSharedMemoryManager()::shared_mem_manager>, fd=45) at ../../src/yb/tserver/tserver_shared_mem.cc:252
#5 0x00007f7e2023588f in yb::pggate::PgSetupSharedMemoryAddressSegment () at ../../src/yb/yql/pggate/pg_shared_mem.cc:29
#6 0x00007f7e202788e9 in YBCSetupSharedMemoryAddressSegment () at ../../src/yb/yql/pggate/ybc_pg_shared_mem.cc:22
#7 0x000055636b8956f5 in PostmasterMain (argc=21, argv=0x52937fe4e790) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1083
#8 0x000055636b774bfe in PostgresServerProcessMain (argc=21, argv=0x52937fe4e790) at ../../../../../../src/postgres/src/backend/main/main.c:209
#9 0x000055636b7751f2 in main ()
```
and caches the `vmodule` value before `InitGFlags` sets it from the environment. The fix is to explicitly call `UpdateVmodule` from `InitGFlags` after setting `vmodule`.
Jira: DB-15888
Test Plan:
```
ybd release --cxx-test pgwrapper_pg_analyze-test --gtest_filter PgAnalyzeTest.AnalyzeSamplingColocated --test-args '--vmodule=pg_sample=1' -n 2 -- -p 1 -k
zgrep pg_sample ~/logs/latest_test/1.log
```
Reviewers: hsunder
Reviewed By: hsunder
Subscribers: ybase, yql
Tags: #jenkins-ready, #jenkins-trigger
Differential Revision: https://phorge.dev.yugabyte.com/D42731
iSignal pushed a commit that referenced this pull request on Jun 6, 2025
…rdup for tablegroup_name
Summary:
As part of D36859 / 0dbe7d6, backup and restore support for colocated tables when multiple tablespaces exist was introduced. Upon fetching the tablegroup_name from `pg_yb_tablegroup`, the value was read and assigned via `PQgetvalue` without copying. This led to a use-after-free bug when the tablegroup_name was later read in dumpTableSchema, since the result of the SQL query is immediately cleared on the next line (`PQclear`).
```
[P-yb-controller-1] ==3037==ERROR: AddressSanitizer: heap-use-after-free on address 0x51d0002013e6 at pc 0x55615b0a1f92 bp 0x7fff92475970 sp 0x7fff92475118
[P-yb-controller-1] READ of size 8 at 0x51d0002013e6 thread T0
[P-yb-controller-1] #0 0x55615b0a1f91 in strcmp ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:470:5
[P-yb-controller-1] #1 0x55615b1b90ba in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15789:8
[P-yb-controller-1] #2 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1] #3 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1] #4 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1] #5 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
[P-yb-controller-1] #6 0x55615b0894bd in _start (${BUILD_ROOT}/postgres/bin/ysql_dump+0x10d4bd)
[P-yb-controller-1]
[P-yb-controller-1] 0x51d0002013e6 is located 358 bytes inside of 2048-byte region [0x51d000201280,0x51d000201a80)
[P-yb-controller-1] freed by thread T0 here:
[P-yb-controller-1] #0 0x55615b127196 in free ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
[P-yb-controller-1] #1 0x7f3c02d65e85 in PQclear ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:755:3
[P-yb-controller-1] #2 0x55615b1c0103 in getYbTablePropertiesAndReloptions ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:19108:4
[P-yb-controller-1] #3 0x55615b1b8fab in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15765:3
[P-yb-controller-1] #4 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1] #5 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1] #6 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1] #7 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
[P-yb-controller-1]
[P-yb-controller-1] previously allocated by thread T0 here:
[P-yb-controller-1] #0 0x55615b12742f in malloc ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
[P-yb-controller-1] #1 0x7f3c02d680a7 in pqResultAlloc ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:633:28
[P-yb-controller-1] #2 0x7f3c02d81294 in getRowDescriptions ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-protocol3.c:544:4
[P-yb-controller-1] #3 0x7f3c02d7f793 in pqParseInput3 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-protocol3.c:324:11
[P-yb-controller-1] #4 0x7f3c02d6bcc8 in parseInput ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2014:2
[P-yb-controller-1] #5 0x7f3c02d6bcc8 in PQgetResult ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2100:3
[P-yb-controller-1] #6 0x7f3c02d6cd87 in PQexecFinish ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2417:19
[P-yb-controller-1] #7 0x7f3c02d6cd87 in PQexec ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2256:9
[P-yb-controller-1] #8 0x55615b1f45df in ExecuteSqlQuery ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_backup_db.c:296:8
[P-yb-controller-1] #9 0x55615b1f4213 in ExecuteSqlQueryForSingleRow ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_backup_db.c:311:8
[P-yb-controller-1] #10 0x55615b1c008d in getYbTablePropertiesAndReloptions ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:19102:10
[P-yb-controller-1] #11 0x55615b1b8fab in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15765:3
[P-yb-controller-1] #12 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1] #13 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1] #14 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1] #15 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
```
This revision fixes the issue by using pg_strdup to make a copy of the string.
Jira: DB-15915
Test Plan:
```
./yb_build.sh asan --cxx-test integration-tests_xcluster_ddl_replication-test --gtest_filter XClusterDDLReplicationTest.DDLReplicationTablesNotColocated
```
Reviewers: aagrawal, skumar, mlillibridge, sergei
Reviewed By: aagrawal, sergei
Subscribers: sergei, yql
Differential Revision: https://phorge.dev.yugabyte.com/D43386
ddhodge pushed a commit that referenced this pull request on Jun 14, 2025
…ck/release functions at TabletService

Summary: In the functions `TabletServiceImpl::AcquireObjectLocks` and `TabletServiceImpl::ReleaseObjectLocks`, we weren't returning after executing the rpc callback when the initial validation steps failed. This led to segv issues like the one below:

```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) [inlined] std::__1::unique_ptr<yb::tserver::TSLocalLockManager::Impl, std::__1::default_delete<yb::tserver::TSLocalLockManager::Impl>>::operator->[abi:ne190100](this=0x0000000000000000) const at unique_ptr.h:272:108
    frame #1: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) [inlined] yb::tserver::TSLocalLockManager::AcquireObjectLocksAsync(this=0x0000000000000000, req=0x00005001bfffa290, deadline=yb::CoarseTimePoint @ x23, callback=0x0000ffefb6066560, wait=(value_ = true)) at ts_local_lock_manager.cc:541:3
    frame #2: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(this=0x00005001bdaf6020, req=0x00005001bfffa290, resp=0x00005001bfffa300, context=<unavailable>) at tablet_service.cc:3673:26
    frame #3: 0x0000aaaac36bd9a0 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] yb::tserver::TabletServerServiceIf::InitMethods(this=<unavailable>, req=0x00005001bfffa290, resp=0x00005001bfffa300, rpc_context=RpcContext @ 0x0000ffefb6066600)::$_36::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext)::operator()(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) const at tserver_service.service.cc:1470:9
    frame #4: 0x0000aaaac36bd978 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) at local_call.h:126:7
    frame #5: 0x0000aaaac36bd680 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36::operator()(this=<unavailable>, call=<unavailable>) const at tserver_service.service.cc:1468:7
    frame #6: 0x0000aaaac36bd5c8 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] decltype(std::declval<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&>()(std::declval<std::__1::shared_ptr<yb::rpc::InboundCall>>())) std::__1::__invoke[abi:ne190100]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&, std::__1::shared_ptr<yb::rpc::InboundCall>>(__f=<unavailable>, __args=<unavailable>) at invoke.h:149:25
    frame #7: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne190100]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&, std::__1::shared_ptr<yb::rpc::InboundCall>>(__args=<unavailable>, __args=<unavailable>) at invoke.h:224:5
    frame #8: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] std::__1::__function::__alloc_func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ne190100](this=<unavailable>, __arg=<unavailable>) at function.h:171:12
    frame #9: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(this=<unavailable>, __arg=<unavailable>) at function.h:313:10
    frame #10: 0x0000aaaac36d1384 yb-tserver`yb::tserver::TabletServerServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::__function::__value_func<void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ne190100](this=<unavailable>, __args=nullptr) const at function.h:430:12
    frame #11: 0x0000aaaac36d136c yb-tserver`yb::tserver::TabletServerServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::function<void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(this=<unavailable>, __arg=nullptr) const at function.h:989:10
    frame #12: 0x0000aaaac36d136c yb-tserver`yb::tserver::TabletServerServiceIf::Handle(this=<unavailable>, call=<unavailable>) at tserver_service.service.cc:913:3
    frame #13: 0x0000aaaac30e05b4 yb-tserver`yb::rpc::ServicePoolImpl::Handle(this=0x00005001bff9b8c0, incoming=nullptr) at service_pool.cc:275:19
    frame #14: 0x0000aaaac3006ed0 yb-tserver`yb::rpc::InboundCall::InboundCallTask::Run(this=<unavailable>) at inbound_call.cc:309:13
    frame #15: 0x0000aaaac30ec868 yb-tserver`yb::rpc::(anonymous namespace)::Worker::Execute(this=0x00005001bff5c640, task=0x00005001bfdf1958) at thread_pool.cc:138:13
    frame #16: 0x0000aaaac39afd18 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ne190100](this=0x00005001bfe1e750) const at function.h:430:12
    frame #17: 0x0000aaaac39afd04 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x00005001bfe1e750) const at function.h:989:10
    frame #18: 0x0000aaaac39afd04 yb-tserver`yb::Thread::SuperviseThread(arg=0x00005001bfe1e6e0) at thread.cc:937:3
```

This revision addresses the issue by returning after executing the rpc callback with a validation failure status.

Jira: DB-17124

Test Plan: Jenkins

Reviewers: rthallam, amitanand

Reviewed By: amitanand

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D44663
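The fix pattern above (return immediately after the callback reports a validation failure, instead of falling through to dereference state that validation just proved unusable) can be sketched as follows. The types and names here are illustrative, not the actual TabletService code:

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for the lock manager the handler dereferences.
struct LockManager {
  bool Acquire() { return true; }
};

// Returns true if the lock work actually ran. The bug pattern is invoking
// `respond` on the null-manager path and then falling through to
// manager->Acquire(), which segfaults; the fix is the early return.
bool HandleAcquire(LockManager* manager,
                   const std::function<void(const std::string&)>& respond) {
  if (manager == nullptr) {
    respond("lock manager not available");
    return false;  // the fix: return after the callback, do not fall through
  }
  bool ok = manager->Acquire();
  respond(ok ? "OK" : "acquire failed");
  return true;
}
```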
ddhodge pushed a commit that referenced this pull request on Jun 14, 2025
…own flags are set at ObjectLockManager

Summary: In the context of object locking, commit 6e80c56 / D44228 got rid of the logic that signaled obsolete waiters corresponding to transactions that issued a release-all-locks request (which could have been terminated due to failures like timeout, deadlock, etc.) in order to early-terminate failed waiting requests. Hence, we now let the obsolete requests terminate organically from the OLM, resumed by the poller thread that runs at an interval of `olm_poll_interval_ms` (defaults to 100ms). This led to one of the itests failing with the below stack:

```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000aaaac8a093ec yb-tserver`yb::ThreadPoolToken::SubmitFunc(std::__1::function<void ()>) [inlined] yb::ThreadPoolToken::Submit(this=<unavailable>, r=<unavailable>) at threadpool.cc:146:10
    frame #1: 0x0000aaaac8a093ec yb-tserver`yb::ThreadPoolToken::SubmitFunc(this=0x0000000000000000, f=<unavailable>) at threadpool.cc:142:10
    frame #2: 0x0000aaaac73cdfe8 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoSignal(this=0x00003342bfa0d400, entry=<unavailable>) at object_lock_manager.cc:767:3
    frame #3: 0x0000aaaac73cc7c0 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoLock(std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>, yb::docdb::LockData&&, yb::StronglyTypedBool<yb::docdb::(anonymous namespace)::IsLockRetry_Tag>, unsigned long, yb::Status) [inlined] yb::docdb::ObjectLockManagerImpl::PrepareAcquire(this=0x00003342bfa0d400, txn_lock=<unavailable>, transaction_entry=std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>::element_type @ 0x00003342bfa94a38, data=0x00003342b9a6a830, resume_it_offset=<unavailable>, resume_with_status=<unavailable>) at object_lock_manager.cc:523:5
    frame #4: 0x0000aaaac73cc6a8 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoLock(this=0x00003342bfa0d400, transaction_entry=std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>::element_type @ 0x00003342bfa94a38, data=0x00003342b9a6a830, is_retry=(value_ = true), resume_it_offset=<unavailable>, resume_with_status=Status @ 0x0000ffefaa036658) at object_lock_manager.cc:552:27
    frame #5: 0x0000aaaac73cbcb4 yb-tserver`yb::docdb::WaiterEntry::Resume(this=0x00003342b9a6a820, lock_manager=0x00003342bfa0d400, resume_with_status=<unavailable>) at object_lock_manager.cc:381:17
    frame #6: 0x0000aaaac85bdd4c yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() at object_lock_manager.cc:752:13
    frame #7: 0x0000aaaac85bda74 yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() [inlined] yb::docdb::ObjectLockManager::Shutdown(this=<unavailable>) at object_lock_manager.cc:1092:10
    frame #8: 0x0000aaaac85bda6c yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() [inlined] yb::tserver::TSLocalLockManager::Impl::Shutdown(this=<unavailable>) at ts_local_lock_manager.cc:411:26
    frame #9: 0x0000aaaac85bd7e8 yb-tserver`yb::tserver::TSLocalLockManager::Shutdown(this=<unavailable>) at ts_local_lock_manager.cc:566:10
    frame #10: 0x0000aaaac8665a34 yb-tserver`yb::tserver::YsqlLeasePoller::Poll() [inlined] yb::tserver::TabletServer::ResetAndGetTSLocalLockManager(this=0x000033423fc1ad80) at tablet_server.cc:797:28
    frame #11: 0x0000aaaac8665a18 yb-tserver`yb::tserver::YsqlLeasePoller::Poll() [inlined] yb::tserver::TabletServer::ProcessLeaseUpdate(this=0x000033423fc1ad80, lease_refresh_info=0x000033423a476b80) at tablet_server.cc:828:22
    frame #12: 0x0000aaaac8665950 yb-tserver`yb::tserver::YsqlLeasePoller::Poll(this=<unavailable>) at ysql_lease_poller.cc:143:18
    frame #13: 0x0000aaaac8438d58 yb-tserver`yb::tserver::MasterLeaderPollScheduler::Impl::Run(this=0x000033423ff5cc80) at master_leader_poller.cc:125:25
    frame #14: 0x0000aaaac89ffd18 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ne190100](this=0x000033423ffc7930) const at function.h:430:12
    frame #15: 0x0000aaaac89ffd04 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000033423ffc7930) const at function.h:989:10
    frame #16: 0x0000aaaac89ffd04 yb-tserver`yb::Thread::SuperviseThread(arg=0x000033423ffc78c0) at thread.cc:937:3
    frame #17: 0x0000ffffac0378b8 libpthread.so.0`start_thread + 392
    frame #18: 0x0000ffffac093afc libc.so.6`thread_start + 12
```

This is due to accessing the unique_ptr `thread_pool_token_` after it has been reset. This revision fixes the issue by not scheduling any tasks on the threadpool once the shutdown flag has been set (hence not accessing `thread_pool_token_`). Since we wait for in-progress requests at the OLM, and also for in-progress resume tasks scheduled on the messenger using `waiters_amidst_resumption_on_messenger_`, it is safe to say that `thread_pool_token_` will not be accessed once it is reset.

Jira: DB-17121

Test Plan: Jenkins
./yb_build.sh --cxx-test='TEST_F(PgObjectLocksTestRF1, TestShutdownWithWaiters) {'

Reviewers: rthallam, amitanand, sergei

Reviewed By: amitanand

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D44662
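The shutdown-guard pattern described above (stop handing work to the pool token once the shutdown flag is set, so the token can later be reset without racing submitters) can be sketched as follows. The class and member names here are invented stand-ins, not the actual ObjectLockManager code:

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Illustrative scheduler: once Shutdown() flips the flag under the lock,
// TrySchedule never touches the (soon to be reset) pool again.
class TaskScheduler {
 public:
  // Returns false once shutdown has begun; the caller must then finish the
  // waiter inline rather than go through the pool.
  bool TrySchedule(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (shutting_down_) return false;  // the fix: never submit after shutdown
    pending_.push_back(std::move(task));  // stands in for token->SubmitFunc
    return true;
  }

  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    shutting_down_ = true;
    pending_.clear();  // stands in for draining work and resetting the token
  }

 private:
  std::mutex mutex_;
  bool shutting_down_ = false;
  std::vector<std::function<void()>> pending_;
};
```

Because both the flag check and the submit happen under the same mutex, no submission can slip in between observing "not shutting down" and the token being reset.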
iSignal pushed a commit that referenced this pull request on Jul 10, 2025
…ow during index backfill.

Summary: In the last few weeks we have seen a few instances of the stress test (with various nemeses) run into a master crash with a stack trace that looks like:

```
* thread #1, name = 'yb-master', stop reason = signal SIGSEGV: invalid address
  * frame #0: 0x0000aaaad52f5fc4 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::shared_ptr<yb::master::BackfillTablet>::shared_ptr[abi:ue170006]<yb::master::BackfillTablet, void>(this=<unavailable>, __r=std::__1::weak_ptr<yb::master::BackfillTablet>::element_type @ 0x000013e4bf787778) at shared_ptr.h:701:20
    frame #1: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::enable_shared_from_this<yb::master::BackfillTablet>::shared_from_this[abi:ue170006](this=0x000013e4bf787778) at shared_ptr.h:1954:17
    frame #2: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=0x000013e4bf787778) at backfill_index.cc:1300:50
    frame #3: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
    frame #4: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4d458) at backfill_index.cc:1620:5
    frame #5: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:470:3
    frame #6: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:273:5
    frame #7: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4d458) at backfill_index.cc:1463:19
    frame #8: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
    frame #9: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
    frame #10: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cd98) at backfill_index.cc:1620:5
    frame #11: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:470:3
    frame #12: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:273:5
    frame #13: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cd98) at backfill_index.cc:1463:19
    frame #14: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
    frame #15: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
    frame #16: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cfd8) at backfill_index.cc:1620:5
    frame #17: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:470:3
    frame #18: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:273:5
    frame #19: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cfd8) at backfill_index.cc:1463:19
    frame #20: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
    frame #21: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
    ...
    frame #2452: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bdc7ed98) at backfill_index.cc:1620:5
    frame #2453: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:470:3
    frame #2454: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:273:5
    frame #2455: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bdc7ed98) at backfill_index.cc:1463:19
    frame #2456: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
    frame #2457: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
    frame #2458: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4ba1ff458) at backfill_index.cc:1620:5
    frame #2459: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:470:3
    frame #2460: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:273:5
    frame #2461: 0x0000aaaad52c0260 yb-master`yb::master::RetryingRpcTask::RunDelayedTask(this=0x000013e4ba1ff458, status=0x0000ffffab2668c0) at async_rpc_tasks.cc:432:14
    frame #2462: 0x0000aaaad5c3f838 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] boost::function1<void, yb::Status const&>::operator()(this=0x000013e4bff63b18, a0=0x0000ffffab2668c0) const at function_template.hpp:763:14
    frame #2463: 0x0000aaaad5c3f81c yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] yb::rpc::DelayedTask::TimerHandler(this=0x000013e4bff63ae8, watcher=<unavailable>, revents=<unavailable>) at delayed_task.cc:155:5
    frame #2464: 0x0000aaaad5c3f284 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(loop=<unavailable>, w=<unavailable>, revents=<unavailable>) at ev++.h:479:7
    frame #2465: 0x0000aaaad4cdf170 yb-master`ev_invoke_pending + 112
    frame #2466: 0x0000aaaad4ce21fc yb-master`ev_run + 2940
    frame #2467: 0x0000aaaad5c725fc yb-master`yb::rpc::Reactor::RunThread() [inlined] ev::loop_ref::run(this=0x000013e4bfcfadf8, flags=0) at ev++.h:211:7
    frame #2468: 0x0000aaaad5c725f4 yb-master`yb::rpc::Reactor::RunThread(this=0x000013e4bfcfadc0) at reactor.cc:735:9
    frame #2469: 0x0000aaaad65c61d8 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000013e4bfeffa80) const at function.h:517:16
    frame #2470: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000013e4bfeffa80) const at function.h:1168:12
    frame #2471: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(arg=0x000013e4bfeffa20) at thread.cc:895:3
```

Essentially, a BackfillChunk is considered done (without sending out an RPC) and launches the next BackfillChunk, which does the same. This may happen if `BackfillTable::indexes_to_build()` is empty, or if `backfill_jobs()` is empty. However, based on the code reading, we should only get there **after** marking `BackfillTable::done_` as `true`. If for some reason we have `indexes_to_build()` empty and `BackfillTable::done_ == false`, we could get into this infinite recursion. Since I am unable to explain and recreate how this happens, I'm adding a test flag `TEST_simulate_empty_indexes` to repro this.

Fix: We update `BackfillChunk::SendRequest` to handle an empty `indexes_to_build()` as a failure rather than treating it as a success. This prevents the infinite recursion. Also adding a few log lines that may help better understand the scenario if we run into this again.

Jira: DB-17296

Test Plan: yb_build.sh fastdebug --cxx-test pg_index_backfill-test --gtest_filter *.SimulateEmptyIndexesForStackOverflow*

Reviewers: zdrudi, rthallam, jason

Reviewed By: zdrudi

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D45031
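The failure mode and fix above can be sketched as follows. This is an illustration of the control-flow cycle, not the actual master code: the enum and function are invented, with "report failure on an empty index list" standing in for the change to `BackfillChunk::SendRequest`:

```cpp
#include <string>
#include <vector>

// If launching a chunk with nothing to build reported instant success, and
// success immediately launched the next chunk, the Done ->
// LaunchNextChunkOrDone cycle would recurse until the stack overflowed.
// Surfacing the empty list as a failure terminates the cycle.
enum class LaunchResult { kSent, kFailed, kAllDone };

LaunchResult LaunchNextChunk(std::vector<std::string>& indexes_to_build,
                             bool backfill_done) {
  if (backfill_done) return LaunchResult::kAllDone;
  if (indexes_to_build.empty()) {
    // Fix pattern: fail instead of "succeeding" without sending an RPC,
    // which would make the caller launch the next chunk immediately.
    return LaunchResult::kFailed;
  }
  indexes_to_build.pop_back();  // pretend one chunk RPC went out
  return LaunchResult::kSent;
}
```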
ddhodge pushed a commit that referenced this pull request on Aug 20, 2025
…s closed in multi route pooling
Summary:
**Issue Summary**
A core dump was triggered during a ConnectionBurst stress test, with the crash occurring in the `od_backend_close_connection` function under multi-route pooling. The stack trace is as follows:
```
frame #0: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] mm_tls_free(io=0x0000000000000000) at tls.c:91:10
frame #1: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] machine_io_free(obj=0x0000000000000000) at io.c:201:2
frame #2: 0x00005601a627129e odyssey`od_backend_close_connection [inlined] od_io_close(io=0x000031f53e72b8b8) at io.h:77:2
frame #3: 0x00005601a627128c odyssey`od_backend_close_connection(server=0x000031f53e72b880) at backend.c:56:2
frame #4: 0x00005601a6250de5 odyssey`od_router_attach(router=0x00007fff00dbeb30, client_for_router=0x000031f53e5df180, wait_for_idle=<unavailable>, external_client=0x000031f53ee30680) at router.c:1010:6
frame #5: 0x00005601a6258b1b odyssey`od_auth_frontend [inlined] yb_execute_on_control_connection(client=0x000031f53ee30680, function=<unavailable>) at frontend.c:2842:11
frame #6: 0x00005601a6258b0b odyssey`od_auth_frontend(client=0x000031f53ee30680) at auth.c:677:8
frame #7: 0x00005601a626782e odyssey`od_frontend(arg=0x000031f53ee30680) at frontend.c:2539:8
frame #8: 0x00005601a6290912 odyssey`mm_scheduler_main(arg=0x000031f53e390000) at scheduler.c:17:2
frame #9: 0x00005601a6290b77 odyssey`mm_context_runner at context.c:28:2
```
**Root Cause**
The crash originated from an improper lock release in the `yb_get_idle_server_to_close` function, introduced in commit 55beeb0 during the multi-route pooling implementation. The function released the lock on the route object despite a comment explicitly warning against it. After it returned to its caller, no lock was held on the route or idle_route. This allowed other coroutines to access and use the same route and its idle server, which the original coroutine intended to close. This race condition led to a crash due to an assertion failure during connection closure.
**Note**
If the order of acquiring locks is the same across all threads or processes, differences in the release order alone cannot cause a deadlock. Deadlocks arise from circular dependencies during acquisition, not release.
In the connection manager code base:
- Locks are acquired in the order router → route. This order must be strictly enforced everywhere to prevent deadlocks.
- Lock release order varies (e.g., router then route in `od_router_route` and `yb_get_idle_server_to_close`, versus the reverse elsewhere). This variation does not cause deadlocks, as release order is irrelevant to deadlock prevention.
Jira: DB-17501
Test Plan: Jenkins: all tests
Reviewers: skumar, vikram.damle, asrinivasan, arpit.saxena
Reviewed By: skumar
Subscribers: svc_phabricator, yql
Differential Revision: https://phorge.dev.yugabyte.com/D45641
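The router → route locking rule described in the note above can be sketched as follows. This is a minimal illustration under invented names, not the Odyssey code: every call site acquires the two locks in the same fixed order, so no circular wait can form, while the unlock order is free to differ between call sites:

```cpp
#include <mutex>

// Illustrative locks standing in for the router and route objects.
std::mutex router_lock;
std::mutex route_lock;
int idle_servers = 1;

// Returns true if an idle server was claimed for closing. The work happens
// while BOTH locks are held, which is the property the buggy code violated
// by releasing the route lock before its caller was done with the server.
bool take_idle_server() {
  router_lock.lock();  // always acquired first, at every call site
  route_lock.lock();   // always acquired second
  bool got = idle_servers > 0;
  if (got) --idle_servers;
  // Release order may vary between call sites without any deadlock risk.
  router_lock.unlock();
  route_lock.unlock();
  return got;
}
```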
iSignal pushed a commit that referenced this pull request on Nov 26, 2025
Summary: The stacktrace of the core dump:

```
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000aaaac59fb720 postgres`FreeTupleDesc [inlined] GetMemoryChunkContext(pointer=0x0000000000000000) at memutils.h:141:12
    frame #1: 0x0000aaaac59fb710 postgres`FreeTupleDesc [inlined] pfree(pointer=0x0000000000000000) at mcxt.c:1500:26
    frame #2: 0x0000aaaac59fb710 postgres`FreeTupleDesc(tupdesc=0x000013d7fd8dccc8) at tupdesc.c:326:5
    frame #3: 0x0000aaaac61c7204 postgres`RelationDestroyRelation(relation=0x000013d7fd8dc9a8, remember_tupdesc=false) at relcache.c:4577:4
    frame #4: 0x0000aaaac5febab8 postgres`YBRefreshCache at relcache.c:5216:3
    frame #5: 0x0000aaaac5feba94 postgres`YBRefreshCache at postgres.c:4442:2
    frame #6: 0x0000aaaac5feb50c postgres`YBRefreshCacheWrapperImpl(catalog_master_version=0, is_retry=false, full_refresh_allowed=true) at postgres.c:4570:3
    frame #7: 0x0000aaaac5feea34 postgres`PostgresMain [inlined] YBRefreshCacheWrapper(catalog_master_version=0, is_retry=false) at postgres.c:4586:9
    frame #8: 0x0000aaaac5feea2c postgres`PostgresMain [inlined] YBCheckSharedCatalogCacheVersion at postgres.c:4951:3
    frame #9: 0x0000aaaac5fee984 postgres`PostgresMain(dbname=<unavailable>, username=<unavailable>) at postgres.c:6574:4
    frame #10: 0x0000aaaac5efe5b4 postgres`BackendRun(port=0x000013d7ffc06400) at postmaster.c:4995:2
    frame #11: 0x0000aaaac5efdd08 postgres`ServerLoop [inlined] BackendStartup(port=0x000013d7ffc06400) at postmaster.c:4701:3
    frame #12: 0x0000aaaac5efdc70 postgres`ServerLoop at postmaster.c:1908:7
    frame #13: 0x0000aaaac5ef8ef8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) at postmaster.c:1562:11
    frame #14: 0x0000aaaac5ddae1c postgres`PostgresServerProcessMain(argc=25, argv=0x000013d7ffe068f0) at main.c:213:3
    frame #15: 0x0000aaaac59dee38 postgres`main + 36
    frame #16: 0x0000ffff9f606340 libc.so.6`__libc_start_call_main + 112
    frame #17: 0x0000ffff9f606418 libc.so.6`__libc_start_main@@GLIBC_2.34 + 152
    frame #18: 0x0000aaaac59ded34 postgres`_start + 52
```

It is related to invalidation messages. The test involves concurrent DDL execution without object locking. I added a few logs to help debug this issue.

Test Plan:
(1) Append to the end of the file ./build/latest/postgres/share/postgresql.conf.sample:

```
yb_debug_log_catcache_events=1
log_min_messages=DEBUG1
```

(2) Create an RF-1 cluster:

```
./bin/yb-ctl create --rf 1
```

(3) Run the following example via ysqlsh:

```
-- === 1. SETUP ===
DROP TABLE IF EXISTS accounts_timetravel;
CREATE TABLE accounts_timetravel (
    id INT PRIMARY KEY,
    balance INT,
    last_updated TIMESTAMPTZ
);
INSERT INTO accounts_timetravel VALUES (1, 1000, now());

\echo '--- 1. Initial Data (The Past) ---'
SELECT * FROM accounts_timetravel;

-- Wait 2 seconds
SELECT pg_sleep(2);

-- === 2. CAPTURE THE "PAST" HLC TIMESTAMP ===
--
-- *** THIS IS THE FIX ***
-- Get the current time as seconds from the Unix epoch,
-- multiply by 1,000,000 to get microseconds,
-- and cast to a big integer.
--
SELECT (EXTRACT(EPOCH FROM now())*1000000)::bigint AS snapshot_hlc \gset
SELECT :snapshot_hlc;

\echo '--- (Snapshot HLC captured) ---'
SELECT * FROM pg_yb_catalog_version;

-- Wait 2 more seconds
SELECT pg_sleep(2);

-- === 3. UPDATE THE DATA ===
UPDATE accounts_timetravel SET balance = 500, last_updated = now() WHERE id = 1;

\echo '--- 2. New Data (The Present) ---'
SELECT * FROM accounts_timetravel;

CREATE TABLE foo(id int);
-- increment the catalog version
ALTER TABLE foo ADD COLUMN val TEXT;
SELECT * FROM pg_yb_catalog_version;

-- === 4. PERFORM THE TIME-TRAVEL QUERY ===
--
-- Set our 'read_time_guc' variable to the HLC value
--
\set read_time_guc :snapshot_hlc

\echo '--- 3. Time-Travel Read (Querying the Past) ---'
\echo 'Setting yb_read_time to HLC (microseconds):' :read_time_guc

-- This will now be interpolated correctly and will succeed.
SET yb_read_time = :read_time_guc;

-- This query will now correctly read the historical data
SELECT * FROM accounts_timetravel;
SELECT * FROM pg_yb_catalog_version;

-- === 5. CLEANUP ===
RESET yb_read_time;

\echo '--- 4. Back to the Present ---'
SELECT * FROM accounts_timetravel;
DROP TABLE accounts_timetravel;
```

(4) Look at the postgres log for the following samples:

```
2025-11-07 18:31:06.223 UTC [3321231] LOG: Preloading relcache for database 13524, session user id: 10, yb_read_time: 0
```

```
2025-11-07 18:31:06.303 UTC [3321231] LOG: Building relcache entry for pg_index (oid 2610) took 785 us
```

```
2025-11-07 18:31:09.265 UTC [3321221] LOG: Rebuild relcache entry for accounts_timetravel (oid 16384)
```

```
2025-11-07 18:31:09.525 UTC [3321221] LOG: Delete relcache entry for accounts_timetravel (oid 16384)
```

```
2025-11-07 18:31:14.035 UTC [3321221] DEBUG: Setting yb_read_time to 1762540271568993
```

```
2025-11-07 18:31:14.037 UTC [3321221] LOG: Preloading relcache for database 13524, session user id: 13523, yb_read_time: 1762540271568993
```

```
2025-11-07 18:31:14.183 UTC [3321221] DEBUG: Setting yb_read_time to 0
```

Reviewers: kfranz, #db-approvers

Reviewed By: kfranz, #db-approvers

Subscribers: jason, yql

Differential Revision: https://phorge.dev.yugabyte.com/D48114