@iSignal iSignal commented Apr 30, 2025

No description provided.

iSignal and others added 30 commits April 30, 2025 09:49
Address review suggestions.

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
Summary:
### Motivation

YugabyteDB currently uses a 256KiB YSQL output buffer, compared to PostgreSQL’s default of 8KiB. A larger buffer allows YSQL to retry queries internally in the event of serialization failures. This is crucial because once YSQL sends partial results to the client, it cannot safely retry the query—doing so risks emitting duplicate results such as:

```
 id
----
(0 rows)

 id
----
  1
(1 row)
```

In YSQL, while retries are best-effort in REPEATABLE READ, they are essential in READ COMMITTED to ensure serialization errors are not surfaced to the user. Also, PostgreSQL is not subject to read restart errors because it is a single-node system; in contrast, YSQL relies on retries to avoid throwing read restart errors.

However, the current 256KiB buffer is often insufficient. Large SELECT queries commonly exceed this threshold. These same queries are also more likely to encounter restart errors due to read/write timestamp conflicts. As a result, increasing the output buffer size is a frequent operational change.

Raise the default buffer size to 1MiB, a common recommendation, to reduce friction and improve out-of-the-box reliability.
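The retry-safety rule above can be sketched as follows. This is an illustrative Python model (all names are invented for this sketch), not the actual YSQL buffering code: a query can be retried internally only while no bytes have been flushed to the client, and a larger buffer keeps more queries in that retryable state.

```python
class OutputBuffer:
    """Sketch: results are buffered; once any bytes spill to the client,
    an internal retry is no longer safe (duplicates would be emitted)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffered = b""
        self.flushed = False

    def write(self, data):
        self.buffered += data
        while len(self.buffered) > self.capacity:
            # Buffer overflow: partial results are flushed mid-query.
            self.flushed = True
            self.buffered = self.buffered[self.capacity:]

    def can_retry(self):
        # Safe to transparently restart only if nothing reached the client.
        return not self.flushed
```

With a 1MiB capacity, a result set that fits under 1MiB stays fully buffered and retryable, where a 256KiB capacity would have flushed it.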

### Impact Analysis

**Q:** Do small queries incur increased memory usage?

No. Although each backend allocates a 1MiB buffer, the OS does not actually reserve this memory unless a large query requires it. This behavior can be observed using the following script to track proportional set size (PSS):

```bash
#!/bin/bash

peak_pss=0

while true; do
  total_pss=0
  for pid in $(ps -eo pid,comm | grep '[p]ostgres' | awk '{print $1}'); do
    pss=$(awk '/Pss:/ {total += $2} END {print total}' /proc/$pid/smaps 2>/dev/null)
    pss=${pss:-0}  # guard against unreadable or vanished /proc entries
    total_pss=$((total_pss + pss))
  done

  if (( total_pss > peak_pss )); then
    peak_pss=$total_pss
  fi

  echo "Current PSS: ${total_pss} KB, Peak PSS: ${peak_pss} KB"
  sleep 1
done
```

Test Setup:

```
CREATE TABLE kv(k INT PRIMARY KEY, v INT);
INSERT INTO kv SELECT i, i FROM GENERATE_SERIES(1, 100000) i;
```

`SELECT * FROM kv LIMIT 1000` → ~131 MiB PSS

`SELECT * FROM kv` → ~132 MiB PSS

This provides evidence that memory usage is incremental: the cost of the 1 MiB buffer size is not paid unless a query produces a large output.

**Q:** What about large queries?

* With 256KiB buffer: PSS increase ~3MiB
* With 1MiB buffer: PSS increase ~4MiB

The incremental cost is acceptable.

**Q:** How does this affect real-world workloads?

Ran TPC-H (analytical workload) via BenchBase against a replication factor 1 cluster:

* Idle PSS: ~120MiB
* Peak PSS (with and without buffer change): ~220MiB

The buffer size change had minimal impact on peak memory usage; other query-related allocations dominate.

### Caveats

1. Once allocated, buffer memory is not released until the connection closes.
2. First-row latency of large SELECT queries may increase due to buffering. This is an intentional tradeoff to reduce serialization failures.

Jira: DB-11163

Test Plan:
Jenkins

Close: yugabyte#22245
Backport-through: 2024.2

Reviewers: pjain, smishra, #db-approvers

Reviewed By: pjain

Subscribers: svc_phabricator, yql

Differential Revision: https://phorge.dev.yugabyte.com/D43805
Summary:
YbHnsw organizes data into blocks, which are loaded on demand. When new search operations require different blocks, previously loaded ones may need to be unloaded.

This diff implements a block cache to manage:
- Dynamic loading of required blocks
- Unloading of inactive blocks when new blocks are needed

This ensures efficient memory usage while maintaining fast access to frequently used data.
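The load-on-demand and unload-on-pressure behavior described above can be sketched as a small LRU cache. This is an illustrative Python sketch, not the actual YbHnsw C++ implementation (whose eviction policy may differ); `load_block` and `capacity` are assumptions of the sketch.

```python
from collections import OrderedDict

class BlockCache:
    """Minimal LRU block cache: loads blocks on demand and unloads the
    least recently used block when capacity is exceeded."""

    def __init__(self, load_block, capacity):
        self.load_block = load_block   # callback that fetches a block from storage
        self.capacity = capacity
        self.blocks = OrderedDict()    # block_id -> block data, in LRU order

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        block = self.load_block(block_id)      # dynamic load of a missing block
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:   # unload the coldest block
            self.blocks.popitem(last=False)
        return block
```

A hot block stays resident across searches, while touching new blocks evicts only the least recently used entries.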
Jira: DB-16564

Test Plan:
HnswTest.Cache
HnswTest.ConcurrentCache

Reviewers: arybochkin

Reviewed By: arybochkin

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D43752
…or deadlock

Summary:
The following lock order inversion could happen:
CatalogManager::AlterTable acquires the catalog manager lock, then tries to replicate the altered table information, which requires the raft replica lock.

MasterSnapshotCoordinator::CreateReplicated is invoked while the replica lock is held by the apply thread. It then tries to get tablet info to schedule operations, but obtaining tablet info requires acquiring the catalog manager lock.

This deadlock is auto-resolved via a timeout in alter table, but for that period all heartbeats and other operations that require the catalog manager lock are blocked.

Fixed by using a separate thread pool to schedule tablet operations.
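This is the classic ABBA inversion: one path takes catalog lock then replica lock, the other takes replica lock then catalog lock. A minimal Python sketch of the fix, with invented names standing in for the real C++ structures: the apply-side path submits the catalog-lock work to a separate pool and releases the replica lock first, so no thread ever waits on the second lock while holding the first in the inverted order.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

catalog_lock = threading.Lock()   # stands in for the catalog manager lock
replica_lock = threading.Lock()   # stands in for the raft replica lock
schedule_pool = ThreadPoolExecutor(max_workers=1)  # the separate pool from the fix

def schedule_tablet_ops():
    # Needs the catalog manager lock to obtain tablet info.
    with catalog_lock:
        return "tablets-info"

def alter_table():
    # Path 1: catalog manager lock -> replica lock.
    with catalog_lock:
        with replica_lock:
            return "altered"

def create_replicated():
    # Path 2: invoked with the replica lock held by the apply thread.
    # The deadlock-prone version would take catalog_lock inline here.
    # Fixed version: hand the catalog-lock work to a separate pool and
    # return, so the replica lock is released before that work runs.
    with replica_lock:
        return schedule_pool.submit(schedule_tablet_ops)
```

Running both paths concurrently now completes without a timeout, because the pool thread acquires only the catalog lock and never waits while a replica lock is held.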
Jira: DB-15933

Test Plan: ./yb_build.sh fastdebug --gcc11 --cxx-test yb-admin-snapshot-schedule-test --gtest_filter YbAdminSnapshotScheduleTestWithYsqlColocationRestoreParam.PgsqlSequenceVerifyPartialRestore/DBColocated_Clone -n 40 -- -p 8

Reviewers: mhaddad

Reviewed By: mhaddad

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D43681
…guring LDAP

Summary:
Earlier, we showed the runtimeKeys to the user when running ldap configure. Since these keys are internal, we should avoid exposing implementation details to end users.
This diff adds `yba ldap describe`:

```
./yba ldap describe -h
Describe LDAP configuration for YBA

Usage:
  yba ldap describe [flags]

Aliases:
  describe, show, get

Examples:
yba ldap describe

Flags:
      --user-set-only   [Optional] Only show the fields that were set by the user explicitly.
  -h, --help            help for describe
```

Test Plan:
Tested locally

```
/yba ldap configure --ldap-host 10.23.16.4 --ldap-port 636 --ldap-ssl-protocol ldaps --ldap-tls-version "TLSv1_2"  --base-dn '"CN=Users,CN=MRS,DC=LDAP,DC=COM"' --dn-prefix '"CN="' --service-account-dn '"CN=service_account,CN=MRS,DC=LDAP,DC=COM"' --service-account-password '"Service@123"' --group-member-attribute 'groupName'
LDAP configuration updated successfully.
LDAP configuration:
Key                        Value
base-dn                    CN=Users,CN=MRS,DC=LDAP,DC=COM
search-filter
default-role               ReadOnly
ldap-host                  10.23.16.4
group-use-role-mapping     true
group-search-base
group-use-query            false
ldap-port                  636
ldaps-enabled              true
search-and-bind-enabled    false
service-account-dn         CN=service_account,CN=MRS,DC=LDAP,DC=COM
group-attribute            groupName
start-tls-enabled          false
group-search-filter
search-attribute
tls-version                TLSv1_2
ldap-enabled               true
service-account-password   ********
dn-prefix                  CN=
group-search-scope         SUBTREE
```

```
./yba ldap describe
LDAP configuration:
Key                        Value
base-dn                    CN=Users,CN=MRS,DC=LDAP,DC=COM
search-filter
default-role               ReadOnly
ldap-host                  10.23.16.4
group-use-role-mapping     true
group-search-base
group-use-query            false
ldap-port                  636
ldaps-enabled              true
search-and-bind-enabled    false
service-account-dn         CN=service_account,CN=MRS,DC=LDAP,DC=COM
group-attribute            groupName
start-tls-enabled          false
group-search-filter
search-attribute
tls-version                TLSv1_2
ldap-enabled               true
service-account-password   ********
dn-prefix                  CN=
group-search-scope         SUBTREE
```

Reviewers: dkumar

Reviewed By: dkumar

Differential Revision: https://phorge.dev.yugabyte.com/D43825
Summary:
`yb.orig.get_current_transaction_priority` has been failing on Mac for a long time due to small floating-point differences in the output:

```lang=diff
<  0.400000000 (High priority transaction)
---
>  0.400000095 (High priority transaction)
177c177
<  7 | 0.400000000 (High priority transaction)
---
>  7 | 0.400000095 (High priority transaction)
191c191
<  0.400000000 (High priority transaction)
---
>  0.400000095 (High priority transaction)
204,205c204,205
<  7 | 0.400000000 (High priority transaction)
<  8 | 0.400000000 (High priority transaction)
---
>  7 | 0.400000095 (High priority transaction)
>  8 | 0.400000095 (High priority transaction)
```

The solution is to split the output into two fields, one for the number and one for the comment, allowing the number to be presented however we choose. Two decimal places suffice to show the meaning of the test without also testing floating point representation.

Create the test function `yb_get_current_transaction_priority_platform_independent` to do this, and then modify the test to use this function. The function is slightly complex because it needs to handle the case where `yb_get_current_transaction_priority()` returns just `(Highest Priority Transaction)`.
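The splitting can be illustrated with a small Python sketch (the actual helper is a SQL function in the regress test, so this only models the idea): separate the numeric priority from the comment, render the number at two decimal places, and handle the number-less `(Highest Priority Transaction)` case.

```python
def split_priority(raw):
    """Split e.g. '0.400000095 (High priority transaction)' into a
    platform-independent two-decimal number string and the comment."""
    raw = raw.strip()
    if raw.startswith("("):
        # No numeric part, e.g. '(Highest Priority Transaction)'.
        return None, raw.strip("()")
    num, _, comment = raw.partition(" ")
    return f"{float(num):.2f}", comment.strip("()")
```

Both the Mac value `0.400000095` and the Linux value `0.400000000` normalize to `0.40`, so the diff shown above disappears.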
Jira: DB-16598

Test Plan:
Jenkins: test regex: .*TestPgRegressProc.*

```
./yb_build.sh --java-test TestPgRegressProc
```

Reviewers: kramanathan

Reviewed By: kramanathan

Subscribers: svc_phabricator, yql

Differential Revision: https://phorge.dev.yugabyte.com/D43854
…_fdw test

Summary:
The postgres_fdw extension transparently manages a YSQL connection to the foreign (remote) server. Queries to foreign tables are received by the extension over a “local” connection, translated into a remote query and sent over the “remote” connection to the foreign server. Consider the following example:

```sql
CREATE TABLE t1 (k INT, v INT);
CREATE FOREIGN TABLE ft1 (a INT OPTIONS (column_name 'k'), b INT OPTIONS (column_name 'v'));
```

A query which references columns `ft1.a, ft1.b` is translated to reference `t1.k, t1.v` as follows:

```sql
SELECT a, b FROM ft1; -- local connection
-- is translated into
SELECT k, v FROM t1; -- remote connection
```

The postgres_fdw regress test uses a loopback interface to test the foreign server. In other words, the test reuses the local DB node as a foreign server. Therefore, both the local connection and the remote connection point to the same physical objects, despite referencing different logical objects. This is problematic for DDLs, as the catalog versions in the local and remote connection can go out of sync, causing `Catalog Version Mismatch` errors. The regress test currently works around this error by sleeping for 1 second after every batch of DDLs, which allows the new catalog version to propagate via the tserver <--> master heartbeat.

The DDLs can broadly be put into 3 buckets:
 - The DDL only touches the foreign table entry (ft1 in the above example) AND the queries that follow it do not error out. The remote connection will always deal with the local table (t1) and there will be no mismatch.
    - Nothing needs to be done in such scenarios.
 - The DDL only touches the foreign table entry AND the queries that follow it produce a warning/error.
    - This can be worked around by waiting for the next tserver <--> master heartbeat.
    - This is necessary to ensure that the remote connection "sees" the new catalog version and uses it for the query.
    - In the absence of this, "Catalog Version Mismatch" errors are produced. This is not of concern in practical scenarios where postgres_fdw will be used to connect distinct clusters.
    - A necessary condition to encounter spurious "Catalog Version Mismatch" errors is that the query does not do any catalog look ups in the planning phase.
 - The local table (t1) is altered by the local connection. The remote connection needs to know about this before query execution.
    - This can be worked around by forcing a catalog refresh via a breaking change or by waiting out a tserver <--> master heartbeat interval.

With D43651 / a260932, backends now update the catalog version in shared memory after concluding a DDL.
As a result, other backends can learn about the bump in catalog version without having to obtain it from the tserver.
This makes waiting for a tserver <--> master heartbeat redundant in cases where both the local and remote connection are to the same DB node (i.e., share memory).
Therefore, this revision now removes all instances of sleeps after DDLs.
The above analysis is provided as a reference for the future, when a variant of the same regress test may connect to a different node rather than use the loopback interface.
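Why the sleeps become redundant can be sketched with a tiny Python model (the class and field names here are illustrative stand-ins, not the actual YSQL structures): the DDL backend publishes the bumped catalog version to per-node shared memory, and any co-located backend reads it directly instead of waiting for a heartbeat.

```python
class SharedMemory:
    """Stands in for per-node shared memory visible to all local backends."""
    def __init__(self):
        self.catalog_version = 1

class Backend:
    def __init__(self, shmem):
        self.shmem = shmem
        self.cached_version = shmem.catalog_version

    def run_ddl(self):
        # After the DDL concludes, publish the bumped version to shared memory.
        self.shmem.catalog_version += 1
        self.cached_version = self.shmem.catalog_version

    def begin_query(self):
        # A co-located backend picks up the bump directly from shared
        # memory, without a tserver <--> master heartbeat round trip.
        self.cached_version = self.shmem.catalog_version
        return self.cached_version
```

In the loopback setup, the "local" and "remote" backends share one `SharedMemory`, so the remote connection sees the new version immediately; across distinct nodes it would still need the heartbeat path.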

Jira: DB-1239, DB-1301

Test Plan:
Run the following test:
```
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressContribPostgresFdw#schedule'
```

Reviewers: kfranz, myang, #db-approvers, smishra

Reviewed By: myang, #db-approvers, smishra

Subscribers: svc_phabricator, jason, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D43708
… hint exists at the same level

Summary:
Pruning joins can be unsafe unless we are sure the pruned plan cannot participate
in the final plan. In this case, there is a mix of inner and outer joins,
no Leading hint is present, and a join needed at higher levels is incorrectly pruned.
If there is a Leading hint at a level, then it is safe to prune non-hinted joins at that level.
Jira: DB-16050

Test Plan:
TestPgRegressExtension (including new test)
TestPgRegressJoin
TestPgRegressPlanner
TestPgRegressThirdPartyExtensionsPgHintPlan
TestPgRegressPgIndex
TestPgRegressAggregates
TestPgRegressTAQO
TestPgRegressParallel
TestPgRegressPgStatStatements
TestPgRegressPgSelect
TestPgRegressPartitions
TestPgRegressPgStat
TestPgRegressPgMatview
TestPgExplainAnalyzeJoins
TestPgRegressPgDml
TestPgCardinalityEstimation

Reviewers: mihnea, mtakahara

Reviewed By: mtakahara

Subscribers: jason, yql

Differential Revision: https://phorge.dev.yugabyte.com/D43371
Deepti-yb and others added 23 commits May 20, 2025 05:24
…tch nodes

Summary: The operations throw an error since the node no longer exists in the universe. Instead, for these operations, the list of all nodes in the universe will be printed.

Test Plan: Manually test the 2 commands

Reviewers: skurapati

Reviewed By: skurapati

Differential Revision: https://phorge.dev.yugabyte.com/D44074
Summary:
The message property for failed tasks was previously formatted as:
"Failed to execute task {$task_params}: ${actual_error_message}".
While this provided context, it often made the error messages unnecessarily verbose and harder to read.

To improve clarity and user experience, we have updated the implementation to use only the actual error message. This was achieved by replacing the message property with originMessage, which displays only the raw error message returned by the task.

Also, instead of ordering the subTasks by groupType, we now sort them by position and group two consecutive tasks that have the same subTaskType.

Also added an Expand All button.

Test Plan:
{F356442}
{F357448}
Tested manually

Reviewers: lsangappa

Reviewed By: lsangappa

Differential Revision: https://phorge.dev.yugabyte.com/D43972
Summary:
Add a test to verify that the enum oids are preserved after a YSQL major upgrade.
The test also verifies that when an enum type is used as a hash partition key,
rows remain correctly routed to the same partitions after the upgrade.
Jira: DB-16768

Test Plan: ./yb_build.sh release --cxx-test integration-tests_ysql_major_upgrade-test --gtest_filter YsqlMajorUpgradeTest.EnumTypes

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D43872
Summary:
Added a runtime flag to control backups during DDL. The flag yb.backup.enable_backups_during_ddl controls whether a backup
should run with the ysql-dump read time, if supported by the DB.

Incremented YBC client-server version to 2.2.0.2-b3.
Includes commits:
- Revert tablespaces default to false -
   https://github.com/yugabyte/ybc/commit/48b4628d0c65c86fcf1eca29f937eeede11376c3

- Make usage of ysql-dump read conditional on backup extended args params supplied by YBA -
   https://github.com/yugabyte/ybc/commit/ddfd3375af499f12ab1ebb0c48f1f91c277a463a

Also fixed a bug where the revertToPreRoleBehavior params were not passed to the BackupTableYbc subtask.

Test Plan: dev itests, dev UTs

Reviewers: dshubin

Reviewed By: dshubin

Subscribers: mhaddad

Differential Revision: https://phorge.dev.yugabyte.com/D44018
* release notes for voyager 2025.5.2

* docs: Clarify versioning alignment with Yugabyte Voyager release cadence in release notes

* Apply suggestions from code review

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>

* docs: Update release notes for v2025.5.2 to include new features

* update release note

* docs: Add note on automatic schema assessment for PostgreSQL in migration guides

* edit and format

* fix

* format

* edit

---------

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
Summary:
Transition TSLocalLockManager lock calls to use the async functionality introduced in https://phorge.dev.yugabyte.com/D42862
Jira: DB-16723

Test Plan:
Jenkins

./yb_build.sh --cxx-test object_lock-test
./yb_build.sh --cxx-test ts_local_lock_manager-test
./yb_build.sh --cxx-test pg_object_locks-test

Reviewers: amitanand, zdrudi, rthallam, #db-approvers

Reviewed By: amitanand, #db-approvers

Subscribers: svc_phabricator, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D44043
…SnapshotSimple

Summary:
The test PgHeapSnapshotTest.TestYsqlHeapSnapshotSimple started being flaky since
ab7b8ed. That commit turned on incremental
catalog cache refresh by default. The test relies on doing full catalog cache
refreshes in order to allocate enough memory. With incremental catalog cache
refresh the test no longer runs full catalog cache refreshes and that's why it
became flaky.

Commit 61c7270 changed the test to disable
incremental catalog cache refresh by setting
`--ysql-yb_enable_invalidation_messages=false` to get back the previous
behavior.

Although the test has been passing, after a recent commit
a260932 the test started being flaky again with
the same symptom. After debugging, I found that the fix in
61c7270 did not work as expected because the
postmaster process is already started before the test sets
`--ysql-yb_enable_invalidation_messages=false`. The test had been passing by accident.

I reworked the fix to turn off `--ysql-yb_enable_invalidation_messages` by
implementing `SetUp()`, which ensures that the postmaster process starts with the
gflag `--ysql-yb_enable_invalidation_messages=false`.
Jira: DB-16328

Test Plan:
(1) ./yb_build.sh release --cxx-test pgwrapper_pg_heap_snapshot-test --gtest_filter PgHeapSnapshotTest.TestYsqlHeapSnapshotSimple --clang19 -n 50

(2) ./yb_build.sh release --cxx-test pgwrapper_pg_heap_snapshot-test
Verify from test output that only PgHeapSnapshotTest.TestYsqlHeapSnapshotSimple has
`--ysql-yb_enable_invalidation_messages=false`. Other tests continue to
have the default value of `--ysql-yb_enable_invalidation_messages=true`.

```
I0519 23:22:53.945465 1289735 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 0
I0519 23:23:26.073213 1290285 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 1
I0519 23:23:31.624302 1290674 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 1
I0519 23:23:37.160039 1291064 pg_heap_snapshot-test.cc:39] FLAGS_ysql_yb_enable_invalidation_messages: 1
```

Reviewers: kfranz, sanketh, mihnea

Reviewed By: sanketh

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D44082
* add new and update schema workarounds for voyager

* keep similar wording

* Apply suggestions from code review

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>

---------

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
* Tutorial AI menus

* edit

* menus

* minor edit
* release notes for 2.25.2.0-b359

* edits

* date

---------

Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
Summary:
### Issue

After 9e1c574 / D43672, the following commands fail an assertion check:

```
CREATE DATABASE restored_db;
ALTER DATABASE restored_db OWNER TO yugabyte;
```

The assertion check is in CheckAlterDatabaseDdl

```
case T_AlterOwnerStmt:
	{
		const AlterOwnerStmt *const stmt = castNode(AlterOwnerStmt, parsetree);

		/*
		 * ALTER DATABASE OWNER needs to have global impact, however we
		 * may have a no-op ALTER DATABASE OWNER when the new owner is the
		 * same as the old owner and there is no write made to pg_database
		 * to turn on is_global_ddl. Also in global catalog version mode
		 * is_global_ddl does not apply so it is not turned on either.
		 */
		if (stmt->objectType == OBJECT_DATABASE)
			Assert(ddl_transaction_state.is_global_ddl ||
				   !YBCPgHasWriteOperationsInDdlTxnMode() ||
				   !YBIsDBCatalogVersionMode());
		break;
	}
```

The assertion failure happens because YBCPgHasWriteOperationsInDdlTxnMode() is true after commit 9e1c574. That commit locks the catalog version using SELECT FOR UPDATE. Although SELECT FOR UPDATE is not a write operation, the logic that determines whether there is a write operation also counts lock operations. See DoRunAsync in pg_session.cc:

```
// We can have a DDL event trigger that writes to a user table instead of ysql
// catalog table. The DDL itself may be a no-op (e.g., GRANT a privilege to a
// user that already has that privilege). We do not want to account this case
// as writing to ysql catalog so we can avoid incrementing the catalog version.
has_catalog_write_ops_in_ddl_mode_ =
    has_catalog_write_ops_in_ddl_mode_ ||
    (is_ddl && !IsReadOnly(*op) && is_ysql_catalog_table);
```

IsReadOnly treats lock operations (row-marked reads) as not read-only:

```
bool IsReadOnly(const PgsqlOp& op) {
  return op.is_read() && !IsValidRowMarkType(GetRowMarkType(op));
}
```

However, a SELECT FOR UPDATE by itself is not a write operation for the purposes of YBCPgHasWriteOperationsInDdlTxnMode.

### Fix

Replace `!IsReadOnly(*op)` with `op->is_write()`.

### Impact

YBCPgHasWriteOperationsInDdlTxnMode() is also called to enable an early return in YbTrackPgTxnInvalMessagesForAnalyze():

```
/*
 * If there is no write, then there are no inval messages so this commit is
 * equivalent to a no-op.
 */
if (!YBCPgHasWriteOperationsInDdlTxnMode())
	return false;
```

This change also allows this early return optimization in the presence of SELECT FOR UPDATE on the catalog version. SELECT FOR UPDATE by itself does not generate any invalidation messages.
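The predicate change can be sketched as follows; the `Op` fields here are illustrative stand-ins for the PgsqlOp accessors, not the real C++ API. Under the old check, a locking read (SELECT FOR UPDATE) counted as a catalog write; under the new check, only true writes do.

```python
from dataclasses import dataclass

@dataclass
class Op:
    is_read: bool
    has_row_mark: bool = False  # set for locking reads such as SELECT FOR UPDATE

    @property
    def is_write(self):
        return not self.is_read

def is_read_only(op):
    # Mirrors IsReadOnly: a row-marked (locking) read is NOT read-only.
    return op.is_read and not op.has_row_mark

def old_counts_as_write(op):
    return not is_read_only(op)   # the old !IsReadOnly(*op) check

def new_counts_as_write(op):
    return op.is_write            # the new op->is_write() check
```

Plain writes are classified identically by both checks; only the SELECT FOR UPDATE case changes, which removes the spurious assertion failure.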
Jira: DB-16767

Test Plan: Jenkins

Reviewers: pjain, myang, smishra

Reviewed By: pjain, myang

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D44093
Summary:
Commit b5b495d incorrectly translates
some logic, causing system and copartition secondary index scans to lose
the single-RPC optimization to embed a table and index scan together in
the same RPC.  Fix the logic and add tests to cover the cases.
Jira: DB-16789

Test Plan:
On Almalinux 8:

    ./yb_build.sh fastdebug --gcc11 daemons initdb \
      --cxx-test pgwrapper_pg_libpq-test \
      --gtest_filter PgLibPqTest.Embedded\*

Close: yugabyte#27294

Reviewers: sanketh

Reviewed By: sanketh

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D44085
Summary: Implement cGroup via node-agent

Test Plan: manual testing

Reviewers: nsingh

Reviewed By: nsingh

Differential Revision: https://phorge.dev.yugabyte.com/D44096
Summary: dumpRoleChecks can be null

Test Plan: itests

Reviewers: vkumar

Reviewed By: vkumar

Differential Revision: https://phorge.dev.yugabyte.com/D44107
Summary:
Skip collection of WARN logs in the support bundle.
Reason: WARN logs are already present in the INFO logs, so collecting them separately just wastes space in the support bundle.

Test Plan:
Manually tested.
Run itests.
Run UTs

Reviewers: vkumar

Reviewed By: vkumar

Differential Revision: https://phorge.dev.yugabyte.com/D44075
iSignal pushed a commit that referenced this pull request Jun 6, 2025
Summary:
After commit f85bbca, the vmodule flag is no longer respected by the postgres process. For example:
```
ybd release --cxx-test pgwrapper_pg_analyze-test --gtest_filter PgAnalyzeTest.AnalyzeSamplingColocated --test-args '--vmodule=pg_sample=1' -n 2 -- -p 1 -k
zgrep pg_sample ~/logs/latest_test/1.log
```
shows no vlogs.

The reason is that `VLOG(1)` is used early by
```
#0  0x00007f7e1b48b090 in google::InitVLOG3__(google::SiteFlag*, int*, char const*, int)@plt () from /net/dev-server-timur/share/code/yugabyte-db/build/debug-clang19-dynamic-ninja/lib/libyb_util_shmem.so
#1  0x00007f7e1b47616e in yb::(anonymous namespace)::NegotiatorSharedState::WaitProposal (this=0x7f7e215e8000) at ../../src/yb/util/shmem/reserved_address_segment.cc:108
#2  0x00007f7e1b4781e0 in yb::AddressSegmentNegotiator::Impl::NegotiateChild (fd=45) at ../../src/yb/util/shmem/reserved_address_segment.cc:252
#3  0x00007f7e1b4737ce in yb::AddressSegmentNegotiator::NegotiateChild (fd=45) at ../../src/yb/util/shmem/reserved_address_segment.cc:376
#4  0x00007f7e1b742b7b in yb::tserver::SharedMemoryManager::InitializePostmaster (this=0x7f7e202e9788 <yb::pggate::PgSharedMemoryManager()::shared_mem_manager>, fd=45) at ../../src/yb/tserver/tserver_shared_mem.cc:252
#5  0x00007f7e2023588f in yb::pggate::PgSetupSharedMemoryAddressSegment () at ../../src/yb/yql/pggate/pg_shared_mem.cc:29
#6  0x00007f7e202788e9 in YBCSetupSharedMemoryAddressSegment () at ../../src/yb/yql/pggate/ybc_pg_shared_mem.cc:22
#7  0x000055636b8956f5 in PostmasterMain (argc=21, argv=0x52937fe4e790) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1083
#8  0x000055636b774bfe in PostgresServerProcessMain (argc=21, argv=0x52937fe4e790) at ../../../../../../src/postgres/src/backend/main/main.c:209
#9  0x000055636b7751f2 in main ()
```
which caches the `vmodule` value before `InitGFlags` sets it from the environment.

The fix is to explicitly call `UpdateVmodule` from `InitGFlags` after setting `vmodule`.
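The failure mode, sketched in Python (the cache and function names are illustrative, not glog's actual internals): a logging helper snapshots the vmodule setting on first use; if that first use happens before initialization reads the environment, the setting never takes effect. The fix mirrors the explicit `UpdateVmodule` call added to `InitGFlags`.

```python
_cached_vmodule = None  # per-site cache, populated on first VLOG use

def vlog_enabled(module, level, vmodule_setting):
    global _cached_vmodule
    if _cached_vmodule is None:
        # First use caches the setting; an early call freezes an empty one.
        _cached_vmodule = dict(vmodule_setting)
    return _cached_vmodule.get(module, 0) >= level

def update_vmodule(vmodule_setting):
    """Explicitly refresh the cache after flags are (re)initialized."""
    global _cached_vmodule
    _cached_vmodule = dict(vmodule_setting)
```

Without the refresh, later changes to the setting are invisible to call sites that already cached it, which is exactly why `--vmodule=pg_sample=1` produced no vlogs.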
Jira: DB-15888

Test Plan:
```
ybd release --cxx-test pgwrapper_pg_analyze-test --gtest_filter PgAnalyzeTest.AnalyzeSamplingColocated --test-args '--vmodule=pg_sample=1' -n 2 -- -p 1 -k
zgrep pg_sample ~/logs/latest_test/1.log
```

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: ybase, yql

Tags: #jenkins-ready, #jenkins-trigger

Differential Revision: https://phorge.dev.yugabyte.com/D42731
iSignal pushed a commit that referenced this pull request Jun 6, 2025
…rdup for tablegroup_name

Summary:
As part of D36859 / 0dbe7d6, backup and restore support for colocated tables when multiple tablespaces exist was introduced. Upon
fetching the tablegroup_name from `pg_yb_tablegroup`, the value was read and assigned via `PQgetvalue` without copying. This led to a use-after-free bug when the
tablegroup_name was later read in dumpTableSchema since the result from the SQL query is immediately cleared in the next line (`PQclear`).

```
[P-yb-controller-1] ==3037==ERROR: AddressSanitizer: heap-use-after-free on address 0x51d0002013e6 at pc 0x55615b0a1f92 bp 0x7fff92475970 sp 0x7fff92475118
[P-yb-controller-1] READ of size 8 at 0x51d0002013e6 thread T0
[P-yb-controller-1]     #0 0x55615b0a1f91 in strcmp ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:470:5
[P-yb-controller-1]     #1 0x55615b1b90ba in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15789:8
[P-yb-controller-1]     #2 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1]     #3 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1]     #4 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1]     #5 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
[P-yb-controller-1]     #6 0x55615b0894bd in _start (${BUILD_ROOT}/postgres/bin/ysql_dump+0x10d4bd)
[P-yb-controller-1]
[P-yb-controller-1] 0x51d0002013e6 is located 358 bytes inside of 2048-byte region [0x51d000201280,0x51d000201a80)
[P-yb-controller-1] freed by thread T0 here:
[P-yb-controller-1]     #0 0x55615b127196 in free ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
[P-yb-controller-1]     #1 0x7f3c02d65e85 in PQclear ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:755:3
[P-yb-controller-1]     #2 0x55615b1c0103 in getYbTablePropertiesAndReloptions ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:19108:4
[P-yb-controller-1]     #3 0x55615b1b8fab in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15765:3
[P-yb-controller-1]     #4 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1]     #5 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1]     #6 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1]     #7 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
[P-yb-controller-1]
[P-yb-controller-1] previously allocated by thread T0 here:
[P-yb-controller-1]     #0 0x55615b12742f in malloc ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
[P-yb-controller-1]     #1 0x7f3c02d680a7 in pqResultAlloc ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:633:28
[P-yb-controller-1]     #2 0x7f3c02d81294 in getRowDescriptions ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-protocol3.c:544:4
[P-yb-controller-1]     #3 0x7f3c02d7f793 in pqParseInput3 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-protocol3.c:324:11
[P-yb-controller-1]     #4 0x7f3c02d6bcc8 in parseInput ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2014:2
[P-yb-controller-1]     #5 0x7f3c02d6bcc8 in PQgetResult ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2100:3
[P-yb-controller-1]     #6 0x7f3c02d6cd87 in PQexecFinish ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2417:19
[P-yb-controller-1]     #7 0x7f3c02d6cd87 in PQexec ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2256:9
[P-yb-controller-1]     #8 0x55615b1f45df in ExecuteSqlQuery ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_backup_db.c:296:8
[P-yb-controller-1]     #9 0x55615b1f4213 in ExecuteSqlQueryForSingleRow ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_backup_db.c:311:8
[P-yb-controller-1]     #10 0x55615b1c008d in getYbTablePropertiesAndReloptions ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:19102:10
[P-yb-controller-1]     #11 0x55615b1b8fab in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15765:3
[P-yb-controller-1]     yugabyte#12 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1]     yugabyte#13 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1]     yugabyte#14 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1]     yugabyte#15 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
```

This revision fixes the issue by using pg_strdup to make a copy of the string.
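The lifetime bug can be modeled in Python: `PQgetvalue` returns a pointer into storage owned by the result, which `PQclear` frees. The sketch below mimics that with a result object whose `clear()` invalidates all handed-out views; the fix is to copy the value first, as `pg_strdup` does. The classes are illustrative, not libpq's actual API.

```python
class Result:
    """Mimics a libpq PGresult: values are views into result-owned
    storage that clear() invalidates (like PQclear freeing memory)."""

    def __init__(self, value):
        self._buf = bytearray(value)
        self._views = []

    def get_value(self):
        view = memoryview(self._buf)  # borrowed, not a copy (like PQgetvalue)
        self._views.append(view)
        return view

    def clear(self):
        for v in self._views:         # invalidate everything handed out
            v.release()

def fetch_tablegroup_name_buggy(result):
    name = result.get_value()  # borrowed reference into the result
    result.clear()
    return bytes(name)         # use-after-free: the view is already released

def fetch_tablegroup_name_fixed(result):
    name = bytes(result.get_value())  # copy first, as pg_strdup does
    result.clear()
    return name
```

In C the buggy version reads freed memory (the ASAN report above); in this sketch the released view raises instead of silently returning garbage.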
Jira: DB-15915

Test Plan: ./yb_build.sh asan --cxx-test integration-tests_xcluster_ddl_replication-test --gtest_filter XClusterDDLReplicationTest.DDLReplicationTablesNotColocated

Reviewers: aagrawal, skumar, mlillibridge, sergei

Reviewed By: aagrawal, sergei

Subscribers: sergei, yql

Differential Revision: https://phorge.dev.yugabyte.com/D43386
ddhodge pushed a commit that referenced this pull request Jun 14, 2025
…ck/release functions at TabletService

Summary:
In the functions `TabletServiceImpl::AcquireObjectLocks` and `TabletServiceImpl::ReleaseObjectLocks`, we weren't returning after executing the RPC callback when the initial validation steps failed. This led to SIGSEGV issues like the one below:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) [inlined]
std::__1::unique_ptr<yb::tserver::TSLocalLockManager::Impl, std::__1::default_delete<yb::tserver::TSLocalLockManager::Impl>>::operator->[abi:ne190100](this=0x0000000000000000) const at unique_ptr.h:272:108
    frame #1: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) [inlined]
yb::tserver::TSLocalLockManager::AcquireObjectLocksAsync(this=0x0000000000000000, req=0x00005001bfffa290, deadline=yb::CoarseTimePoint @ x23, callback=0x0000ffefb6066560, wait=(value_ = true)) at ts_local_lock_manager.cc:541:3
    frame #2: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(this=0x00005001bdaf6020, req=0x00005001bfffa290, resp=0x00005001bfffa300, context=<unavailable>) at tablet_service.cc:3673:26
    frame #3: 0x0000aaaac36bd9a0 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
yb::tserver::TabletServerServiceIf::InitMethods(this=<unavailable>, req=0x00005001bfffa290, resp=0x00005001bfffa300, rpc_context=RpcContext @ 0x0000ffefb6066600)::$_36::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>)
const::'lambda'(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext)::operator()(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*,
yb::rpc::RpcContext) const at tserver_service.service.cc:1470:9
    frame #4: 0x0000aaaac36bd978 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) at
local_call.h:126:7
    frame #5: 0x0000aaaac36bd680 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36::operator()(this=<unavailable>, call=<unavailable>) const at tserver_service.service.cc:1468:7
    frame #6: 0x0000aaaac36bd5c8 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
decltype(std::declval<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&>()(std::declval<std::__1::shared_ptr<yb::rpc::InboundCall>>()))
std::__1::__invoke[abi:ne190100]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&, std::__1::shared_ptr<yb::rpc::InboundCall>>(__f=<unavailable>, __args=<unavailable>) at invoke.h:149:25
    frame #7: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] void
std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne190100]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&, std::__1::shared_ptr<yb::rpc::InboundCall>>(__args=<unavailable>,
__args=<unavailable>) at invoke.h:224:5
    frame #8: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
std::__1::__function::__alloc_func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity>
const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ne190100](this=<unavailable>, __arg=<unavailable>) at function.h:171:12
    frame #9: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(this=<unavailable>, __arg=<unavailable>) at
function.h:313:10
    frame #10: 0x0000aaaac36d1384 yb-tserver`yb::tserver::TabletServerServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::__function::__value_func<void
(std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ne190100](this=<unavailable>, __args=nullptr) const at function.h:430:12
    frame #11: 0x0000aaaac36d136c yb-tserver`yb::tserver::TabletServerServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::function<void
(std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(this=<unavailable>, __arg=nullptr) const at function.h:989:10
    frame #12: 0x0000aaaac36d136c yb-tserver`yb::tserver::TabletServerServiceIf::Handle(this=<unavailable>, call=<unavailable>) at tserver_service.service.cc:913:3
    frame #13: 0x0000aaaac30e05b4 yb-tserver`yb::rpc::ServicePoolImpl::Handle(this=0x00005001bff9b8c0, incoming=nullptr) at service_pool.cc:275:19
    frame #14: 0x0000aaaac3006ed0 yb-tserver`yb::rpc::InboundCall::InboundCallTask::Run(this=<unavailable>) at inbound_call.cc:309:13
    frame #15: 0x0000aaaac30ec868 yb-tserver`yb::rpc::(anonymous namespace)::Worker::Execute(this=0x00005001bff5c640, task=0x00005001bfdf1958) at thread_pool.cc:138:13
    frame #16: 0x0000aaaac39afd18 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ne190100](this=0x00005001bfe1e750) const at function.h:430:12
    frame #17: 0x0000aaaac39afd04 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x00005001bfe1e750) const at function.h:989:10
    frame #18: 0x0000aaaac39afd04 yb-tserver`yb::Thread::SuperviseThread(arg=0x00005001bfe1e6e0) at thread.cc:937:3
```

This revision addresses the issue by returning immediately after executing the RPC callback with the validation failure status.
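The shape of the fix can be sketched as follows. This is an illustrative model, not the actual tablet-service code: the names `Handler`, `AcquireObjectLocks`, and `respond` are assumptions standing in for the RPC handler and its callback. The point is the `return` after the callback fires on validation failure, which keeps control from falling through to code that dereferences unset state (the null lock manager in the crash above).

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical sketch of the RPC handler fix described in the summary.
struct Handler {
  void* lock_manager = nullptr;      // simulates the lock manager being unset
  bool touched_lock_manager = false;

  void AcquireObjectLocks(bool validation_ok,
                          const std::function<void(const std::string&)>& respond) {
    if (!validation_ok) {
      respond("validation failed");
      return;  // the missing early return that caused the SIGSEGV
    }
    // Only reached when validation passed and lock_manager is expected to be set.
    touched_lock_manager = true;
  }
};
```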
Jira: DB-17124

Test Plan: Jenkins

Reviewers: rthallam, amitanand

Reviewed By: amitanand

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D44663
ddhodge pushed a commit that referenced this pull request Jun 14, 2025
…own flags are set at ObjectLockManager

Summary:
In the context of object locking, commit 6e80c56 / D44228 removed the logic that signaled obsolete waiters corresponding to transactions that issued a release-all-locks request (for example, after failures like timeout or deadlock) in order to early-terminate failed waiting requests. Hence, we now let the obsolete requests terminate organically from the OLM, resumed by the poller thread that runs at an interval of `olm_poll_interval_ms` (defaults to 100ms).

This led to one of the itests failing with the stack below:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000aaaac8a093ec yb-tserver`yb::ThreadPoolToken::SubmitFunc(std::__1::function<void ()>) [inlined] yb::ThreadPoolToken::Submit(this=<unavailable>, r=<unavailable>) at threadpool.cc:146:10
    frame #1: 0x0000aaaac8a093ec yb-tserver`yb::ThreadPoolToken::SubmitFunc(this=0x0000000000000000, f=<unavailable>) at threadpool.cc:142:10
    frame #2: 0x0000aaaac73cdfe8 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoSignal(this=0x00003342bfa0d400, entry=<unavailable>) at object_lock_manager.cc:767:3
    frame #3: 0x0000aaaac73cc7c0 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoLock(std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>, yb::docdb::LockData&&, yb::StronglyTypedBool<yb::docdb::(anonymous
namespace)::IsLockRetry_Tag>, unsigned long, yb::Status) [inlined] yb::docdb::ObjectLockManagerImpl::PrepareAcquire(this=0x00003342bfa0d400, txn_lock=<unavailable>, transaction_entry=std::__1::shared_ptr<yb::docdb::(anonymous
namespace)::TrackedTransactionLockEntry>::element_type @ 0x00003342bfa94a38, data=0x00003342b9a6a830, resume_it_offset=<unavailable>, resume_with_status=<unavailable>) at object_lock_manager.cc:523:5
    frame #4: 0x0000aaaac73cc6a8 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoLock(this=0x00003342bfa0d400, transaction_entry=std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>::element_type @
0x00003342bfa94a38, data=0x00003342b9a6a830, is_retry=(value_ = true), resume_it_offset=<unavailable>, resume_with_status=Status @ 0x0000ffefaa036658) at object_lock_manager.cc:552:27
    frame #5: 0x0000aaaac73cbcb4 yb-tserver`yb::docdb::WaiterEntry::Resume(this=0x00003342b9a6a820, lock_manager=0x00003342bfa0d400, resume_with_status=<unavailable>) at object_lock_manager.cc:381:17
    frame #6: 0x0000aaaac85bdd4c yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() at object_lock_manager.cc:752:13
    frame #7: 0x0000aaaac85bda74 yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() [inlined] yb::docdb::ObjectLockManager::Shutdown(this=<unavailable>) at object_lock_manager.cc:1092:10
    frame #8: 0x0000aaaac85bda6c yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() [inlined] yb::tserver::TSLocalLockManager::Impl::Shutdown(this=<unavailable>) at ts_local_lock_manager.cc:411:26
    frame #9: 0x0000aaaac85bd7e8 yb-tserver`yb::tserver::TSLocalLockManager::Shutdown(this=<unavailable>) at ts_local_lock_manager.cc:566:10
    frame #10: 0x0000aaaac8665a34 yb-tserver`yb::tserver::YsqlLeasePoller::Poll() [inlined] yb::tserver::TabletServer::ResetAndGetTSLocalLockManager(this=0x000033423fc1ad80) at tablet_server.cc:797:28
    frame #11: 0x0000aaaac8665a18 yb-tserver`yb::tserver::YsqlLeasePoller::Poll() [inlined] yb::tserver::TabletServer::ProcessLeaseUpdate(this=0x000033423fc1ad80, lease_refresh_info=0x000033423a476b80) at tablet_server.cc:828:22
    frame #12: 0x0000aaaac8665950 yb-tserver`yb::tserver::YsqlLeasePoller::Poll(this=<unavailable>) at ysql_lease_poller.cc:143:18
    frame #13: 0x0000aaaac8438d58 yb-tserver`yb::tserver::MasterLeaderPollScheduler::Impl::Run(this=0x000033423ff5cc80) at master_leader_poller.cc:125:25
    frame #14: 0x0000aaaac89ffd18 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ne190100](this=0x000033423ffc7930) const at function.h:430:12
    frame #15: 0x0000aaaac89ffd04 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000033423ffc7930) const at function.h:989:10
    frame #16: 0x0000aaaac89ffd04 yb-tserver`yb::Thread::SuperviseThread(arg=0x000033423ffc78c0) at thread.cc:937:3
    frame #17: 0x0000ffffac0378b8 libpthread.so.0`start_thread + 392
    frame #18: 0x0000ffffac093afc libc.so.6`thread_start + 12
```
This is due to accessing unique_ptr `thread_pool_token_` after it has been reset.

This revision fixes the issue by not scheduling any tasks on the threadpool once the shutdown flag has been set (and hence not accessing `thread_pool_token_`). Since we wait for in-progress requests at the OLM, and also for in-progress resume tasks scheduled on the messenger using `waiters_amidst_resumption_on_messenger_`, it is safe to say that `thread_pool_token_` will not be accessed once it is reset.
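A minimal sketch of this guard, under the assumption that the shutdown flag and the token are protected by the same mutex; the names `MiniLockManager`, `DoSignal`, and `pool_token` are illustrative, not the actual ObjectLockManager API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch: skip scheduling once shutdown has begun so the
// (soon to be reset) pool token is never dereferenced.
struct MiniLockManager {
  std::mutex mu;
  bool shutting_down = false;
  // Stands in for thread_pool_token_; a vector models the task queue.
  std::unique_ptr<std::vector<std::function<void()>>> pool_token =
      std::make_unique<std::vector<std::function<void()>>>();

  bool DoSignal(std::function<void()> task) {
    std::lock_guard<std::mutex> l(mu);
    if (shutting_down) return false;         // new guard: don't touch the token
    pool_token->push_back(std::move(task));  // safe: token is still alive
    return true;
  }

  void Shutdown() {
    std::lock_guard<std::mutex> l(mu);
    shutting_down = true;
    pool_token.reset();  // after this, DoSignal must never dereference it
  }
};
```

Checking the flag under the same lock that orders `Shutdown()` is what makes the "never accessed after reset" claim hold.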
Jira: DB-17121

Test Plan:
Jenkins

./yb_build.sh --cxx-test='TEST_F(PgObjectLocksTestRF1, TestShutdownWithWaiters) {'

Reviewers: rthallam, amitanand, sergei

Reviewed By: amitanand

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D44662
iSignal pushed a commit that referenced this pull request Jul 10, 2025
…ow during index backfill.

Summary:
In the last few weeks we have seen a few instances of the stress test (with various nemeses)
run into a master crash with a stack trace that looks like:

```
 * thread #1, name = 'yb-master', stop reason = signal SIGSEGV: invalid address
   * frame #0: 0x0000aaaad52f5fc4 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::shared_ptr<yb::master::BackfillTablet>::shared_ptr[abi:ue170006]<yb::master::BackfillTablet, void>(this=<unavailable>, __r=std::__1:: weak_ptr<yb::master::BackfillTablet>::element_type @ 0x000013e4bf787778) at shared_ptr.h:701:20
     frame #1: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::enable_shared_from_this<yb::master::BackfillTablet>::shared_from_this[abi:ue170006](this=0x000013e4bf787778) at shared_ptr.h:1954:17
     frame #2: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=0x000013e4bf787778) at backfill_index.cc:1300:50
     frame #3: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323: 10
     frame #4: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4d458) at backfill_index.cc:1620:5
     frame #5: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:470:3
     frame #6: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:273:5
     frame #7: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4d458) at backfill_index.cc:1463:19
     frame #8: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #9: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323: 10
     frame #10: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cd98) at backfill_index.cc:1620:5
     frame #11: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:470:3
     frame #12: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:273:5
     frame #13: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cd98) at backfill_index.cc:1463:19
     frame #14: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #15: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:     1323:10
     frame #16: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cfd8) at backfill_index.cc:1620:5
     frame #17: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:470:3
     frame #18: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:273:5
     frame #19: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cfd8) at backfill_index.cc:1463:19
     frame #20: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #21: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:     1323:10

...

   frame #2452: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bdc7ed98) at backfill_index.cc:1620:5
     frame #2453: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:470:3
     frame #2454: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:273:5
     frame #2455: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bdc7ed98) at backfill_index.cc:1463:19
     frame #2456: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #2457: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:   1323:10
     frame #2458: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4ba1ff458) at backfill_index.cc:1620:5
     frame #2459: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:470:3
     frame #2460: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:273:5
     frame #2461: 0x0000aaaad52c0260 yb-master`yb::master::RetryingRpcTask::RunDelayedTask(this=0x000013e4ba1ff458, status=0x0000ffffab2668c0) at async_rpc_tasks.cc:432:14
     frame #2462: 0x0000aaaad5c3f838 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] boost::function1<void, yb::Status         const&>::operator()(this=0x000013e4bff63b18, a0=0x0000ffffab2668c0) const at function_template.hpp:763:14
     frame #2463: 0x0000aaaad5c3f81c yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] yb::rpc::DelayedTask::                    TimerHandler(this=0x000013e4bff63ae8, watcher=<unavailable>, revents=<unavailable>) at delayed_task.cc:155:5
     frame #2464: 0x0000aaaad5c3f284 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(loop=<unavailable>, w=<unavailable>, revents=<unavailable>) at ev++.h:479:7
     frame #2465: 0x0000aaaad4cdf170 yb-master`ev_invoke_pending + 112
     frame #2466: 0x0000aaaad4ce21fc yb-master`ev_run + 2940
     frame #2467: 0x0000aaaad5c725fc yb-master`yb::rpc::Reactor::RunThread() [inlined] ev::loop_ref::run(this=0x000013e4bfcfadf8, flags=0) at ev++.h:211:7
     frame #2468: 0x0000aaaad5c725f4 yb-master`yb::rpc::Reactor::RunThread(this=0x000013e4bfcfadc0) at reactor.cc:735:9
     frame #2469: 0x0000aaaad65c61d8 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000013e4bfeffa80) const at function.h:517:16
     frame #2470: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000013e4bfeffa80) const at function.h:1168:12
     frame #2471: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(arg=0x000013e4bfeffa20) at thread.cc:895:3
```

Essentially, a BackfillChunk is considered done (without sending out an RPC) and launches the next BackfillChunk, which does the same.

This may happen if `BackfillTable::indexes_to_build()` is empty, or if `backfill_jobs()` is empty. However, based on reading the code,
we should only get there **after** marking `BackfillTable::done_` as `true`.

If for some reason `indexes_to_build()` is empty while `BackfillTable::done_ == false`, we could get into this infinite recursion.

Since I am unable to explain and recreate how this happens, I'm adding a test flag `TEST_simulate_empty_indexes` to repro this.

Fix: We update `BackfillChunk::SendRequest` to handle the empty `indexes_to_build()` as a failure rather than treating this as a success.
This prevents the infinite recursion.
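A minimal model of why this breaks the cycle. The names here (`MiniBackfill`, `SendRequest`, `indexes`) are illustrative stand-ins for the master code, not the real classes: with no indexes to build, the old behavior reported success, so `Done()` launched the next chunk and the recursion never bottomed out; returning failure terminates the chain after a single launch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the backfill fix: an empty indexes_to_build()
// is reported as a failure rather than a success.
struct MiniBackfill {
  std::vector<int> indexes;  // stands in for indexes_to_build()
  int launches = 0;
  bool failed = false;

  bool SendRequest() {
    ++launches;
    if (indexes.empty()) {
      failed = true;  // previously treated as success -> infinite recursion
      return false;
    }
    return true;  // the real code would send the backfill RPC here
  }
};
```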

Also, adding a few log lines that may help better understand the scenario if we run into this again.
Jira: DB-17296

Test Plan: yb_build.sh fastdebug  --cxx-test pg_index_backfill-test --gtest_filter *.SimulateEmptyIndexesForStackOverflow*

Reviewers: zdrudi, rthallam, jason

Reviewed By: zdrudi

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D45031
ddhodge pushed a commit that referenced this pull request Aug 20, 2025
…s closed in multi route pooling

Summary:
**Issue Summary**

A core dump was triggered during a ConnectionBurst stress test, with the crash occurring in the `od_backend_close_connection` function under multi-route pooling. The stack trace is as follows:

```
frame #0: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] mm_tls_free(io=0x0000000000000000) at tls.c:91:10
    frame #1: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] machine_io_free(obj=0x0000000000000000) at io.c:201:2
    frame #2: 0x00005601a627129e odyssey`od_backend_close_connection [inlined] od_io_close(io=0x000031f53e72b8b8) at io.h:77:2
    frame #3: 0x00005601a627128c odyssey`od_backend_close_connection(server=0x000031f53e72b880) at backend.c:56:2
    frame #4: 0x00005601a6250de5 odyssey`od_router_attach(router=0x00007fff00dbeb30, client_for_router=0x000031f53e5df180, wait_for_idle=<unavailable>, external_client=0x000031f53ee30680) at router.c:1010:6
    frame #5: 0x00005601a6258b1b odyssey`od_auth_frontend [inlined] yb_execute_on_control_connection(client=0x000031f53ee30680, function=<unavailable>) at frontend.c:2842:11
    frame #6: 0x00005601a6258b0b odyssey`od_auth_frontend(client=0x000031f53ee30680) at auth.c:677:8
    frame #7: 0x00005601a626782e odyssey`od_frontend(arg=0x000031f53ee30680) at frontend.c:2539:8
    frame #8: 0x00005601a6290912 odyssey`mm_scheduler_main(arg=0x000031f53e390000) at scheduler.c:17:2
    frame #9: 0x00005601a6290b77 odyssey`mm_context_runner at context.c:28:2
```

**Root Cause**

The crash originated from an improper lock release in the `yb_get_idle_server_to_close` function, introduced in commit 55beeb0 during the multi-route pooling implementation. The function released the lock on the route object, despite a comment explicitly warning against it. After returning to its caller, no lock was held on the route or idle_route. This allowed other coroutines to access and use the same route and its idle server, which the original coroutine intended to close. This race condition led to a crash due to an assertion failure during connection closure.

**Note**
If the order of acquiring locks is the same across all threads or processes, differences in the release order alone cannot cause a deadlock. Deadlocks arise from circular dependencies during acquisition, not release.

In the connection manager code base:

- Locks are acquired in the order: router → route. This order must be strictly enforced everywhere to prevent deadlocks.
- Lock release order varies (e.g., router then route in `od_router_route` and `yb_get_idle_server_to_close`, versus the reverse elsewhere). This variation does not cause deadlocks, as release order is irrelevant to deadlock prevention.
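The acquisition-order invariant can be sketched as below. This is an illustrative model, not the odyssey code: `Pools`, `router_mu`, `route_mu`, and `CloseIdleServer` are assumed names mirroring the summary's router → route ordering, and the key point is that the route stays locked while its idle server is being closed, so no other coroutine can grab it mid-close.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical sketch of the router -> route lock ordering invariant.
struct Pools {
  std::mutex router_mu;
  std::mutex route_mu;
  int closed_idle = 0;

  void CloseIdleServer() {
    std::lock_guard<std::mutex> router(router_mu);  // always acquired first
    std::lock_guard<std::mutex> route(route_mu);    // always acquired second
    ++closed_idle;  // route remains locked while its idle server is closed
  }
};
```

With every path acquiring in this same order, the release order (here, route before router as the guards unwind) is irrelevant to deadlock freedom.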
Jira: DB-17501

Test Plan: Jenkins: all tests

Reviewers: skumar, vikram.damle, asrinivasan, arpit.saxena

Reviewed By: skumar

Subscribers: svc_phabricator, yql

Differential Revision: https://phorge.dev.yugabyte.com/D45641
iSignal pushed a commit that referenced this pull request Nov 26, 2025
Summary:
The stacktrace of the core dump:
```
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000aaaac59fb720 postgres`FreeTupleDesc [inlined] GetMemoryChunkContext(pointer=0x0000000000000000) at memutils.h:141:12
    frame #1: 0x0000aaaac59fb710 postgres`FreeTupleDesc [inlined] pfree(pointer=0x0000000000000000) at mcxt.c:1500:26
    frame #2: 0x0000aaaac59fb710 postgres`FreeTupleDesc(tupdesc=0x000013d7fd8dccc8) at tupdesc.c:326:5
    frame #3: 0x0000aaaac61c7204 postgres`RelationDestroyRelation(relation=0x000013d7fd8dc9a8, remember_tupdesc=false) at relcache.c:4577:4
    frame #4: 0x0000aaaac5febab8 postgres`YBRefreshCache at relcache.c:5216:3
    frame #5: 0x0000aaaac5feba94 postgres`YBRefreshCache at postgres.c:4442:2
    frame #6: 0x0000aaaac5feb50c postgres`YBRefreshCacheWrapperImpl(catalog_master_version=0, is_retry=false, full_refresh_allowed=true) at postgres.c:4570:3
    frame #7: 0x0000aaaac5feea34 postgres`PostgresMain [inlined] YBRefreshCacheWrapper(catalog_master_version=0, is_retry=false) at postgres.c:4586:9
    frame #8: 0x0000aaaac5feea2c postgres`PostgresMain [inlined] YBCheckSharedCatalogCacheVersion at postgres.c:4951:3
    frame #9: 0x0000aaaac5fee984 postgres`PostgresMain(dbname=<unavailable>, username=<unavailable>) at postgres.c:6574:4
    frame #10: 0x0000aaaac5efe5b4 postgres`BackendRun(port=0x000013d7ffc06400) at postmaster.c:4995:2
    frame #11: 0x0000aaaac5efdd08 postgres`ServerLoop [inlined] BackendStartup(port=0x000013d7ffc06400) at postmaster.c:4701:3
    frame #12: 0x0000aaaac5efdc70 postgres`ServerLoop at postmaster.c:1908:7
    frame #13: 0x0000aaaac5ef8ef8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) at postmaster.c:1562:11
    frame #14: 0x0000aaaac5ddae1c postgres`PostgresServerProcessMain(argc=25, argv=0x000013d7ffe068f0) at main.c:213:3
    frame #15: 0x0000aaaac59dee38 postgres`main + 36
    frame #16: 0x0000ffff9f606340 libc.so.6`__libc_start_call_main + 112
    frame #17: 0x0000ffff9f606418 libc.so.6`__libc_start_main@@GLIBC_2.34 + 152
    frame #18: 0x0000aaaac59ded34 postgres`_start + 52
```
It is related to invalidation messages. The test involves concurrent DDL execution without object locking.

I added a few logs to help debug this issue.

Test Plan:
(1)
Append to the end of file ./build/latest/postgres/share/postgresql.conf.sample:

```
yb_debug_log_catcache_events=1
log_min_messages=DEBUG1
```

(2) Create a RF-1 cluster
```
./bin/yb-ctl create --rf 1
```

(3) Run the following example via ysqlsh:
```
-- === 1. SETUP ===
DROP TABLE IF EXISTS accounts_timetravel;
CREATE TABLE accounts_timetravel (
  id INT PRIMARY KEY,
  balance INT,
  last_updated TIMESTAMPTZ
);

INSERT INTO accounts_timetravel VALUES (1, 1000, now());

\echo '--- 1. Initial Data (The Past) ---'
SELECT * FROM accounts_timetravel;

-- Wait 2 seconds
SELECT pg_sleep(2);

-- === 2. CAPTURE THE "PAST" HLC TIMESTAMP ===
--
--    *** THIS IS THE FIX ***
--    Get the current time as seconds from the Unix epoch,
--    multiply by 1,000,000 to get microseconds,
--    and cast to a big integer.
--
SELECT (EXTRACT(EPOCH FROM now())*1000000)::bigint AS snapshot_hlc \gset
SELECT :snapshot_hlc;
\echo '--- (Snapshot HLC captured) ---'

SELECT * FROM pg_yb_catalog_version;

-- Wait 2 more seconds
SELECT pg_sleep(2);

-- === 3. UPDATE THE DATA ===
UPDATE accounts_timetravel SET balance = 500, last_updated = now() WHERE id = 1;

\echo '--- 2. New Data (The Present) ---'
SELECT * FROM accounts_timetravel;

CREATE TABLE foo(id int);
-- increment the catalog version
ALTER TABLE foo ADD COLUMN val TEXT;

SELECT * FROM pg_yb_catalog_version;
-- === 4. PERFORM THE TIME-TRAVEL QUERY ===
--
-- Set our 'read_time_guc' variable to the HLC value
--
\set read_time_guc :snapshot_hlc

\echo '--- 3. Time-Travel Read (Querying the Past) ---'
\echo 'Setting yb_read_time to HLC (microseconds):' :read_time_guc

-- This will now be interpolated correctly and will succeed.
SET yb_read_time = :read_time_guc;

-- This query will now correctly read the historical data
SELECT * FROM accounts_timetravel;
SELECT * FROM pg_yb_catalog_version;

-- === 5. CLEANUP ===
RESET yb_read_time;
\echo '--- 4. Back to the Present ---'
SELECT * FROM accounts_timetravel;

DROP TABLE accounts_timetravel;
```

(4) Look at the postgres log for the following samples:

```
2025-11-07 18:31:06.223 UTC [3321231] LOG:  Preloading relcache for database 13524, session user id: 10, yb_read_time: 0
```

```
2025-11-07 18:31:06.303 UTC [3321231] LOG:  Building relcache entry for pg_index (oid 2610) took 785 us
```

```
2025-11-07 18:31:09.265 UTC [3321221] LOG:  Rebuild relcache entry for accounts_timetravel (oid 16384)
```

```
2025-11-07 18:31:09.525 UTC [3321221] LOG:  Delete relcache entry for accounts_timetravel (oid 16384)
```

```
2025-11-07 18:31:14.035 UTC [3321221] DEBUG:  Setting yb_read_time to 1762540271568993
```

```
2025-11-07 18:31:14.037 UTC [3321221] LOG:  Preloading relcache for database 13524, session user id: 13523, yb_read_time: 1762540271568993
```

```
2025-11-07 18:31:14.183 UTC [3321221] DEBUG:  Setting yb_read_time to 0
```

Reviewers: kfranz, #db-approvers

Reviewed By: kfranz, #db-approvers

Subscribers: jason, yql

Differential Revision: https://phorge.dev.yugabyte.com/D48114