@iSignal iSignal commented Jun 25, 2020

test

@iSignal iSignal force-pushed the yb-ctl branch 15 times, most recently from 1b339e6 to e2764ec Compare June 26, 2020 16:14
iSignal added 2 commits June 26, 2020 11:52
Summary:
Making modifications in a secondary repo and pulling them in is tiring. It is not clear that
the benefits of a separate repo are worth this process.

Test Plan:
Run yb-ctl in dev repo. TODO: Build a package and verify yb-ctl works properly from
within the package

Reviewers: bogdan, mikhail, jason

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8674
iSignal pushed a commit that referenced this pull request Feb 26, 2021
Summary:
This reverts commit 4c0a2fe.

There is currently an issue with the new clock introduced by that commit: it causes a crash in some scenarios when the metric is queried.
```
#0  Now (this=0x742f680192a46a02) at ../../src/yb/common/clock.h:27
#1  lag_ms (this=0x5dadc40) at ../../src/yb/util/metrics.h:1355
#2  yb::AtomicMillisLag::WriteForPrometheus (this=0x5dadc40, writer=0x7fbdc2f84bb0, attr=..., opts=...) at ../../src/yb/util/metrics.h:1374
#3  0x00007fbdedc2ab83 in yb::MetricEntity::WriteForPrometheus (this=<optimized out>, writer=writer@entry=0x7fbdc2f84bb0, opts=...) at ../../src/yb/util/metrics.cc:351
#4  0x00007fbdedc2cf05 in yb::MetricRegistry::WriteForPrometheus (this=this@entry=0x1a70a80, writer=writer@entry=0x7fbdc2f84bb0, opts=...) at ../../src/yb/util/metrics.cc:491
#5  0x00007fbdf2cfe6d0 in yb::(anonymous namespace)::WriteForPrometheus (metrics=0x1a70a80, req=..., resp=0x7fbdc2f84de0) at ../../src/yb/server/default-path-handlers.cc:278
#6  0x00007fbdf2d2d95c in operator() (__args#1=0x7fbdc2f84de0, __args#0=..., this=<optimized out>)
    at /home/yugabyte/yb-software/yugabyte-2.3.0.0-b88-centos-x86_64/linuxbrew-xxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#7  yb::Webserver::RunPathHandler (this=this@entry=0x1d64000, handler=..., connection=connection@entry=0x7d88000, request_info=request_info@entry=0x7d88000) at ../../src/yb/server/webserver.cc:423
#8  0x00007fbdf2d2e5ea in yb::Webserver::BeginRequestCallback (this=0x1d64000, connection=0x7d88000, request_info=0x7d88000) at ../../src/yb/server/webserver.cc:360
#9  0x00007fbdf2d438f6 in handle_request () from /home/yugabyte/yb-software/yugabyte-2.3.0.0-b88-centos-x86_64/lib/yb/libserver_process.so
#10 0x00007fbdf2d464de in worker_thread () from /home/yugabyte/yb-software/yugabyte-2.3.0.0-b88-centos-x86_64/lib/yb/libserver_process.so
#11 0x00007fbde84c5694 in start_thread (arg=0x7fbdc2f8f700) at pthread_create.c:333
#12 0x00007fbde7c0241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

Test Plan: Build and unit tests

Reviewers: bogdan, amitanand, kannan

Reviewed By: kannan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9322
iSignal pushed a commit that referenced this pull request Feb 26, 2021
Summary:
```
WARNING: ThreadSanitizer: data race (pid=11311)
1762	   Read of size 8 at 0x7b74000cfb58 by thread T155:
1763	     #0 std::__1::unique_ptr<rocksdb::DB, std::__1::default_delete<rocksdb::DB> >::operator bool() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200829090443-f431681041-centos/installed/tsan/libcxx/include/c++/v1/memory:2619:19 (libtablet.so+0x211067)
1764	     #1 yb::tablet::Tablet::Flush(yb::tablet::FlushMode, yb::tablet::FlushFlags, long) src/yb/tablet/tablet.cc:1847 (libtablet.so+0x211067)
1765	     #2 yb::tserver::MiniTabletServer::FlushTablets(yb::tablet::FlushMode, yb::tablet::FlushFlags)::$_2::operator()(yb::tablet::TabletPeer*) const src/yb/tserver/mini_tablet_server.cc:200:35 (libtserver.so+0x1a0997)
...
1771	     #8 yb::tserver::MiniTabletServer::FlushTablets(yb::tablet::FlushMode, yb::tablet::FlushFlags) src/yb/tserver/mini_tablet_server.cc:196:10 (libtserver.so+0x19f4e8)
1772	     #9 yb::MiniCluster::FlushTablets(yb::tablet::FlushMode, yb::tablet::FlushFlags) src/yb/integration-tests/mini_cluster.cc:369:5 (libintegration-tests.so+0x10cf88)
1773	     #10 yb::client::QLStressTest_LongRemoteBootstrap_Test::TestBody()::$_8::operator()() const src/yb/client/ql-stress-test.cc:972:7 (ql-stress-test+0x4f2f1d)

Previous write of size 8 at 0x7b74000cfb58 by thread T49 (mutexes: write M263877791325157728):
1780	     #0 std::__1::unique_ptr<rocksdb::DB, std::__1::default_delete<rocksdb::DB> >::reset(rocksdb::DB*) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200829090443-f431681041-centos/installed/tsan/libcxx/include/c++/v1/memory:2632:20 (libtablet.so+0x209c9c)
1781	     #1 yb::tablet::ResetRocksDB(bool, rocksdb::Options const&, std::__1::unique_ptr<rocksdb::DB, std::__1::default_delete<rocksdb::DB> >*) src/yb/tablet/tablet.cc:950 (libtablet.so+0x209c9c)
1782	     #2 yb::tablet::Tablet::ResetRocksDBs(yb::StronglyTypedBool<yb::tablet::Destroy_Tag>, yb::StronglyTypedBool<yb::tablet::DisableFlushOnShutdown_Tag>) src/yb/tablet/tablet.cc:967:27 (libtablet.so+0x209b3b)
1783	     #3 yb::tablet::Tablet::CompleteShutdown(yb::StronglyTypedBool<yb::tablet::IsDropTable_Tag>) src/yb/tablet/tablet.cc:931:3 (libtablet.so+0x205bac)
1784	     #4 yb::tablet::TabletPeer::CompleteShutdown(yb::StronglyTypedBool<yb::tablet::IsDropTable_Tag>) src/yb/tablet/tablet_peer.cc:475:14 (libtablet.so+0x287f71)
1785	     #5 yb::tablet::TabletPeer::Shutdown(yb::StronglyTypedBool<yb::tablet::IsDropTable_Tag>) src/yb/tablet/tablet_peer.cc:529:5 (libtablet.so+0x28896d)
1786	     #6 yb::tserver::TSTabletManager::DeleteTablet(string const&, yb::tablet::TabletDataState, boost::optional<long> const&, boost::optional<yb::tserver::TabletServerErrorPB_Code>*) src/yb/tserver/ts_tablet_manager.cc:1297:16 (libtserver.so+0x22d59e)
1787	     #7 yb::tserver::TabletServiceAdminImpl::DeleteTablet(yb::tserver::DeleteTabletRequestPB const*, yb::tserver::DeleteTabletResponsePB*, yb::rpc::RpcContext) src/yb/tserver/tablet_service.cc:1158:41 (libtserver.so+0x1e1de7)
1788	     #8 yb::tserver::TabletServerAdminServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) src/yb/tserver/tserver_admin.service.cc:130:7 (libtserver_admin_proto.so+0x73cb0)
1789	     #9 yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) src/yb/rpc/service_pool.cc:262:19 (libyrpc.so+0x20aa57)
1790	     #10 yb::rpc::InboundCall::InboundCallTask::Run() src/yb/rpc/inbound_call.cc:212:13 (libyrpc.so+0x1745ee)
```

Added the missing `ScopedRWOperation` to `Tablet::Flush` to avoid destroying tablets during a flush.
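
The general idea of guarding a flush against a concurrent shutdown can be sketched as follows. This is only an illustration under stated assumptions: it uses `std::shared_mutex` as a stand-in for whatever mechanism `ScopedRWOperation` uses internally, and `FakeDb` and `TabletSketch` are hypothetical names, not classes from the codebase.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the storage engine handle (not rocksdb::DB).
struct FakeDb {
  std::atomic<int> flushes{0};
};

// Hypothetical tablet-like class showing the guard pattern only.
class TabletSketch {
 public:
  TabletSketch() : db_(std::make_unique<FakeDb>()) {}

  // Holds a shared lock for the whole flush, so Shutdown() cannot reset
  // db_ between the null check and the use of the pointer.
  // Returns false if the store has already been destroyed.
  bool Flush() {
    std::shared_lock<std::shared_mutex> guard(mutex_);
    if (!db_) {
      return false;
    }
    db_->flushes.fetch_add(1);
    return true;
  }

  // Takes the exclusive lock, so it waits for in-flight flushes to finish
  // before destroying the store.
  void Shutdown() {
    std::unique_lock<std::shared_mutex> guard(mutex_);
    db_.reset();
  }

 private:
  std::shared_mutex mutex_;
  std::unique_ptr<FakeDb> db_;
};
```

In this sketch, a `Flush()` that starts after `Shutdown()` returns `false` instead of dereferencing a null pointer, and a `Flush()` already in progress delays `Shutdown()` until it completes.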

Test Plan: `ybd --remote --dltp tsan --cxx-test client_ql-stress-test --gtest_filter QLStressTest.LongRemoteBootstrap -n 500 -- -p 1`

Reviewers: bogdan, mikhail

Reviewed By: mikhail

Subscribers: zyu, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9541
@iSignal iSignal force-pushed the master branch 5 times, most recently from 4810899 to f571bcb Compare May 25, 2021 20:08
iSignal pushed a commit that referenced this pull request Jul 23, 2021
…Load method

Summary:
The following error in an ASAN test shows that `ConcurrentPod::Load` may cause timestamp overflow when large timeouts (i.e. MonoDelta::kMax) are used.

Logs:
```
/opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20210611151621-82ad1fb3bb-centos7-clang7/installed/asan/libcxx/include/c++/v1/chrono:1205:35: runtime error: signed integer overflow: 2023109119671 + 9223372036854000000 cannot be represented in type 'long long'

    #0 0x7f355b328304 in std::__1::common_type<std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> >, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::type std::__1::chrono::operator+<long long, std::__1::ratio<1l, 1000000000l>, long long, std::__1::ratio<1l, 1000000000l> >(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20210611151621-82ad1fb3bb-centos7-clang7/installed/asan/libcxx/include/c++/v1/chrono:1205:35
    #1 0x7f355b328304 in std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::common_type<std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> >, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::type> std::__1::chrono::operator+<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> >, long long, std::__1::ratio<1l, 1000000000l> >(std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > > const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20210611151621-82ad1fb3bb-centos7-clang7/installed/asan/libcxx/include/c++/v1/chrono:1504
    #2 0x7f355b328304 in yb::ConcurrentPod<boost::asio::ip::basic_endpoint<boost::asio::ip::tcp> >::Load() const $BUILD_ROOT/../../src/yb/util/concurrent_pod.h:42
    #3 0x7f355b32270f in yb::rpc::Proxy::DoAsyncRequest(yb::rpc::RemoteMethod const*, google::protobuf::Message const&, google::protobuf::Message*, yb::rpc::RpcController*, std::__1::function<void ()>, bool) $BUILD_ROOT/../../src/yb/rpc/proxy.cc:200:28
    #4 0x7f355b321d06 in yb::rpc::Proxy::AsyncRequest(yb::rpc::RemoteMethod const*, google::protobuf::Message const&, google::protobuf::Message*, yb::rpc::RpcController*, std::__1::function<void ()>) $BUILD_ROOT/../../src/yb/rpc/proxy.cc:123:3
    #5 0x7f355d0070e8 in yb::tserver::TabletServerServiceProxy::UpdateTransactionAsync(yb::tserver::UpdateTransactionRequestPB const&, yb::tserver::UpdateTransactionResponsePB*, yb::rpc::RpcController*, std::__1::function<void ()>) $BUILD_ROOT/src/yb/tserver/tserver_service.proxy.cc:137:11
    #6 0x7f3560f6ec8e in yb::client::(anonymous namespace)::UpdateTransactionTraits::InvokeAsync(yb::tserver::TabletServerServiceProxy*, yb::tserver::UpdateTransactionRequestPB*, yb::tserver::UpdateTransactionResponsePB*, yb::rpc::RpcController*, std::__1::function<void ()>) $BUILD_ROOT/../../src/yb/client/transaction_rpc.cc:181:1
    #7 0x7f3560f6e0cf in yb::client::(anonymous namespace)::TransactionRpc<yb::client::(anonymous namespace)::UpdateTransactionTraits>::InvokeAsync(yb::tserver::TabletServerServiceProxy*, yb::rpc::RpcController*, std::__1::function<void ()>) $BUILD_ROOT/../../src/yb/client/transaction_rpc.cc:128:5
    #8 0x7f3560f6de08 in yb::client::(anonymous namespace)::TransactionRpcBase::SendRpcToTserver(int) $BUILD_ROOT/../../src/yb/client/transaction_rpc.cc:77:5
...
```

To avoid overflow even with large timeouts, the condition

```
CoarseMonoClock::now() > time1 + timeout_
```
is substituted with
```
CoarseMonoClock::now() - time1 > timeout_
```
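
The difference between the two forms can be seen in a small standalone sketch. It assumes `std::chrono::steady_clock` as a stand-in for `CoarseMonoClock`, and `Expired` is a hypothetical helper name, not a function from the codebase. Because `now - start` is bounded by the time that has actually elapsed, the subtraction stays in range even when the timeout is the largest representable duration, whereas `start + timeout` would overflow.

```cpp
#include <cassert>
#include <chrono>

// Stand-in for yb::CoarseMonoClock (assumption: any monotonic clock
// behaves the same way for this comparison).
using Clock = std::chrono::steady_clock;
using TimePoint = Clock::time_point;
using Duration = Clock::duration;

// Hypothetical helper: true once more than `timeout` has elapsed since
// `start`. The left-hand side is an elapsed time, so it cannot overflow
// the way `start + timeout` does when `timeout` is close to Duration::max().
inline bool Expired(TimePoint now, TimePoint start, Duration timeout) {
  assert(now >= start);  // a monotonic clock never runs backwards
  return now - start > timeout;
}
```

With this form, passing `Duration::max()` as the timeout simply means "never expires", with no signed overflow along the way.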

Test Plan:
Run the existing test:

```
./yb_build.sh --clang7 asan --java-test org.yb.pgsql.TestSecureClusterLocalTServerHostName
```

Reviewers: alex, mihnea, sergei

Reviewed By: sergei

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D12182
iSignal pushed a commit that referenced this pull request Jul 23, 2021
Summary:
Some time ago we had optimizations disabled for the debug build type, but they were enabled during the fix of yugabyte#1291: yugabyte@f710367 ( https://phabricator.dev.yugabyte.com/D6660 ). We no longer have the `retryable_rpc_single_call_timeout_ms` flag, and optimizations in debug builds make it harder to investigate issues because stack frames and variables are optimized out. So we can disable these optimizations again to make debugging easier.

Before (note <optimized out> values that are not available for debugging):
```
(gdb) bt
#0  0x00007f97990bfa6b in raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x00007f97a45268b9 in AddHash (num_probes=6, total_bits=523776, num_lines=<optimized out>, data=0x2bdc000 "", h=4266458700) at ../../src/yb/rocksdb/util/bloom.cc:66
#2  rocksdb::(anonymous namespace)::FixedSizeFilterBitsBuilder::AddKey (this=<optimized out>, key=...) at ../../src/yb/rocksdb/util/bloom.cc:463
#3  0x00007f97a44f0818 in rocksdb::FixedSizeFilterBlockBuilder::AddKey (this=this@entry=0x1c87a40, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:97
#4  0x00007f97a44f08b0 in rocksdb::FixedSizeFilterBlockBuilder::Add (this=0x1c87a40, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:91
#5  0x00007f97a44cf295 in rocksdb::BlockBasedTableBuilder::Add (this=0x1cd1c00, key=..., value=...) at ../../src/yb/rocksdb/table/block_based_table_builder.cc:468
#6  0x00007f97a439b9a4 in rocksdb::BuildTable (dbname=..., env=0x7f97a48a2c00 <rocksdb::Env::Default()::default_env>, ioptions=..., env_options=..., table_cache=0x1ccf740, iter=0x7f97969858f8, meta=0x7f97969864a0, internal_comparator=
    std::shared_ptr<const rocksdb::InternalKeyComparator> (use count 3, weak count 0) = {...}, int_tbl_prop_collector_factories=std::vector of length 1, capacity 1 = {...}, column_family_id=0,
    snapshots=std::vector of length 0, capacity 0, earliest_write_conflict_snapshot=72057594037927935, compression=rocksdb::kSnappyCompression, compression_opts=..., paranoid_file_checks=false, internal_stats=0x1d40200,
    boundary_values_extractor=0x1cd03f0, io_priority=rocksdb::Env::IO_HIGH, table_properties=0x7f9796987000) at ../../src/yb/rocksdb/db/builder.cc:160
#7  0x00007f97a4444ced in rocksdb::FlushJob::WriteLevel0Table (this=this@entry=0x7f9796986f40, mems=..., edit=0x1d44268, meta=meta@entry=0x7f97969864a0) at ../../src/yb/rocksdb/db/flush_job.cc:290
#8  0x00007f97a444669c in rocksdb::FlushJob::Run (this=this@entry=0x7f9796986f40, file_meta=file_meta@entry=0x7f9796986d00) at ../../src/yb/rocksdb/db/flush_job.cc:191
#9  0x00007f97a43fb5ba in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x1d24000, cfd=cfd@entry=0x1a7b000, mutable_cf_options=..., made_progress=made_progress@entry=0x7f9796987f47,
    job_context=job_context@entry=0x7f9796987d70, log_buffer=0x7f9796987480) at ../../src/yb/rocksdb/db/db_impl.cc:1873
#10 0x00007f97a43fc505 in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x1d24000, made_progress=made_progress@entry=0x7f9796987f47, job_context=job_context@entry=0x7f9796987d70, log_buffer=log_buffer@entry=0x7f9796987480,
    cfd=0x1a7b000, cfd@entry=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3202
#11 0x00007f97a4406cb3 in rocksdb::DBImpl::BackgroundCallFlush (this=this@entry=0x1d24000, cfd=cfd@entry=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3276
#12 0x00007f97a4406f6d in rocksdb::DBImpl::BGWorkFlush (db=db@entry=0x1d24000) at ../../src/yb/rocksdb/db/db_impl.cc:3132
#13 0x00007f97a4540875 in rocksdb::ThreadPool::BGThread (this=0x1adeb60, thread_id=0) at ../../src/yb/rocksdb/util/thread_posix.cc:126
#14 0x00007f97a4540899 in operator() (__closure=<optimized out>) at ../../src/yb/rocksdb/util/thread_posix.cc:165
#15 std::_Function_handler<void(), rocksdb::ThreadPool::StartBGThreads()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:1871
#16 0x00007f979d804626 in operator() (this=0x1cbdc78) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#17 yb::Thread::SuperviseThread (arg=0x1cbdc20) at ../../src/yb/util/thread.cc:771
#18 0x00007f97990b7694 in start_thread (arg=0x7f9796990700) at pthread_create.c:333
#19 0x00007f9798df941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

After:
```
#0  0x00007f1a9109ba6b in raise (sig=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x00007f1a9edd7343 in rocksdb::(anonymous namespace)::AddHash (h=4266458700, data=0x2d58000 "", num_lines=1023, total_bits=523776, num_probes=6) at ../../src/yb/rocksdb/util/bloom.cc:66
#2  0x00007f1a9edd87c8 in rocksdb::(anonymous namespace)::FixedSizeFilterBitsBuilder::AddKey (this=0x2d30040, key=...) at ../../src/yb/rocksdb/util/bloom.cc:463
#3  0x00007f1a9ed8e757 in rocksdb::FixedSizeFilterBlockBuilder::AddKey (this=0x1ef22d0, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:97
#4  0x00007f1a9ed8e6eb in rocksdb::FixedSizeFilterBlockBuilder::Add (this=0x1ef22d0, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:91
#5  0x00007f1a9ed5b2ea in rocksdb::BlockBasedTableBuilder::Add (this=0x1e4dc00, key=..., value=...) at ../../src/yb/rocksdb/table/block_based_table_builder.cc:468
#6  0x00007f1a9eb80bd4 in rocksdb::BuildTable (dbname="/tmp/yb_tests__2020-12-14T18_15_29__23449.18214.17918/mytestdb-814110369", env=0x7f1a9f35fe40 <rocksdb::Env::Default()::default_env>, ioptions=..., env_options=...,
    table_cache=0x1e4b740, iter=0x7f1a8e961858, meta=0x7f1a8e962480, internal_comparator=std::shared_ptr<const rocksdb::InternalKeyComparator> (use count 3, weak count 0) = {...},
    int_tbl_prop_collector_factories=std::vector of length 1, capacity 1 = {...}, column_family_id=0, snapshots=std::vector of length 0, capacity 0, earliest_write_conflict_snapshot=72057594037927935,
    compression=rocksdb::kSnappyCompression, compression_opts=..., paranoid_file_checks=false, internal_stats=0x1ebc200, boundary_values_extractor=0x1e4c3f0, io_priority=rocksdb::Env::IO_HIGH, table_properties=0x7f1a8e962f70)
    at ../../src/yb/rocksdb/db/builder.cc:160
#7  0x00007f1a9ec8e56b in rocksdb::FlushJob::WriteLevel0Table (this=0x7f1a8e962eb0, mems=..., edit=0x1ec0268, meta=0x7f1a8e962480) at ../../src/yb/rocksdb/db/flush_job.cc:290
#8  0x00007f1a9ec8d767 in rocksdb::FlushJob::Run (this=0x7f1a8e962eb0, file_meta=0x7f1a8e962c70) at ../../src/yb/rocksdb/db/flush_job.cc:191
#9  0x00007f1a9ec10c56 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=0x1ea0000, cfd=0x1bf7000, mutable_cf_options=..., made_progress=0x7f1a8e9640b7, job_context=0x7f1a8e963ee0, log_buffer=0x7f1a8e9635f0)
    at ../../src/yb/rocksdb/db/db_impl.cc:1873
#10 0x00007f1a9ec18bb6 in rocksdb::DBImpl::BackgroundFlush (this=0x1ea0000, made_progress=0x7f1a8e9640b7, job_context=0x7f1a8e963ee0, log_buffer=0x7f1a8e9635f0, cfd=0x1bf7000) at ../../src/yb/rocksdb/db/db_impl.cc:3202
#11 0x00007f1a9ec1914d in rocksdb::DBImpl::BackgroundCallFlush (this=0x1ea0000, cfd=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3276
#12 0x00007f1a9ec182fa in rocksdb::DBImpl::BGWorkFlush (db=0x1ea0000) at ../../src/yb/rocksdb/db/db_impl.cc:3132
#13 0x00007f1a9ee02747 in rocksdb::ThreadPool::BGThread (this=0x1c5ab60, thread_id=0) at ../../src/yb/rocksdb/util/thread_posix.cc:126
#14 0x00007f1a9ee028c6 in rocksdb::ThreadPool::<lambda()>::operator()(void) const (__closure=0x1e39c78) at ../../src/yb/rocksdb/util/thread_posix.cc:165
#15 0x00007f1a9ee03140 in std::_Function_handler<void(), rocksdb::ThreadPool::StartBGThreads()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:1871
#16 0x00007f1aa1efe732 in std::function<void ()>::operator()() const (this=0x1e39c78) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#17 0x00007f1a95e6bcf9 in yb::Thread::SuperviseThread (arg=0x1e39c20) at ../../src/yb/util/thread.cc:771
#18 0x00007f1a91093694 in start_thread (arg=0x7f1a8e96c700) at pthread_create.c:333
#19 0x00007f1a90dd541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

Test Plan:
```
#!/usr/bin/env bash

set -euo pipefail

for i in {1..50}
do
  echo "Iteration: $i"
  rm -rf build/debug-gcc-dynamic-ninja/share/initial_sys_catalog_snapshot
  ybd --sj packaged
  ./bin/yb-ctl destroy
  ./bin/yb-ctl start
  ./bin/ysqlsh -c "SELECT 1"
done
./bin/yb-ctl destroy
```

Reviewers: bogdan, sergei, dmitry, mbautin

Reviewed By: mbautin

Subscribers: eng

Differential Revision: https://phabricator.dev.yugabyte.com/D10121
iSignal pushed a commit that referenced this pull request Sep 13, 2021
Summary:
Data Race Issue on Remote Bootstrap with rocksdb_dir:

```
[ts-4] WARNING: ThreadSanitizer: data race (pid=8370)
[ts-4]   Write of size 8 at 0x7b50002606a8 by thread T46 (mutexes: write M257685187220342284):
[ts-4]     #0 std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::assign(char const*, unsigned long) <null> (libc++.so.1+0xd5fa5)
[ts-4]     #1 std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::operator=(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) <null> (libc++.so.1+0xd5e8a)
[ts-4]     #2 yb::tablet::KvStoreInfo::LoadFromPB(yb::tablet::KvStoreInfoPB const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tablet/tablet_metadata.cc:201:15 (libtablet.so+0x36795c)
[ts-4]     #3 yb::tablet::RaftGroupMetadata::LoadFromSuperBlock(yb::tablet::RaftGroupReplicaSuperBlockPB const&) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tablet/tablet_metadata.cc:525:5 (libtablet.so+0x36b36b)
[ts-4]     #4 yb::tablet::RaftGroupMetadata::ReplaceSuperBlock(yb::tablet::RaftGroupReplicaSuperBlockPB const&) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tablet/tablet_metadata.cc:586:3 (libtablet.so+0x36c078)
[ts-4]     #5 yb::tserver::RemoteBootstrapClient::Finish() /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/remote_bootstrap_client.cc:421:3 (libtserver.so+0x1d1699)
[ts-4]     #6 yb::tserver::TSTabletManager::StartRemoteBootstrap(yb::consensus::StartRemoteBootstrapRequestPB const&) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/ts_tablet_manager.cc:1099:3 (libtserver.so+0x267088)
[ts-4]     #7 yb::tserver::ConsensusServiceImpl::StartRemoteBootstrap(yb::consensus::StartRemoteBootstrapRequestPB const*, yb::consensus::StartRemoteBootstrapResponsePB*, yb::rpc::RpcContext) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/tablet_service.cc:2767:31 (libtserver.so+0x21844b)
```

```
[ts-4]   Previous read of size 8 at 0x7b50002606a8 by thread T21:
[ts-4]     #0 std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__get_long_size() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20210813185027-9a29e26965-centos7-x86_64-clang7/installed/tsan/libcxx/include/c++/v1/string:1468:34 (libtablet.so+0x36d6a2)
[ts-4]     #1 std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::size() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20210813185027-9a29e26965-centos7-x86_64-clang7/installed/tsan/libcxx/include/c++/v1/string:941 (libtablet.so+0x36d6a2)
[ts-4]     #2 std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::empty() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20210813185027-9a29e26965-centos7-x86_64-clang7/installed/tsan/libcxx/include/c++/v1/string:957 (libtablet.so+0x36d6a2)
[ts-4]     #3 yb::tablet::RaftGroupMetadata::data_root_dir() const /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tablet/tablet_metadata.cc:773 (libtablet.so+0x36d6a2)
[ts-4]     #4 yb::tserver::TSTabletManager::CreateReportedTabletPB(std::__1::shared_ptr<yb::tablet::TabletPeer> const&, yb::master::ReportedTabletPB*) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/ts_tablet_manager.cc:1840:68 (libtserver.so+0x26c6a2)
[ts-4]     #5 yb::tserver::TSTabletManager::GenerateTabletReport(yb::master::TabletReportPB*, bool) /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/ts_tablet_manager.cc:1925:5 (libtserver.so+0x26d072)
[ts-4]     #6 yb::tserver::Heartbeater::Thread::TryHeartbeat() /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/heartbeater.cc:371:32 (libtserver.so+0x1a2331)
[ts-4]     #7 yb::tserver::Heartbeater::Thread::DoHeartbeat() /nfusr/centos-gcp-cloud/jenkins-worker-r2ttyq/jenkins/jenkins-github-yugabyte-db-centos-master-clang7-tsan-339/build/tsan-clang7-dynamic-ninja/../../src/yb/tserver/heartbeater.cc:530:19 (libtserver.so+0x1a3678)
```

Test Plan:
ybd tsan --gtest_filter LoadBalancerMultiTableTest.GlobalLeaderBalancing
ybd tsan --gtest_filter LoadBalancerMultiTableTest.GlobalLoadBalancing

Reviewers: sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D12906
iSignal pushed a commit that referenced this pull request Dec 9, 2022
…e image

Summary:
We observed a crash while running TPCC workload with CDCSDK enabled.
The stack trace is:

```
(gdb) bt
#0  0x0000557f25b11910 in yb::DatumMessagePB::MergeFrom(yb::DatumMessagePB const&) ()
#1  0x0000557f258a41ef in yb::cdc::PopulateBeforeImage(std::__1::shared_ptr<yb::tablet::TabletPeer> const&, yb::ReadHybridTime const&, yb::cdc::RowMessage*, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::docdb::SubDocKey const&, yb::Schema const&, unsigned int) ()
#2  0x0000557f258a7304 in yb::cdc::PopulateCDCSDKIntentRecord(yb::OpId const&, yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, std::__1::vector<yb::docdb::IntentKeyValueForCDC, std::__1::allocator<yb::docdb::IntentKeyValueForCDC> > const&, yb::cdc::StreamMetadata const&, std::__1::shared_ptr<yb::tablet::TabletPeer> const&, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::cdc::GetChangesResponsePB*, yb::ScopedTrackedConsumption*, unsigned int*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, yb::Schema*, unsigned int, unsigned long const&) ()
#3  0x0000557f258aaa27 in yb::cdc::ProcessIntents(yb::OpId const&, yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, yb::cdc::StreamMetadata const&, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::cdc::GetChangesResponsePB*, yb::ScopedTrackedConsumption*, yb::cdc::CDCSDKCheckpointPB*, std::__1::shared_ptr<yb::tablet::TabletPeer> const&, std::__1::vector<yb::docdb::IntentKeyValueForCDC, std::__1::allocator<yb::docdb::IntentKeyValueForCDC> >*, yb::docdb::ApplyTransactionState*, yb::client::YBClient*, std::__1::shared_ptr<yb::Schema>*, unsigned int*, unsigned long const&) ()
#4  0x0000557f258b00c1 in yb::cdc::GetChangesForCDCSDK(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, yb::cdc::CDCSDKCheckpointPB const&, yb::cdc::StreamMetadata const&, std::__1::shared_ptr<yb::tablet::TabletPeer> const&, std::__1::shared_ptr<yb::MemTracker> const&, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::client::YBClient*, yb::consensus::ReplicateMsgsHolder*, yb::cdc::GetChangesResponsePB*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::shared_ptr<yb::Schema>*, unsigned int*, yb::OpId*, long*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) ()
#5  0x0000557f2586c448 in yb::cdc::CDCServiceImpl::GetChanges(yb::cdc::GetChangesRequestPB const*, yb::cdc::GetChangesResponsePB*, yb::rpc::RpcContext) ()
#6  0x0000557f25908246 in std::__1::__function::__func<yb::cdc::CDCServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_3, std::__1::allocator<yb::cdc::CDCServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_3>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#7  0x0000557f2590a6af in yb::cdc::CDCServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#8  0x0000557f26227a1e in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#9  0x0000557f2616db2f in yb::rpc::InboundCall::InboundCallTask::Run() ()
#10 0x0000557f26236583 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#11 0x0000557f268698cf in yb::Thread::SuperviseThread(void*) ()
#12 0x00007fa6fce89694 in ?? ()
#13 0x0000000000000000 in ?? ()
```

The problem is in the `PopulateBeforeImage` method.
When a column is dropped, the row has no data for that column, so nothing is added for it to the "old_tuple" member of RowMessage. As a result, the size of "old_tuple" does not match the number of columns in the schema.
This means the line "row_message->old_tuple(static_cast<int>(index))" could read out of bounds.
Instead, we now keep track of the columns found in the row.
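
The idea of tracking found columns can be sketched as follows. This is a minimal illustration, not the actual `PopulateBeforeImage` code: `ColumnValue` and `AlignOldTuple` are hypothetical names, and the sketch assumes the old tuple keeps schema order for the columns it does contain.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of "old_tuple"; not the real yb::cdc type.
struct ColumnValue {
  std::string column_name;
  std::string value;
};

// Returns one slot per schema column. A column with no entry in old_tuple
// (for example, a dropped column) gets std::nullopt instead of being read
// at a schema index that old_tuple does not have.
// Assumption: old_tuple preserves schema order for the columns it contains.
std::vector<std::optional<std::string>> AlignOldTuple(
    const std::vector<std::string>& schema_columns,
    const std::vector<ColumnValue>& old_tuple) {
  std::vector<std::optional<std::string>> aligned;
  aligned.reserve(schema_columns.size());
  std::size_t found_columns = 0;  // Next unread position in old_tuple.
  for (const auto& name : schema_columns) {
    if (found_columns < old_tuple.size() &&
        old_tuple[found_columns].column_name == name) {
      aligned.emplace_back(old_tuple[found_columns].value);
      ++found_columns;
    } else {
      aligned.emplace_back(std::nullopt);
    }
  }
  assert(aligned.size() == schema_columns.size());
  return aligned;
}
```

The separate `found_columns` counter is the point of the sketch: the schema index and the old-tuple index advance independently, so a shorter old tuple is never indexed past its end.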

Test Plan: Running existing ctests

Reviewers: srangavajjula, sdash, skumar

Reviewed By: sdash, skumar

Differential Revision: https://phabricator.dev.yugabyte.com/D21338
iSignal pushed a commit that referenced this pull request Feb 1, 2024
…wuid function

Summary:
There are several unit tests that suffer from a tsan data race warning with the following stack:

```
WARNING: ThreadSanitizer: data race (pid=38656)
  Read of size 8 at 0x7f6f2a44b038 by thread T21:
    #0 memcpy /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115:5 (pg_ddl_concurrency-test+0x9e197)
    #1 <null> <null> (libnss_sss.so.2+0x72ef) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b)
    #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9)
    #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7)
    #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7)
    #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe)
    #6 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:647:20 (libpq.so.5+0x2c279)
    #7 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/libpq_utils.cc:278:24 (libpq_utils.so+0x11d6b)
...

  Previous write of size 8 at 0x7f6f2a44b038 by thread T20 (mutexes: write M0):
    #0 mmap64 /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:7485:3 (pg_ddl_concurrency-test+0xda204)
    #1 <null> <null> (libnss_sss.so.2+0x7169) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b)
    #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9)
    #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7)
    #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7)
    #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe)
    #6 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:647:20 (libpq.so.5+0x2c279)
    #7 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/libpq_utils.cc:278:24 (libpq_utils.so+0x11d6b)
...

  Location is global '??' at 0x7f6f2a44b000 (passwd+0x38)

  Mutex M0 (0x7f6f2af29380) created at:
    #0 pthread_mutex_lock /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1339:3 (pg_ddl_concurrency-test+0xa464b)
    #1 <null> <null> (libnss_sss.so.2+0x70d6) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b)
    #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9)
    #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7)
    #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7)
    #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe)
...
```

All failing tests have a common feature: they all create connections to postgres from multiple threads at the same time.
When creating a new connection, the `libpq` library calls the standard `getpwuid_r` function internally. This function is thread safe, so a tsan warning is not expected there.

The solution is to suppress the warning in the `getpwuid_r` function.
**Note:** because the `getpwuid_r` function name does not appear in the tsan warning stack, the warning is suppressed for the caller function `pqGetpwuid` instead.
Jira: DB-9523
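
For context, the pattern that tsan flags can be sketched as below: several threads calling the reentrant `getpwuid_r`, each with its own `passwd` struct and buffer. This is an illustrative sketch only, not the libpq code; `LookupHomeDirectory` and `LookupFromThreads` are hypothetical names, and the 16 KiB buffer size is an assumption.

```cpp
#include <pwd.h>
#include <unistd.h>

#include <cassert>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Looks up the current user's home directory with the reentrant getpwuid_r.
// Every call uses its own passwd struct and buffer, so callers share no
// caller-visible state.
std::optional<std::string> LookupHomeDirectory() {
  struct passwd entry;
  struct passwd* result = nullptr;
  std::vector<char> buffer(16384);  // Assumed large enough for typical entries.
  const int rc =
      getpwuid_r(getuid(), &entry, buffer.data(), buffer.size(), &result);
  if (rc != 0 || result == nullptr || result->pw_dir == nullptr) {
    return std::nullopt;  // Lookup failed or no matching entry.
  }
  return std::string(result->pw_dir);
}

// Runs the lookup from several threads at once, similar to tests that open
// several connections at once, and collects one result per thread.
std::vector<std::optional<std::string>> LookupFromThreads(int num_threads) {
  assert(num_threads >= 0);
  std::vector<std::optional<std::string>> results(num_threads);
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    // Each thread writes only to its own slot of `results`.
    threads.emplace_back([&results, i] { results[i] = LookupHomeDirectory(); });
  }
  for (auto& t : threads) t.join();
  return results;
}
```

Any sharing happens inside the NSS module that serves the lookup, which is why the reported frames are in `libnss_sss.so.2` and the nearest named frame is `pqGetpwuid`.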

Test Plan: Jenkins

Reviewers: sergei, bogdan

Reviewed By: sergei

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31646
iSignal pushed a commit that referenced this pull request Sep 3, 2024
…ugabyte#23065)

* initial commit for logical replication docs

* title changes

* changes to view table

* fixed line break

* fixed line break

* added content for delete and update

* added more content

* replaced hyperlink todos with reminders

* added snapshot metrics

* added more content

* added more config properties to docs

* added more config properties to docs

* added more config properties to docs

* replaced postgresql instances with yugabytedb

* added properties

* added complete properties

* changed postgresql to yugabytedb

* added example for all record types

* fixed highlighting of table header

* added type representations

* added type representations

* full content in now;

* full content in now;

* changed postgres references appropriately

* added a missing keyword

* changed name

* self review comments

* self review comments

* added section for logical replication

* added section for logical replication

* modified content for monitor page

* added content for monitoring

* rebased to master;

* CDC logical replication overview (#3)


Co-authored-by: Vaibhav Kushwaha <34186745+vaibhav-yb@users.noreply.github.com>

* advanced-topic (#5)


Co-authored-by: Vaibhav Kushwaha <34186745+vaibhav-yb@users.noreply.github.com>

* removed references to incremental and ad-hoc snapshots

* replaced index page with an empty one

* addressed review comments

* added getting started section

* added section for get started

* self review comments

* self review comments

* group review comments

* added hstore and domain type docs

* Advance configurations for CDC using logical replication (#2)

* Fix overview section (#7)

* Monitor section (#4)


Co-authored-by: Vaibhav Kushwaha <34186745+vaibhav-yb@users.noreply.github.com>

* Initial Snapshot content (#6)

* Add getting started (#1)

* Fix for broken note (#9)

* Fix the issue yaml parsing

Summary:
Fixes the YAML parsing issue. We changed the formatting of YAML lists; this diff fixes the
usage accordingly.

Test Plan:
Prepared alma9 node using ynp.
Verified universe creation.

Reviewers: vbansal, asharma

Reviewed By: asharma

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36711

* [PLAT-14534]Add regex match for GCP Instance template

Summary:
Added regex match for gcp instance template.
Regex taken from gcp documentation [[https://cloud.google.com/compute/docs/reference/rest/v1/instanceTemplates | here]].

Test Plan: Tested manually that validation fails with invalid characters.

Reviewers: #yba-api-review!, svarshney

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36543

* update diagram (yugabyte#23245)

* [/PLAT-14708] Fix JSON field name in TaskInfo query

Summary: This was missed when task params were moved out from details field.

Test Plan: Trivial - existing tests should succeed.

Reviewers: vbansal, cwang

Reviewed By: vbansal

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36705

* [yugabyte#23173] DocDB: Allow large bytes to be passed to RateLimiter

Summary:
RateLimiter has a debug assert that you cannot `Request` more than `GetSingleBurstBytes`. In release mode we do not perform this check and any call gets stuck forever. This change allows large bytes to be requested on RateLimiter. It does so by breaking requests larger than `GetSingleBurstBytes` into multiple smaller requests.

This change is a temporary fix to allow xCluster to operate without any issues. RocksDB RateLimiter has received multiple enhancements over the years that would help avoid this and other starvation issues. Ex: facebook/rocksdb@cb2476a. We should consider pulling in those changes.
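
The splitting of an oversized request can be sketched as follows. This is a minimal illustration under assumptions, not the RateLimiter change itself: `SplitIntoBursts` is a hypothetical helper, and `single_burst_bytes` stands in for the value returned by `GetSingleBurstBytes`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: splits `bytes` into pieces no larger than
// `single_burst_bytes`, so that each piece is a size the limiter accepts.
std::vector<int64_t> SplitIntoBursts(int64_t bytes, int64_t single_burst_bytes) {
  assert(single_burst_bytes > 0);
  std::vector<int64_t> pieces;
  while (bytes > 0) {
    const int64_t piece = std::min(bytes, single_burst_bytes);
    pieces.push_back(piece);
    bytes -= piece;
  }
  return pieces;
}
```

A caller would then issue one `Request` per piece, so no single call exceeds the burst size.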

Fixes yugabyte#23173
Jira: DB-12112

Test Plan: RateLimiterTest.LargeRequests

Reviewers: slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D36703

* [yugabyte#23179] CDCSDK: Support data types with dynamically allotted oids in CDC

Summary:
This diff adds support for data types with dynamically allotted oids in CDC (for example, hstore, enum array, etc.). Such types contain an invalid pg_type_oid for the corresponding columns in the docdb schema.

In the current implementation, in `ybc_pggate`, while decoding the cdc records we look at the `type_map_` to obtain the YBCPgTypeEntity, which is then used for decoding. However, the `type_map_` does not contain any entries for the data types with dynamically allotted oids, which causes a segmentation fault. To prevent such crashes, CDC prevents the addition of tables with such columns to the stream.

This diff removes the filtering logic and adds the tables to the stream even if they have a column of such a type. A function pointer will now be passed to `YBCPgGetCDCConsistentChanges`; it takes the attribute number and the table_oid and returns the appropriate type entity by querying the `pg_type` catalog table. While decoding, if a column with an invalid pg_type_oid is encountered, the passed function is invoked to obtain the type entity for decoding.
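
The lookup with a callback fallback can be sketched as follows. This is an illustration under assumptions, not the `ybc_pggate` code: `TypeEntity`, `TypeEntityProvider`, and `FindTypeEntity` are hypothetical names, and the sketch treats oid 0 (PostgreSQL's `InvalidOid`) as the invalid value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for YBCPgTypeEntity.
struct TypeEntity {
  uint32_t type_oid;
  std::string name;
};

constexpr uint32_t kInvalidOid = 0;  // Assumption: mirrors PostgreSQL's InvalidOid.

// Callback that resolves a type entity from the attribute number and table oid,
// for example by consulting a catalog. Hypothetical signature.
using TypeEntityProvider = const TypeEntity* (*)(int attr_num, uint32_t table_oid);

// Uses the static map for columns with a valid oid and the callback for
// columns whose oid is invalid (dynamically allotted types).
const TypeEntity* FindTypeEntity(
    const std::unordered_map<uint32_t, TypeEntity>& type_map,
    uint32_t pg_type_oid, int attr_num, uint32_t table_oid,
    TypeEntityProvider provider) {
  if (pg_type_oid != kInvalidOid) {
    auto it = type_map.find(pg_type_oid);
    return it != type_map.end() ? &it->second : nullptr;
  }
  assert(provider != nullptr);
  return provider(attr_num, table_oid);
}
```

Returning a null pointer for an unknown valid oid leaves the error handling to the caller rather than dereferencing a missing map entry.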

**Upgrade/Rollback safety:**
This diff adds a field `optional int32 attr_num` to DatumMessagePB. These changes are protected by the autoflag `ysql_yb_enable_replication_slot_consumption` which already exists but has not yet been released.
Jira: DB-12118

Test Plan:
Jenkins: urgent

All the existing cdc tests

./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#replicationConnectionConsumptionAllDataTypesWithYbOutput'

Reviewers: skumar, stiwary, asrinivasan, dmitry

Reviewed By: stiwary, dmitry

Subscribers: steve.varnau, skarri, yql, ybase, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36689

* [PLAT-14710] Do not return apiToken in response to getSessionInfo

Summary:
**Context**
The GET /session_info YBA API returns:
{
    "authToken": "…",
    "apiToken": "….",
    "apiTokenVersion": "….",
    "customerUUID": "uuid1",
    "userUUID": "useruuid1"
}

The apiToken and apiTokenVersion are supposed to be the last generated token that is valid. We had the following sequence of changes to this API.

https://yugabyte.atlassian.net/browse/PLAT-8028 - Do not store YBA token in YBA.

After the above fix, YBA does not store the apiToken anymore. So it cannot return it as part of the /session_info. The change for this ticket returned the hashed apiToken instead.

https://yugabyte.atlassian.net/browse/PLAT-14672 - getSessionInfo should generate and return api key in response

Since the hashed apiToken value is not useful to any client, and it broke YBM create cluster (https://yugabyte.atlassian.net/browse/CLOUDGA-22117), the first change for this ticket returned a new apiToken instead.

Note that GET /session_info is meant to get customer and user information for the currently authenticated session. This is useful for automation starting off an authenticated session from an existing/cached API token. It is not necessary for the /session_info API to return the authToken and apiToken. The client already has one of authToken or apiToken with which it invoked /session_info API. In fact generating a new apiToken whenever /session_info is called will invalidate the previous apiToken which would not be expected by the client. There is a different API /api_token to regenerate the apiToken explicitly.

**Fix in this change**
So the right behaviour is for /session_info to stop sending the apiToken in the response. In fact, the current behaviour of generating a new apiToken every time will break a client (for example, the node-agent usage of /session_info here: https://github.com/yugabyte/yugabyte-db/blob/4ca56cfe27d1cae64e0e61a1bde22406e003ec04/managed/node-agent/app/server/handler.go#L19).

**Client impact of not returning apiToken in response of /session_info**

This should not impact any normal client that was using /session_info only to get the user uuid and customer uuid.

However, there might be a few clients (like YBM for example) that invoked /session_info to get the last generated apiToken from YBA. Unfortunately, this was a misuse of this API. YBA generates the apiToken in response to a few entry point APIs like /register, /api_login and /api_token. The apiToken is long lived. YBA could choose to expire these apiTokens after a fixed amount of (long) time, but for now there is no expiration. The clients are expected to store the apiToken at their end and use the token to reestablish a session with YBA whenever needed. After establishing a new session, clients would call GET /session_info to get the user uuid and customer uuid. This is getting fixed in YBM with https://yugabyte.atlassian.net/browse/CLOUDGA-22117. So this PLAT change should be taken up by YBM only after CLOUDGA-22117 is fixed.

Test Plan:
* Manually verified that session_info does not return authToken
* Shubham verified that node-agent works with this fix. Thanks Shubham!

Reviewers: svarshney, dkumar, tbedi, #yba-api-review!

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36712

* [docs] updates to CVE table status column (yugabyte#23225)

* updates to status column

* review comment

* format

---------

Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>

* [docs] Fix load balance keyword in drivers page (yugabyte#23253)

[docs] Fix `load_balance` -> `load-balance` in jdbc driver
[docs] Fix `load_balance` -> `loadBalance` in nodejs driver

* fixed compilation

* fix link, format

* format, links

* links, format

* format

* format

* minor edit

* best practice (#8)

* moved sections

* moved pages

* added key concepts page

* added link to getting started

* Dynamic table doc changes (#11)

* icons

* added box for lead link

* revert ybclient change

* revert accidental change

* revert accidental change

* revert accidental change

* fix link block for getting started page

* format

* minor edit

* links, format

* format

* links

* format

* remove reminder references

* Modified output plugin docs (yugabyte#12)

* Naming edits

* format

* review comments

* diagram

* review comment

* fix links

* format

* format

* link

* review comments

* copy to stable

* link

---------

Co-authored-by: siddharth2411 <43139012+siddharth2411@users.noreply.github.com>
Co-authored-by: Shubham <svarshney@yugabyte.com>
Co-authored-by: asharma-yb <asharma@yugabyte.com>
Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
Co-authored-by: Naorem Khogendro Singh <nsingh@yugabyte.com>
Co-authored-by: Hari Krishna Sunder <hari90@users.noreply.github.com>
Co-authored-by: Sumukh-Phalgaonkar <sumukhphalgaonkar@gmail.com>
Co-authored-by: Subramanian Neelakantan <sneelakantan@yugabyte.com>
Co-authored-by: Aishwarya Chakravarthy <ashchakravarthy@gmail.com>
Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
Co-authored-by: ddorian <dorian.hoxha@gmail.com>
Co-authored-by: Sumukh-Phalgaonkar <61342752+Sumukh-Phalgaonkar@users.noreply.github.com>
iSignal pushed a commit that referenced this pull request Sep 3, 2024
…build

Summary:
The DDL atomicity stress tests failed more often on the pg15 branch with an error like:

```
WARNING: ThreadSanitizer: data race (pid=180911)
  Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
    #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
    #1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
    #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
    #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
    #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
    #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
    #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
    #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
    #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
    #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
    #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
    #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
    #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
    #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
    #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
    #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
    #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
    #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
    #18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
    #19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
    #20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
    #21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
    #22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
    #23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```

It appears that the function `yb::pgwrapper::LibPqTestBase::Connect` isn't
thread safe. I restructured the code to make the connections in a single thread
and then pass them to various concurrent threads for testing.
Jira: DB-2996
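
The restructuring described in the summary can be sketched as follows: open every connection on one thread, then give each worker thread its own already-open connection. This is a minimal illustration, not the test code; `Connection`, `Connect`, and `RunWorkers` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a database connection.
struct Connection {
  int id;
  int statements_run = 0;
  void Execute(const std::string& /*statement*/) { ++statements_run; }
};

// Hypothetical stand-in for a connect call that is not safe to run
// concurrently from several threads.
Connection Connect(int id) { return Connection{id}; }

std::vector<Connection> RunWorkers(int num_threads, int statements_per_thread) {
  assert(num_threads >= 0);
  // Step 1: open every connection on the calling thread, one after another.
  std::vector<Connection> connections;
  connections.reserve(num_threads);
  for (int i = 0; i < num_threads; ++i) {
    connections.push_back(Connect(i));
  }
  // Step 2: each worker thread uses only its own, already-open connection.
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&connections, i, statements_per_thread] {
      for (int s = 0; s < statements_per_thread; ++s) {
        connections[i].Execute("SELECT 1");
      }
    });
  }
  for (auto& t : threads) t.join();
  return connections;
}
```

Concurrency in the workload itself is preserved; only the connection setup is serialized.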

Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17

Verified that there are no more tsan errors.

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D37111
iSignal pushed a commit that referenced this pull request Sep 10, 2024
…ng the lock

Summary:
Call the callback in the ScopeExit block only, not while holding the lock.

Without this fix, a thread can deadlock by requesting a shared_lock on a mutex while already holding an exclusive lock on the same mutex.

This deadlock can be triggered if there are active read/write requests to a table (from more than one thread) right after the table had a tablet split.

If there is only one thread, it is unlikely to run into the deadlock, because the thread notices, as part of the callback, that the table's partition info is stale. Triggering the stack trace below likely requires a different thread to refresh the partition version before the main thread checks whether the table version is stale.

For example:
```
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2  0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3  0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4  0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5  0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6  0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7  0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8  0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9  0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()

#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>,  yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **

#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651
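
The fix described in the summary can be sketched as follows: the callback is deferred to a scope-exit guard that is declared before the lock, so it runs only after the exclusive lock has been released. This is a minimal illustration, not the MetaCache code; `ScopeExit` here is a hand-written guard and `TinyCache` is a hypothetical class.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Minimal scope-exit guard (hypothetical): runs `fn` when it is destroyed.
class ScopeExit {
 public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() { fn_(); }
  ScopeExit(const ScopeExit&) = delete;
  ScopeExit& operator=(const ScopeExit&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical miniature cache guarded by a reader/writer mutex.
class TinyCache {
 public:
  using Callback = std::function<void(int)>;

  // Read path: takes a shared lock.
  bool Lookup(const std::string& key, int* value) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    auto it = entries_.find(key);
    if (it == entries_.end()) return false;
    *value = it->second;
    return true;
  }

  // Write path: updates under an exclusive lock and notifies afterwards.
  void Update(const std::string& key, int value, const Callback& callback) {
    assert(callback != nullptr);
    // Declared before the lock, so it is destroyed after the lock is
    // released. The callback may therefore call Lookup() safely.
    ScopeExit notify([&callback, value] { callback(value); });
    std::lock_guard<std::shared_mutex> lock(mutex_);
    entries_[key] = value;
  }

 private:
  mutable std::shared_mutex mutex_;
  std::map<std::string, int> entries_;
};
```

Locals are destroyed in reverse order of declaration, so the lock guard is released first and the callback runs second. Invoking the callback while the exclusive lock is still held would let a re-entrant `Lookup` block on the shared lock of the same mutex, which is the situation in the stack trace above.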

Test Plan:
Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock

Reviewers: rthallam, hsunder, qhu, timur

Reviewed By: hsunder

Subscribers: svc_phabricator, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37706
iSignal pushed a commit that referenced this pull request Nov 6, 2024
Summary:
It is possible for tablet peer's `tablet_` to be null when a rocksdb flush finishes. We call `tablet_->MaxPersistentOpId()` after flush to clean up recently applied transaction state, and this causes a SIGSEGV:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::basic_string(this="", __str=<unavailable>) at string:898:9
    frame #1: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] yb::RWOperationCounter::resource_name(this=0x0000000000000378) const at operation_counter.h:95:12
    frame #2: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=0x0000000000000378, abort_status_holder=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.cc:190:62
    frame #3: 0x000055885b247ea6 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.h:140:9
    frame #4: 0x000055885b247e9f yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::tablet::Tablet::CreateScopedRWOperationBlockingRocksDbShutdownStart(this=0x0000000000000000, deadline=yb::CoarseTimePoint @ 0x00007f9455305d98) const at tablet.cc:3375:10
    frame #5: 0x000055885b247e90 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(this=0x0000000000000000, invalid_if_no_new_data=<unavailable>) const at tablet.cc:3540:32
    frame #6: 0x000055885b277f5e yb-tserver`yb::tablet::TabletPeer::MaxPersistentOpId(this=<unavailable>) const at tablet_peer.cc:946:23
    frame #7: 0x000055885b278e52 yb-tserver`non-virtual thunk to yb::tablet::TabletPeer::MaxPersistentOpId() const at tablet_peer.cc:0
    frame #8: 0x000055885b2dec44 yb-tserver`yb::tablet::TransactionParticipant::Impl::DoProcessRecentlyAppliedTransactions(this=0x0000153123151500, retryable_requests_flushed_op_id=<unavailable>, persist=<unavailable>) at transaction_participant.cc:2186:22
    frame #9: 0x000055885b2e0a8e yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions() [inlined] yb::tablet::TransactionParticipant::Impl::ProcessRecentlyAppliedTransactions(this=0x0000153123151500) at transaction_participant.cc:1440:27
    frame #10: 0x000055885b2e0a63 yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions(this=<unavailable>) at transaction_participant.cc:2629:17
    frame #11: 0x000055885b226093 yb-tserver`yb::tablet::Tablet::RocksDbListener::OnFlushCompleted(this=0x0000153110c2da58, (null)=<unavailable>, (null)=<unavailable>) at tablet.cc:503:34
    frame #12: 0x000055885af0e507 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) at db_impl.cc:2121:19
    frame #13: 0x000055885af0e275 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::FlushMemTableToOutputFile(this=0x0000153123150a80, cfd=0x000015317d651600, mutable_cf_options=0x00007f94553077d8, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048) at db_impl.cc:2008:3
    frame #14: 0x000055885af0d859 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::BackgroundFlush(this=0x0000153123150a80, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048, cfd=0x000015317d651600) at db_impl.cc:3399:10
    frame #15: 0x000055885af0d21f yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(this=0x0000153123150a80, cfd=<unavailable>) at db_impl.cc:3470:31
    frame #16: 0x000055885b024a53 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() at thread_posix.cc:133:5
    frame #17: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] rocksdb::ThreadPool::StartBGThreads(this=<unavailable>)::$_0::operator()() const at thread_posix.cc:172:5
    frame #18: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::$_0&>()()) std::__1::__invoke[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads()::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:340:25
    frame #19: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads(__args=<unavailable>)::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:415:5
    frame #20: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)[abi:ue170006]() at function.h:192:16
    frame #21: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)() at function.h:363:12
    frame #22: 0x000055885b9c1543 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000015313de3b380)[abi:ue170006]() const at function.h:517:16
    frame #23: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000015313de3b380)() const at function.h:1168:12
    frame #24: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(arg=0x000015313de3b320) at thread.cc:866:3
    frame #25: 0x00007f94994d81ca libpthread.so.0`start_thread + 234
    frame #26: 0x00007f9499729e73 libc.so.6`__clone + 67
```

This diff adds a null check and returns `OpId::Min()` (i.e. don't clean anything up) if `tablet_` is null and we cannot call `MaxPersistentOpId`.
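The null check described above can be sketched as follows. This is a minimal illustration, not the actual diff: `OpId`, `Tablet`, and `TabletPeer` here are simplified, hypothetical stand-ins for the real YugabyteDB classes, and only the guard itself mirrors the summary.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for yb::OpId; only the shape matters here.
struct OpId {
  int64_t term;
  int64_t index;
  static OpId Min() { return {0, 0}; }
  bool operator==(const OpId& other) const {
    return term == other.term && index == other.index;
  }
};

// Hypothetical stand-in for the tablet.
struct Tablet {
  OpId max_persistent{0, 0};
  OpId MaxPersistentOpId() const { return max_persistent; }
};

// Hypothetical stand-in for the tablet peer that owns the tablet pointer.
struct TabletPeer {
  std::shared_ptr<Tablet> tablet_;  // may be null

  OpId MaxPersistentOpId() const {
    // Copy the pointer once so the check and the call use the same value.
    auto tablet = tablet_;
    if (!tablet) {
      // Nothing can be queried, so report the minimum op id,
      // which tells the caller not to clean anything up.
      return OpId::Min();
    }
    return tablet->MaxPersistentOpId();
  }
};
```

With a null `tablet_`, the call returns `OpId::Min()` instead of dereferencing a null pointer as in frame #5 of the trace.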
Jira: DB-12915

Test Plan: Jenkins

Reviewers: sergei, rthallam

Reviewed By: sergei, rthallam

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38323
iSignal pushed a commit that referenced this pull request Nov 6, 2024
Summary:
### Issue

The test ClockSynchronizationTest.TestClockSkewError fails under tsan with a data race:

```
WARNING: ThreadSanitizer: data race (pid=226462)
  Read of size 8 at 0x7b4000000bf0 by thread T82:
    #0 boost::intrusive_ptr<yb::Status::State>::get() const ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 (libyb_util.so+0x3c5994)
    #1 bool boost::operator==<yb::Status::State>(boost::intrusive_ptr<yb::Status::State> const&, std::nullptr_t) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:263:14 (libyb_util.so+0x3c5994)
    #2 yb::Status::ok() const ${YB_SRC_ROOT}/src/yb/util/status.h:120:51 (libyb_util.so+0x3c5994)
    #3 yb::MockClock::Now() ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:141:3 (libyb_util.so+0x3c5994)
    #4 yb::server::HybridClock::NowWithError(yb::HybridTime*, unsigned long*) ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:155:22 (libserver_common.so+0xa5e12)
    #5 yb::server::HybridClock::NowRange() ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:144:3 (libserver_common.so+0xa5ceb)
    #6 yb::ClockBase::Now() ${YB_SRC_ROOT}/src/yb/common/clock.h:26:29 (libtserver.so+0x23a77a)
    #7 yb::tserver::Heartbeater::Thread::TryHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:437:41 (libtserver.so+0x23a77a)
    #8 yb::tserver::Heartbeater::Thread::DoHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:650:19 (libtserver.so+0x23d05f)
    #9 yb::tserver::Heartbeater::Thread::RunThread() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:697:16 (libtserver.so+0x23d74d)
    #10 decltype(*std::declval<yb::tserver::Heartbeater::Thread*&>().*std::declval<void (yb::tserver::Heartbeater::Thread::*&)()>()()) std::__invoke[abi:ue170006]<void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&, void>(void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&) ${YB_THIRDPARTY_DIR}/installed/tsan/libcxx/include/c++/v1/__type_traits/invoke.h:308:25 (libtserver.so+0x24206b)
...

  Previous write of size 8 at 0x7b4000000bf0 by main thread:
    #0 boost::intrusive_ptr<yb::Status::State>::swap(boost::intrusive_ptr<yb::Status::State>&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:210:16 (libyb_util.so+0x3c5c54)
    #1 boost::intrusive_ptr<yb::Status::State>::operator=(boost::intrusive_ptr<yb::Status::State>&&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:122:61 (libyb_util.so+0x3c5c54)
    #2 yb::Status::operator=(yb::Status&&) ${YB_SRC_ROOT}/src/yb/util/status.h:98:7 (libyb_util.so+0x3c5c54)
    #3 yb::MockClock::Set(yb::PhysicalTime const&) ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:147:16 (libyb_util.so+0x3c5c54)
    #4 yb::ClockSynchronizationTest_TestClockSkewError_Test::TestBody() ${YB_SRC_ROOT}/src/yb/integration-tests/clock_synchronization-itest.cc:131:15 (clock_synchronization-itest+0x12e3ca)
    #5 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2599:10 (libgtest.so.1.12.1+0x894f9)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2635:14 (libgtest.so.1.12.1+0x894f9)
    #7 testing::Test::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2674:5 (libgtest.so.1.12.1+0x6123f)
    #8 testing::TestInfo::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2853:11 (libgtest.so.1.12.1+0x62a05)
    #9 testing::TestSuite::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:3012:30 (libgtest.so.1.12.1+0x63f04)
    #10 testing::internal::UnitTestImpl::RunAllTests() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:5870:44 (libgtest.so.1.12.1+0x7be3d)
...

**SUMMARY**: ThreadSanitizer: data race ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 in boost::intrusive_ptr<yb::Status::State>::get() const
```

### Fix

Follow what is already done for `value_`: wrap `mock_status_` in `boost::atomic`.
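A rough sketch of the idea, under clearly stated assumptions: the real fix uses `boost::atomic` around a `yb::Status`, whereas this illustration uses `std::atomic` and a plain integer error code as a stand-in, and `MockClockSketch` is a hypothetical class, not the real `MockClock`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified mock clock. A plain error code stands in for
// the status object; 0 means "ok".
class MockClockSketch {
 public:
  // Called by the test thread to set the mocked time and clear any error.
  void Set(int64_t micros) {
    value_.store(micros, std::memory_order_release);
    error_code_.store(0, std::memory_order_release);
  }

  // Called by the test thread to make subsequent reads fail.
  void SetError(int code) {
    error_code_.store(code, std::memory_order_release);
  }

  // Called by other threads (for example a heartbeat thread).
  // Returns true and fills *micros on success, false if an error is set.
  bool Now(int64_t* micros) const {
    if (error_code_.load(std::memory_order_acquire) != 0) {
      return false;
    }
    *micros = value_.load(std::memory_order_acquire);
    return true;
  }

 private:
  // Both fields are atomic, so a writer and a reader on different
  // threads never touch them through unsynchronized plain accesses.
  std::atomic<int64_t> value_{0};
  std::atomic<int> error_code_{0};
};
```

Because both fields are accessed only through atomic loads and stores, a concurrent `Set` and `Now` no longer form the kind of unsynchronized read/write pair that the report above flags.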
Jira: DB-13604

Test Plan:
Jenkins

Ran

```
./yb_build.sh tsan --cxx-test integration-tests_clock_synchronization-itest --gtest_filter ClockSynchronizationTest.TestClockSkewError -n 50
```

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39315
ddhodge pushed a commit that referenced this pull request Nov 22, 2024
…l connection manager

Summary:
Set the thread stack size to 512 KB via pthread_attr_setstacksize in the ysql connection manager. This fixes crashes seen on Alma 9 machines.
```
#0  0x000055e430191fd7 in tcmalloc::tcmalloc_internal::PageTracker::Get(tcmalloc::tcmalloc_internal::Length) ()
#1  0x000055e430192774 in tcmalloc::tcmalloc_internal::HugePageFiller<tcmalloc::tcmalloc_internal::PageTracker>::TryGet(tcmalloc::tcmalloc_internal::Length, unsigned long)
    ()
#2  0x000055e430160b3a in tcmalloc::tcmalloc_internal::HugePageAwareAllocator::New(tcmalloc::tcmalloc_internal::Length, unsigned long) ()
#3  0x000055e4301437a2 in void* tcmalloc::tcmalloc_internal::SampleifyAllocation<tcmalloc::tcmalloc_internal::Static, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy> >(tcmalloc::tcmalloc_internal::Static&, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, unsigned long, unsigned long, void*, tcmalloc::tcmalloc_internal::Span*, unsigned long*) ()
#4  0x000055e43014339b in void* slow_alloc<tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, decltype(nullptr)>(tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, decltype(nullptr))
    ()
#5  0x000055e4301404e6 in malloc ()
#6  0x00007fc3a67b1d7e in ssl3_setup_write_buffer () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#7  0x00007fc3a67ae824 in do_ssl3_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#8  0x00007fc3a67ae3e1 in ssl3_write_bytes () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#9  0x00007fc3a67cc251 in ssl3_do_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#10 0x00007fc3a67c1a6a in state_machine () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#11 0x000055e43013d303 in mm_tls_handshake_cb (handle=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/tls.c:453
#12 0x000055e43013b9e7 in mm_epoll_step (poll=0x37beffd71a60, timeout=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/epoll.c:79
#13 0x000055e43013b386 in mm_loop_step (loop=0x37beffd72980) at ../../src/odyssey/third_party/machinarium/sources/loop.c:64
#14 machine_main (arg=0x37beffd72780) at ../../src/odyssey/third_party/machinarium/sources/machine.c:56
#15 0x00007fc3a5e89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007fc3a5f0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```
Note that this 512 KB value matches the value used in the tserver and master processes via the min_thread_stack_size_bytes GFlag introduced in https://phorge.dev.yugabyte.com/D38053.
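For illustration, setting an explicit stack size with the POSIX thread attribute API looks roughly like this. The helper names `Worker` and `RunWithStackSize` are hypothetical and are not taken from the connection manager sources; only `pthread_attr_setstacksize` and the 512 KB value come from the summary.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

constexpr std::size_t kStackSizeBytes = 512 * 1024;  // 512 KB

// Trivial thread body for the sketch: records that it ran.
void* Worker(void* arg) {
  *static_cast<int*>(arg) = 1;
  return nullptr;
}

// Starts a thread with the requested stack size and waits for it.
// Returns 0 on success, or the first pthread error code encountered.
int RunWithStackSize(std::size_t stack_size, int* ran) {
  pthread_attr_t attr;
  int rc = pthread_attr_init(&attr);
  if (rc != 0) {
    return rc;
  }
  // Request an explicit stack size instead of the platform default.
  rc = pthread_attr_setstacksize(&attr, stack_size);
  pthread_t thread;
  if (rc == 0) {
    rc = pthread_create(&thread, &attr, Worker, ran);
  }
  // The attribute object is no longer needed once the thread exists.
  pthread_attr_destroy(&attr);
  if (rc != 0) {
    return rc;
  }
  return pthread_join(thread, nullptr);
}
```

Calling `RunWithStackSize(kStackSizeBytes, &ran)` creates the thread with a 512 KB stack; `pthread_attr_setstacksize` returns an error if the requested size is below the platform minimum.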
Jira: DB-13388

Test Plan: Jenkins: enable connection manager, all tests

Reviewers: skumar, stiwary

Reviewed By: stiwary

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D40087
iSignal pushed a commit that referenced this pull request Dec 3, 2024
… failure in tsan build

Summary:
The DDL atomicity stress tests failed more often on the pg15 branch, with an error like:

```
WARNING: ThreadSanitizer: data race (pid=180911)
  Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
    #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
    #1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
    #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
    #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
    #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
    #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
    #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
    #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
    #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
    #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
    #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
    #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
    #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
    #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
    #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
    #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
    #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
    #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
    #18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
    #19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
    #20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
    #21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
    #22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
    #23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```

It appears that the function `yb::pgwrapper::LibPqTestBase::Connect` isn't
thread safe. I restructured the code to make the connections in a single thread
and then pass them to various concurrent threads for testing.
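The restructuring described above follows a common pattern, sketched here with hypothetical names. `FakeConn` and `ConnectSketch` are stand-ins for the real connection type and connect call; the point is only that all connections are opened sequentially on one thread before any worker thread starts.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a database connection; not the real PGConn.
struct FakeConn {
  int id;
  int statements_run = 0;
  void Execute() { ++statements_run; }
};

// Hypothetical stand-in for a connect call that is assumed to be unsafe
// to invoke from several threads at once.
FakeConn ConnectSketch(int id) { return FakeConn{id}; }

// Opens all connections on the calling thread, then runs one worker per
// connection. Returns the total number of statements executed.
int RunWorkers(int num_threads, int statements_per_thread) {
  // Step 1: connect sequentially, before any worker thread exists.
  std::vector<FakeConn> conns;
  conns.reserve(num_threads);
  for (int i = 0; i < num_threads; ++i) {
    conns.push_back(ConnectSketch(i));
  }

  // Step 2: hand each worker its own connection; none is shared.
  std::vector<std::thread> threads;
  threads.reserve(conns.size());
  for (auto& conn : conns) {
    threads.emplace_back([&conn, statements_per_thread] {
      for (int i = 0; i < statements_per_thread; ++i) {
        conn.Execute();
      }
    });
  }
  for (auto& thread : threads) {
    thread.join();
  }

  int total = 0;
  for (const auto& conn : conns) {
    total += conn.statements_run;
  }
  return total;
}
```

Each worker touches only its own connection, so the only code that ever runs the connect path is the single setup thread.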
Jira: DB-2996

Original commit: bd4874b / D37111

Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17

Verified that there are no more tsan errors.

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37167
iSignal pushed a commit that referenced this pull request Dec 3, 2024
…alled while holding the lock

Summary:
Original commit: c770d79 / D37706
Call the callback only in the ScopeExit block, not while holding the lock.

Without this fix, a thread can deadlock by requesting a shared_lock on a mutex while already holding an exclusive lock on the same mutex.

This deadlock can be triggered if there are active read/write requests to a table (from more than one thread) right after the table has had a tablet split.

With only one thread, the deadlock is unlikely, because the thread notices, as part of the callback, that the table's partition info is stale. Triggering the stack trace seen below likely requires a different thread to refresh the partition version before the main thread checks whether the table version is stale.

For example:
```
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2  0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3  0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4  0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5  0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6  0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7  0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8  0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9  0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()

#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>,  yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **

#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
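The general shape of the fix can be sketched as follows. This is an illustration under assumptions, not the MetaCache code: `CacheSketch` is a hypothetical class, and a nested scope is used here in place of the ScopeExit helper mentioned in the summary. The callback runs only after the exclusive lock has been released, so it may safely re-enter and take a shared lock on the same mutex.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <shared_mutex>

// Hypothetical cache guarded by a reader/writer mutex.
class CacheSketch {
 public:
  // Reader path: takes a shared lock. If the calling thread already held
  // the exclusive lock on mutex_, this call would never return.
  int Version() const {
    std::shared_lock<std::shared_timed_mutex> lock(mutex_);
    return version_;
  }

  // Writer path: updates state under the exclusive lock, then invokes the
  // callback only after the lock has been released.
  void Refresh(const std::function<void()>& callback) {
    {
      std::lock_guard<std::shared_timed_mutex> lock(mutex_);
      ++version_;
    }  // exclusive lock released here
    if (callback) {
      callback();  // safe to re-enter Version() or Refresh() from here
    }
  }

 private:
  mutable std::shared_timed_mutex mutex_;
  int version_ = 0;
};
```

If `callback()` were invoked inside the inner scope instead, a callback that calls `Version()` would request a shared lock while its own thread holds the exclusive lock, which is the situation shown between frames #11 and #1 of the trace.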
Jira: DB-12651

Test Plan:
Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock

Reviewers: rthallam, hsunder, qhu, timur

Reviewed By: rthallam

Subscribers: ybase, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D37788
iSignal pushed a commit that referenced this pull request Jan 14, 2025
…kOperation::GetDocPaths

Summary:
```
../../src/yb/docdb/conflict_resolution.cc:865:5: runtime error: load of value 4458368, which is not a valid value for type 'IsolationLevel'
[m-1] W1212 08:27:12.423846 99051 master_heartbeat_service.cc:426] Could not get YSQL db catalog versions for heartbeat response:
[m-1] W1212 08:27:12.425787 99052 master_heartbeat_service.cc:426] Could not get YSQL db catalog versions for heartbeat response:
    #0 0x7fd9ae920c0e in yb::docdb::(anonymous namespace)::GetWriteRequestIntents(std::vector<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>, std::allocator<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>>> const&, yb::dockv::KeyBytes*, yb::StronglyTypedBool<yb::dockv::PartialRangeKeyIntents_Tag>, yb::IsolationLevel) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:865:5
    #1 0x7fd9ae9190ce in yb::docdb::(anonymous namespace)::TransactionConflictResolverContext::GetRequestedIntents(yb::docdb::(anonymous namespace)::ConflictResolver*, yb::dockv::KeyBytes*) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:1130:22
    #2 0x7fd9ae90fcee in yb::docdb::(anonymous namespace)::TransactionConflictResolverContext::ReadConflicts(yb::docdb::(anonymous namespace)::ConflictResolver*) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:1164:22
    #3 0x7fd9ae8fd333 in yb::docdb::(anonymous namespace)::ConflictResolver::Resolve() ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:198:26
    #4 0x7fd9ae8fc3b2 in yb::docdb::(anonymous namespace)::WaitOnConflictResolver::TryPreWait() ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:697:25
    #5 0x7fd9ae8fc3b2 in yb::docdb::(anonymous namespace)::WaitOnConflictResolver::Run() ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:670:7
    #6 0x7fd9ae8fa47d in yb::docdb::ResolveTransactionConflicts(std::vector<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>, std::allocator<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>>> const&, yb::docdb::ConflictManagementPolicy, yb::docdb::LWKeyValueWriteBatchPB const&, yb::HybridTime, yb::HybridTime, long, unsigned long, long, yb::docdb::DocDB const&, yb::StronglyTypedBool<yb::dockv::PartialRangeKeyIntents_Tag>, yb::TransactionStatusManager*, yb::tablet::TabletMetrics*, yb::docdb::LockBatch*, yb::docdb::WaitQueue*, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, boost::function<void (yb::Result<yb::HybridTime> const&)>) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:1401:15
    #7 0x7fd9aff22ca9 in yb::tablet::WriteQuery::DoExecute() ${YB_SRC_ROOT}/src/yb/tablet/write_query.cc:801:10
    #8 0x7fd9aff1fda3 in yb::tablet::WriteQuery::Execute(std::unique_ptr<yb::tablet::WriteQuery, std::default_delete<yb::tablet::WriteQuery>>) ${YB_SRC_ROOT}/src/yb/tablet/write_query.cc:618:28
    #9 0x7fd9afb9b708 in yb::tablet::Tablet::AcquireLocksAndPerformDocOperations(std::unique_ptr<yb::tablet::WriteQuery, std::default_delete<yb::tablet::WriteQuery>>) ${YB_SRC_ROOT}/src/yb/tablet/tablet.cc:2147:3
    #10 0x7fd9afd166d7 in yb::tablet::TabletPeer::WriteAsync(std::unique_ptr<yb::tablet::WriteQuery, std::default_delete<yb::tablet::WriteQuery>>) ${YB_SRC_ROOT}/src/yb/tablet/tablet_peer.cc:704:12
    #11 0x7fd9b0fd4d57 in yb::tserver::TabletServiceImpl::PerformWrite(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext*) ${YB_SRC_ROOT}/src/yb/tserver/tablet_service.cc:2325:16
    #12 0x7fd9b0fd711a in yb::tserver::TabletServiceImpl::Write(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext) ${YB_SRC_ROOT}/src/yb/tserver/tablet_service.cc:2345:17
    #13 0x7fd9a91098ec in yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext)::operator()(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext) const ${BUILD_ROOT}/src/yb/tserver/tserver_service.service.cc:848:9
    #14 0x7fd9a91098ec in auto yb::rpc::HandleCall<yb::rpc::RpcCallPBParamsImpl<yb::tserver::WriteRequestPB, yb::tserver::WriteResponsePB>, yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext)>(std::shared_ptr<yb::rpc::InboundCall>, yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext)) ${YB_SRC_ROOT}/src/yb/rpc/local_call.h:126:7
    #15 0x7fd9a91098ec in yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const ${BUILD_ROOT}/src/yb/tserver/tserver_service.service.cc:846:7
    #16 0x7fd9a91098ec in decltype(std::declval<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&>()(std::declval<std::shared_ptr<yb::rpc::InboundCall>>())) std::__invoke[abi:ue170006]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>>(yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__type_traits/invoke.h:340:25
    #17 0x7fd9a91098ec in void std::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>>(yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__type_traits/invoke.h:415:5
    #18 0x7fd9a91098ec in std::__function::__alloc_func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ue170006](std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:192:16
    #19 0x7fd9a91098ec in std::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:363:12
    #20 0x7fd9a9108b34 in std::__function::__value_func<void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ue170006](std::shared_ptr<yb::rpc::InboundCall>&&) const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:517:16
    #21 0x7fd9a9108b34 in std::function<void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::shared_ptr<yb::rpc::InboundCall>) const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:1168:12
    #22 0x7fd9a9108b34 in yb::tserver::TabletServerServiceIf::Handle(std::shared_ptr<yb::rpc::InboundCall>) ${BUILD_ROOT}/src/yb/tserver/tserver_service.service.cc:831:3
    #23 0x7fd9a5b7a892 in yb::rpc::ServicePoolImpl::Handle(std::shared_ptr<yb::rpc::InboundCall>) ${YB_SRC_ROOT}/src/yb/rpc/service_pool.cc:269:19
    #24 0x7fd9a59fcea5 in yb::rpc::InboundCall::InboundCallTask::Run() ${YB_SRC_ROOT}/src/yb/rpc/inbound_call.cc:317:13
    #25 0x7fd9a5bad15d in yb::rpc::(anonymous namespace)::Worker::Execute() ${YB_SRC_ROOT}/src/yb/rpc/thread_pool.cc:115:15
    #26 0x7fd9a42ad037 in std::__function::__value_func<void ()>::operator()[abi:ue170006]() const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:517:16
    #27 0x7fd9a42ad037 in std::function<void ()>::operator()() const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:1168:12
    #28 0x7fd9a42ad037 in yb::Thread::SuperviseThread(void*) ${YB_SRC_ROOT}/src/yb/util/thread.cc:895:3
    #29 0x56176114abea in asan_thread_start(void*) ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:225:31
    #30 0x7fd99f1071c9 in start_thread (/lib64/libpthread.so.0+0x81c9) (BuildId: 1962602ac5dc3011b6d697b38b05ddc244197114)
    #31 0x7fd99eb488d2 in clone (/lib64/libc.so.6+0x398d2) (BuildId: 37e4ac6a7fb96950b0e6bf72d73d94f3296c77eb)

UndefinedBehaviorSanitizer: undefined-behavior ../../src/yb/docdb/conflict_resolution.cc:865:5 in
```

IsolationLevel is left uninitialized at `PgsqlLockOperation::GetDocPaths`, so the ASAN builds can hit the `not a valid value` failure.
Jira: DB-14458
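
A minimal sketch of the pattern, assuming an out-parameter style like the one `GetDocPaths` suggests. The enum values, the function name `GetDocPathsSketch`, and the integer paths are hypothetical stand-ins, not the actual YugabyteDB declarations, and the real fix may differ. The point is only that every code path assigns the out-parameter, so a later load never sees an invalid enumerator.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real isolation level enum.
enum class IsolationLevel { kNonTransactional = 0, kSnapshot = 1, kSerializable = 2 };

// Hypothetical stand-in for an operation that reports its doc paths.
// The out-parameter `level` is assigned unconditionally, so callers that
// read it afterwards always observe a valid enumerator.
void GetDocPathsSketch(std::vector<int>* paths, IsolationLevel* level) {
  paths->push_back(42);
  *level = IsolationLevel::kNonTransactional;
}
```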

Test Plan: advisory_lock-test

Reviewers: bkolagani

Reviewed By: bkolagani

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D40638
iSignal pushed a commit that referenced this pull request Jan 14, 2025
…work

Summary:
Fix asan failure for tests running the isolation regress
framework: org.yb.pgsql.TestPgRegressWaitQueues,
org.yb.pgsql.TestPgRegressIsolationWithoutWaitQueues and
org.yb.pgsql.TestPgRegressIsolation.

The failure is:
```
+=================================================================
+==31476==ERROR: LeakSanitizer: detected memory leaks
+
+Direct leak of 864 byte(s) in 4 object(s) allocated from:
+    #0 0x55fc1116466e in malloc /opt/yb-build/llvm/yb-llvm-v17.0.6-yb-1-1720414757-9b881774-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:69:3
+    #1 0x7f94aa200490 in PQmakeEmptyPGresult /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:164:24
+    #2 0x7f94aa21df9d in getRowDescriptions /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-protocol3.c
+    #3 0x7f94aa21c0bc in pqParseInput3 /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-protocol3.c:324:11
+    #4 0x7f94aa207028 in parseInput /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2014:2
+    #5 0x7f94aa207028 in PQgetResult /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2100:3
+    #6 0x7f94aa208437 in PQexecFinish /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2417:19
+    #7 0x7f94aa208437 in PQexecParams /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2279:9
+    #8 0x55fc111a2881 in main /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/test/isolation/../../../../../../src/postgres/src/test/isolation/isolationtester.c:201:9
+    #9 0x7f94a8caa7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 37e4ac6a7fb96950b0e6bf72d73d94f3296c77eb)
+
+Objects leaked above:
+0x511000004640 (216 bytes)
+0x511000008ec0 (216 bytes)
+0x5110000133c0 (216 bytes)
+0x5110000179c0 (216 bytes)
```
Jira: DB-13172

Test Plan: Jenkins: test regex: .*RegressIsolation.*|.*RegressWaitQueues.*

Reviewers: patnaik.balivada

Reviewed By: patnaik.balivada

Subscribers: jason, yql

Differential Revision: https://phorge.dev.yugabyte.com/D40987
iSignal pushed a commit that referenced this pull request Feb 12, 2025
…to abort transactions

Summary:
One of the stress tests crashed with the following trace:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGABRT
  * frame #0: 0x00007f4ecfd66acf libc.so.6`raise + 271
    frame #1: 0x00007f4ecfd39ea5 libc.so.6`abort + 295
    frame #2: 0x00005606b0891403 yb-server`abort_message + 195
    frame #3: 0x00005606b0890f9c yb-server`demangling_terminate_handler() + 268
    frame #4: 0x00005606b0890c66 yb-server`std::__terminate(void (*)()) + 6
    frame #5: 0x00005606b0892bab yb-server`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 27
    frame #6: 0x00005606b0892b3f yb-server`__cxa_throw + 111
    frame #7: 0x00005606ae47cd1e yb-server`std::__1::__throw_bad_weak_ptr[abi:ue170006]() at shared_ptr.h:137:5
    frame #8: 0x00005606af954382 yb-server`yb::tablet::TransactionParticipant::Impl::Abort(yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, std::__1::function<void (yb::Result<yb::TransactionStatusResult>)>) [inlined] std::__1::shared_ptr<yb::tablet::RunningTransaction>::shared_ptr[abi:ue170006]<yb::tablet::RunningTransaction, void>(this=<unavailable>, __r=std::__1::weak_ptr<yb::tablet::RunningTransaction>::element_type @ 0x0000162400000001) at shared_ptr.h:704:13
    frame #9: 0x00005606af954334 yb-server`yb::tablet::TransactionParticipant::Impl::Abort(yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, std::__1::function<void (yb::Result<yb::TransactionStatusResult>)>) [inlined] std::__1::enable_shared_from_this<yb::tablet::RunningTransaction>::shared_from_this[abi:ue170006](this=0x00001624cd5da018) at shared_ptr.h:1954:17
    frame #10: 0x00005606af954334 yb-server`yb::tablet::TransactionParticipant::Impl::Abort(yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, std::__1::function<void (yb::Result<yb::TransactionStatusResult>)>) [inlined] yb::tablet::RunningTransaction::Abort(this=0x00001624cd5da018, client=0x00001624fd7d7f10, callback=yb::TransactionStatusCallback @ 0x00007f4babc39700, lock=0x00007f4babc396e0)>, std::__1::unique_lock<std::__1::mutex>*) at running_transaction.cc:200:34
    frame #11: 0x00005606af953ccf yb-server`yb::tablet::TransactionParticipant::Impl::Abort(this=<unavailable>, id=<unavailable>, callback=<unavailable>)>) at transaction_participant.cc:707:45
    frame #12: 0x00005606af95d7d4 yb-server`yb::tablet::TransactionParticipant::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*) at transaction_participant.cc:1355:7
    frame #13: 0x00005606af95d3e5 yb-server`yb::tablet::TransactionParticipant::StopActiveTxnsPriorTo(this=<unavailable>, cutoff=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4babc398b8, exclude_txn_id=<unavailable>) at transaction_participant.cc:2700:17
```

This suggests that the underlying `RunningTransaction` is destroyed before we call `shared_from_this()` on it. This happens because we release the transaction participant's lock before creating a shared reference to the `RunningTransaction` instance we are trying to abort.

This diff fixes the issue by creating the shared reference before releasing the participant's mutex, and then using it later.
Jira: DB-14948
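
The ordering described above can be sketched as follows. `RunningTxnSketch`, `ParticipantSketch` and `Pin` are hypothetical simplifications for illustration, not the real `TransactionParticipant` code. The owning reference is taken while the mutex is held, so the object stays alive after the lock is released even if it is removed from the map in the meantime.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical, simplified stand-in for a running transaction.
class RunningTxnSketch : public std::enable_shared_from_this<RunningTxnSketch> {
 public:
  explicit RunningTxnSketch(int id) : id_(id) {}
  int id() const { return id_; }

 private:
  int id_;
};

// Hypothetical, simplified stand-in for the participant that owns the transactions.
class ParticipantSketch {
 public:
  void Add(int id) {
    std::lock_guard<std::mutex> lock(mutex_);
    txns_[id] = std::make_shared<RunningTxnSketch>(id);
  }

  void Remove(int id) {
    std::lock_guard<std::mutex> lock(mutex_);
    txns_.erase(id);
  }

  // Creates the shared reference while the mutex is still held. The caller
  // can keep using the result after the lock is gone, even if Remove() runs.
  std::shared_ptr<RunningTxnSketch> Pin(int id) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = txns_.find(id);
    if (it == txns_.end()) {
      return nullptr;
    }
    return it->second->shared_from_this();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<int, std::shared_ptr<RunningTxnSketch>> txns_;
};
```

Calling `shared_from_this()` on an object whose last owner is already gone throws `std::bad_weak_ptr` (since C++17), which matches the `__throw_bad_weak_ptr` frame in the trace.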

Test Plan: Jenkins

Reviewers: esheng

Reviewed By: esheng

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D41384
iSignal pushed a commit that referenced this pull request Mar 11, 2025
Summary:
The FK constraint check has the following optimization for batched reads:
- Each FK constraint registers the ybctids that need to be checked.
- When a particular ybctid is checked for a particular FK constraint, all registered ybctids are read at once and the result is cached.
- Checking a ybctid for another FK constraint consults the cached result instead of issuing a real read request.

In Postgres, a constraint can be one of two types:
- IMMEDIATE (default). Checked at statement end.
- DEFERRED. Checked at transaction end.

Currently the FK optimization works incorrectly when multiple constraints of different types are used in the same transaction. The reason is that YSQL registers ybctids for both types of constraints in a single map.

Example:

```
1. CREATE TABLE pk_t(k INT PRIMARY KEY);
2. CREATE TABLE fk_t(k INT PRIMARY KEY,
                     pk_1 INT REFERENCES pk_t(k),
                     pk_2 INT REFERENCES pk_t(k) DEFERRABLE INITIALLY DEFERRED);
3. INSERT INTO pk_t VALUES (1);
4. BEGIN;
5. INSERT INTO fk_t VALUES(1, 1, 2);
6. INSERT INTO pk_t VALUES(2);
7. COMMIT;
```

- On step #5 YSQL inserts the value `(1, 1, 2)` into a table with 2 FK-referencing columns, where the constraint on the second column is `DEFERRED`.
- The two constraints register ybctids for the rows `k = 1` and `k = 2` in table `pk_t`.
- Because the constraint on the first column is not deferred, it is checked immediately (at the end of the statement).
- Due to the optimization, both registered ybctids are read at once and the result is cached. The result contains `k = 1` only, because `k = 2` is not inserted until step #6.
- On step #7 YSQL checks the constraint on the second column using the cached result, which does not contain the `k = 2` row inserted on step #6.

The solution is to store ybctids for deferred and non-deferred constraints in different structures and read them independently. All the ybctids registered for deferred constraints are read only at the transaction commit step (step #7).
For this purpose a new `YBCNotifyDeferredTriggersProcessingStarted()` function is introduced, which is called right before the deferred triggers fire at the beginning of `COMMIT` command processing.
Jira: DB-14665
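
A toy model of the separation, under heavy simplification: integers stand in for ybctids, a `std::set` stands in for the referenced table, and `FkCheckerSketch` with its methods is a hypothetical name rather than the YSQL implementation. It only illustrates why keeping two independent sets avoids reading deferred keys too early.

```cpp
#include <cassert>
#include <set>

// Hypothetical model of the FK check with separate structures per constraint type.
class FkCheckerSketch {
 public:
  explicit FkCheckerSketch(const std::set<int>* table) : table_(table) {}

  void RegisterImmediate(int key) { immediate_.insert(key); }
  void RegisterDeferred(int key) { deferred_.insert(key); }

  // Statement end: one batched read covering the immediate keys only.
  bool CheckImmediate() { return AllPresent(&immediate_); }

  // Commit: deferred keys are read only now, so rows inserted later in the
  // same transaction are visible to the check.
  bool CheckDeferred() { return AllPresent(&deferred_); }

 private:
  // Looks up every registered key in the table, then forgets the keys.
  bool AllPresent(std::set<int>* keys) {
    bool ok = true;
    for (int key : *keys) {
      ok = ok && table_->count(key) > 0;
    }
    keys->clear();
    return ok;
  }

  const std::set<int>* table_;
  std::set<int> immediate_;
  std::set<int> deferred_;
};
```

Replaying the example: register key 1 as immediate and key 2 as deferred, run `CheckImmediate()` while the table holds only 1, insert 2, then run `CheckDeferred()`. Both checks pass because key 2 is never read before commit.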

Test Plan:
Jenkins

A new unit test is introduced:

```
./yb_build.sh --gtest_filter PgFKeyTest.DeferredConstraintReadAtTxnEnd
```

Reviewers: pjain, myang, kramanathan, patnaik.balivada

Reviewed By: pjain

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D40896
iSignal pushed a commit that referenced this pull request Apr 1, 2025
Summary:
After commit f85bbca, the vmodule flag is no longer respected by the postgres process, for example:
```
ybd release --cxx-test pgwrapper_pg_analyze-test --gtest_filter PgAnalyzeTest.AnalyzeSamplingColocated --test-args '--vmodule=pg_sample=1' -n 2 -- -p 1 -k
zgrep pg_sample ~/logs/latest_test/1.log
```
shows no vlogs.

The reason is that `VLOG(1)` is used early by
```
#0  0x00007f7e1b48b090 in google::InitVLOG3__(google::SiteFlag*, int*, char const*, int)@plt () from /net/dev-server-timur/share/code/yugabyte-db/build/debug-clang19-dynamic-ninja/lib/libyb_util_shmem.so
#1  0x00007f7e1b47616e in yb::(anonymous namespace)::NegotiatorSharedState::WaitProposal (this=0x7f7e215e8000) at ../../src/yb/util/shmem/reserved_address_segment.cc:108
#2  0x00007f7e1b4781e0 in yb::AddressSegmentNegotiator::Impl::NegotiateChild (fd=45) at ../../src/yb/util/shmem/reserved_address_segment.cc:252
#3  0x00007f7e1b4737ce in yb::AddressSegmentNegotiator::NegotiateChild (fd=45) at ../../src/yb/util/shmem/reserved_address_segment.cc:376
#4  0x00007f7e1b742b7b in yb::tserver::SharedMemoryManager::InitializePostmaster (this=0x7f7e202e9788 <yb::pggate::PgSharedMemoryManager()::shared_mem_manager>, fd=45) at ../../src/yb/tserver/tserver_shared_mem.cc:252
#5  0x00007f7e2023588f in yb::pggate::PgSetupSharedMemoryAddressSegment () at ../../src/yb/yql/pggate/pg_shared_mem.cc:29
#6  0x00007f7e202788e9 in YBCSetupSharedMemoryAddressSegment () at ../../src/yb/yql/pggate/ybc_pg_shared_mem.cc:22
#7  0x000055636b8956f5 in PostmasterMain (argc=21, argv=0x52937fe4e790) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1083
#8  0x000055636b774bfe in PostgresServerProcessMain (argc=21, argv=0x52937fe4e790) at ../../../../../../src/postgres/src/backend/main/main.c:209
#9  0x000055636b7751f2 in main ()
```
and caches the `vmodule` value before `InitGFlags` sets it from the environment.

The fix is to explicitly call `UpdateVmodule` from `InitGFlags` after setting `vmodule`.
Jira: DB-15888
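
The caching problem can be modeled in a few lines. `VerbositySketch` is a hypothetical class, and it reduces the flag to a single integer level instead of glog's per-module patterns. It shows why a value cached on first use goes stale and why an explicit refresh after setting the flag restores the expected behavior.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a verbosity flag whose parsed form is cached on first use.
class VerbositySketch {
 public:
  // Stores the raw flag value, as if it were set from the environment.
  void SetFlag(const std::string& value) { flag_ = value; }

  // Re-parses the flag. Stands in for the explicit update after the flag is set.
  void Refresh() {
    level_ = flag_.empty() ? 0 : std::stoi(flag_);
    cached_ = true;
  }

  // The first call caches the level. Later flag changes are not seen
  // until Refresh() is called again.
  bool VlogEnabled(int level) {
    if (!cached_) {
      Refresh();
    }
    return level <= level_;
  }

 private:
  std::string flag_;
  int level_ = 0;
  bool cached_ = false;
};
```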

Test Plan:
```
ybd release --cxx-test pgwrapper_pg_analyze-test --gtest_filter PgAnalyzeTest.AnalyzeSamplingColocated --test-args '--vmodule=pg_sample=1' -n 2 -- -p 1 -k
zgrep pg_sample ~/logs/latest_test/1.log
```

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: ybase, yql

Tags: #jenkins-ready, #jenkins-trigger

Differential Revision: https://phorge.dev.yugabyte.com/D42731
iSignal pushed a commit that referenced this pull request Apr 15, 2025
iSignal pushed a commit that referenced this pull request Apr 29, 2025
…rdup for tablegroup_name

Summary:
As part of D36859 / 0dbe7d6, backup and restore support for colocated tables when multiple tablespaces exist was introduced. Upon fetching the tablegroup_name from `pg_yb_tablegroup`, the value was read and assigned via `PQgetvalue` without copying. This led to a use-after-free bug when the tablegroup_name was later read in dumpTableSchema since the result from the SQL query is immediately cleared in the next line (`PQclear`).

```
[P-yb-controller-1] ==3037==ERROR: AddressSanitizer: heap-use-after-free on address 0x51d0002013e6 at pc 0x55615b0a1f92 bp 0x7fff92475970 sp 0x7fff92475118
[P-yb-controller-1] READ of size 8 at 0x51d0002013e6 thread T0
[P-yb-controller-1]     #0 0x55615b0a1f91 in strcmp ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:470:5
[P-yb-controller-1]     #1 0x55615b1b90ba in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15789:8
[P-yb-controller-1]     #2 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1]     #3 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1]     #4 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1]     #5 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
[P-yb-controller-1]     #6 0x55615b0894bd in _start (${BUILD_ROOT}/postgres/bin/ysql_dump+0x10d4bd)
[P-yb-controller-1]
[P-yb-controller-1] 0x51d0002013e6 is located 358 bytes inside of 2048-byte region [0x51d000201280,0x51d000201a80)
[P-yb-controller-1] freed by thread T0 here:
[P-yb-controller-1]     #0 0x55615b127196 in free ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
[P-yb-controller-1]     #1 0x7f3c02d65e85 in PQclear ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:755:3
[P-yb-controller-1]     #2 0x55615b1c0103 in getYbTablePropertiesAndReloptions ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:19108:4
[P-yb-controller-1]     #3 0x55615b1b8fab in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15765:3
[P-yb-controller-1]     #4 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1]     #5 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1]     #6 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1]     #7 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
[P-yb-controller-1]
[P-yb-controller-1] previously allocated by thread T0 here:
[P-yb-controller-1]     #0 0x55615b12742f in malloc ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
[P-yb-controller-1]     #1 0x7f3c02d680a7 in pqResultAlloc ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:633:28
[P-yb-controller-1]     #2 0x7f3c02d81294 in getRowDescriptions ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-protocol3.c:544:4
[P-yb-controller-1]     #3 0x7f3c02d7f793 in pqParseInput3 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-protocol3.c:324:11
[P-yb-controller-1]     #4 0x7f3c02d6bcc8 in parseInput ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2014:2
[P-yb-controller-1]     #5 0x7f3c02d6bcc8 in PQgetResult ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2100:3
[P-yb-controller-1]     #6 0x7f3c02d6cd87 in PQexecFinish ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2417:19
[P-yb-controller-1]     #7 0x7f3c02d6cd87 in PQexec ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/fe-exec.c:2256:9
[P-yb-controller-1]     #8 0x55615b1f45df in ExecuteSqlQuery ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_backup_db.c:296:8
[P-yb-controller-1]     #9 0x55615b1f4213 in ExecuteSqlQueryForSingleRow ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_backup_db.c:311:8
[P-yb-controller-1]     #10 0x55615b1c008d in getYbTablePropertiesAndReloptions ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:19102:10
[P-yb-controller-1]     #11 0x55615b1b8fab in dumpTableSchema ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15765:3
[P-yb-controller-1]     #12 0x55615b178163 in dumpTable ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:15299:4
[P-yb-controller-1]     #13 0x55615b178163 in dumpDumpableObject ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:10216:4
[P-yb-controller-1]     #14 0x55615b178163 in main ${YB_SRC_ROOT}/src/postgres/src/bin/pg_dump/pg_dump.c:1019:3
[P-yb-controller-1]     #15 0x7f3c0184e7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: fd70eb98f80391a177070fcb8d757a63fe49b802)
```

This revision fixes the issue by using pg_strdup to make a copy of the string.
Jira: DB-15915
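
The ownership issue can be illustrated without libpq. In the sketch below, `ResultSketch`, `MakeResult`, `ClearResult` and `ReadTablegroupName` are hypothetical helpers that mimic a result buffer owning its string storage. They are not libpq or pg_dump functions, and allocation failure handling is omitted for brevity. The value is copied into caller-owned storage before the result is freed, which is the role the string copy plays in the fix.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical result buffer that owns the memory behind its field value.
struct ResultSketch {
  char* value;
};

// Allocates a result holding a private copy of `text`.
ResultSketch* MakeResult(const char* text) {
  auto* res = static_cast<ResultSketch*>(std::malloc(sizeof(ResultSketch)));
  res->value = static_cast<char*>(std::malloc(std::strlen(text) + 1));
  std::strcpy(res->value, text);
  return res;
}

// Frees the result together with every pointer handed out from it.
void ClearResult(ResultSketch* res) {
  std::free(res->value);
  std::free(res);
}

// Copies the field into caller-owned storage before clearing the result.
// Keeping `res->value` itself past ClearResult() would be a use-after-free.
std::string ReadTablegroupName(const char* text) {
  ResultSketch* res = MakeResult(text);
  std::string name(res->value);
  ClearResult(res);
  return name;
}
```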

Test Plan: ./yb_build.sh asan --cxx-test integration-tests_xcluster_ddl_replication-test --gtest_filter XClusterDDLReplicationTest.DDLReplicationTablesNotColocated

Reviewers: aagrawal, skumar, mlillibridge, sergei

Reviewed By: aagrawal, sergei

Subscribers: sergei, yql

Differential Revision: https://phorge.dev.yugabyte.com/D43386
ddhodge pushed a commit that referenced this pull request Jun 14, 2025
…ck/release functions at TabletService

Summary:
In functions `TabletServiceImpl::AcquireObjectLocks` and `TabletServiceImpl::ReleaseObjectLocks`, we weren't returning after executing the RPC callback when the initial validation steps fail. This led to SIGSEGV crashes like the one below:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) [inlined]
std::__1::unique_ptr<yb::tserver::TSLocalLockManager::Impl, std::__1::default_delete<yb::tserver::TSLocalLockManager::Impl>>::operator->[abi:ne190100](this=0x0000000000000000) const at unique_ptr.h:272:108
    frame #1: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext) [inlined]
yb::tserver::TSLocalLockManager::AcquireObjectLocksAsync(this=0x0000000000000000, req=0x00005001bfffa290, deadline=yb::CoarseTimePoint @ x23, callback=0x0000ffefb6066560, wait=(value_ = true)) at ts_local_lock_manager.cc:541:3
    frame #2: 0x0000aaaac351e5f0 yb-tserver`yb::tserver::TabletServiceImpl::AcquireObjectLocks(this=0x00005001bdaf6020, req=0x00005001bfffa290, resp=0x00005001bfffa300, context=<unavailable>) at tablet_service.cc:3673:26
    frame #3: 0x0000aaaac36bd9a0 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
yb::tserver::TabletServerServiceIf::InitMethods(this=<unavailable>, req=0x00005001bfffa290, resp=0x00005001bfffa300, rpc_context=RpcContext @ 0x0000ffefb6066600)::$_36::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>)
const::'lambda'(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*, yb::rpc::RpcContext)::operator()(yb::tserver::AcquireObjectLockRequestPB const*, yb::tserver::AcquireObjectLockResponsePB*,
yb::rpc::RpcContext) const at tserver_service.service.cc:1470:9
    frame #4: 0x0000aaaac36bd978 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) at
local_call.h:126:7
    frame #5: 0x0000aaaac36bd680 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36::operator()(this=<unavailable>, call=<unavailable>) const at tserver_service.service.cc:1468:7
    frame #6: 0x0000aaaac36bd5c8 yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
decltype(std::declval<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&>()(std::declval<std::__1::shared_ptr<yb::rpc::InboundCall>>()))
std::__1::__invoke[abi:ne190100]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&, std::__1::shared_ptr<yb::rpc::InboundCall>>(__f=<unavailable>, __args=<unavailable>) at invoke.h:149:25
    frame #7: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] void
std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne190100]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36&, std::__1::shared_ptr<yb::rpc::InboundCall>>(__args=<unavailable>,
__args=<unavailable>) at invoke.h:224:5
    frame #8: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined]
std::__1::__function::__alloc_func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36, std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity>
const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ne190100](this=<unavailable>, __arg=<unavailable>) at function.h:171:12
    frame #9: 0x0000aaaac36bd5bc yb-tserver`std::__1::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36,
std::__1::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_36>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(this=<unavailable>, __arg=<unavailable>) at
function.h:313:10
    frame #10: 0x0000aaaac36d1384 yb-tserver`yb::tserver::TabletServerServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::__function::__value_func<void
(std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ne190100](this=<unavailable>, __args=nullptr) const at function.h:430:12
    frame #11: 0x0000aaaac36d136c yb-tserver`yb::tserver::TabletServerServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::function<void
(std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(this=<unavailable>, __arg=nullptr) const at function.h:989:10
    frame #12: 0x0000aaaac36d136c yb-tserver`yb::tserver::TabletServerServiceIf::Handle(this=<unavailable>, call=<unavailable>) at tserver_service.service.cc:913:3
    frame #13: 0x0000aaaac30e05b4 yb-tserver`yb::rpc::ServicePoolImpl::Handle(this=0x00005001bff9b8c0, incoming=nullptr) at service_pool.cc:275:19
    frame #14: 0x0000aaaac3006ed0 yb-tserver`yb::rpc::InboundCall::InboundCallTask::Run(this=<unavailable>) at inbound_call.cc:309:13
    frame #15: 0x0000aaaac30ec868 yb-tserver`yb::rpc::(anonymous namespace)::Worker::Execute(this=0x00005001bff5c640, task=0x00005001bfdf1958) at thread_pool.cc:138:13
    frame #16: 0x0000aaaac39afd18 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ne190100](this=0x00005001bfe1e750) const at function.h:430:12
    frame #17: 0x0000aaaac39afd04 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x00005001bfe1e750) const at function.h:989:10
    frame #18: 0x0000aaaac39afd04 yb-tserver`yb::Thread::SuperviseThread(arg=0x00005001bfe1e6e0) at thread.cc:937:3
```

This revision addresses the issue by returning after executing the rpc callback with validation failure status.
Jira: DB-17124
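
A minimal sketch of the control flow, with hypothetical names (`LockManagerSketch`, `AcquireSketch`) and a plain string status instead of the real RPC context. A null manager pointer models the failed validation seen in the trace (`this=0x0000000000000000`). After reporting the failure through the callback, the handler returns instead of falling through to the dereference.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical lock manager. A null pointer models the unavailable state.
struct LockManagerSketch {
  int acquired = 0;
  void Acquire() { ++acquired; }
};

// Returns true if the request reached the lock manager.
bool AcquireSketch(LockManagerSketch* manager,
                   const std::function<void(const std::string&)>& callback) {
  if (manager == nullptr) {
    callback("lock manager not available");
    return false;  // Without this return, the call below dereferences null.
  }
  manager->Acquire();
  callback("ok");
  return true;
}
```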

Test Plan: Jenkins

Reviewers: rthallam, amitanand

Reviewed By: amitanand

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D44663
ddhodge pushed a commit that referenced this pull request Jun 14, 2025
…own flags are set at ObjectLockManager

Summary:
In the context of object locking, commit 6e80c56 / D44228 removed the logic that signaled obsolete waiters belonging to transactions that had issued a release-all-locks request (such transactions could have been terminated due to failures like timeout, deadlock, etc.) in order to terminate failed waiting requests early. Hence, we now let the obsolete requests terminate organically from the OLM, resumed by the poller thread that runs at an interval of `olm_poll_interval_ms` (defaults to 100ms).

This led to one of the itests failing with the below stack
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000aaaac8a093ec yb-tserver`yb::ThreadPoolToken::SubmitFunc(std::__1::function<void ()>) [inlined] yb::ThreadPoolToken::Submit(this=<unavailable>, r=<unavailable>) at threadpool.cc:146:10
    frame #1: 0x0000aaaac8a093ec yb-tserver`yb::ThreadPoolToken::SubmitFunc(this=0x0000000000000000, f=<unavailable>) at threadpool.cc:142:10
    frame #2: 0x0000aaaac73cdfe8 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoSignal(this=0x00003342bfa0d400, entry=<unavailable>) at object_lock_manager.cc:767:3
    frame #3: 0x0000aaaac73cc7c0 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoLock(std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>, yb::docdb::LockData&&, yb::StronglyTypedBool<yb::docdb::(anonymous
namespace)::IsLockRetry_Tag>, unsigned long, yb::Status) [inlined] yb::docdb::ObjectLockManagerImpl::PrepareAcquire(this=0x00003342bfa0d400, txn_lock=<unavailable>, transaction_entry=std::__1::shared_ptr<yb::docdb::(anonymous
namespace)::TrackedTransactionLockEntry>::element_type @ 0x00003342bfa94a38, data=0x00003342b9a6a830, resume_it_offset=<unavailable>, resume_with_status=<unavailable>) at object_lock_manager.cc:523:5
    frame #4: 0x0000aaaac73cc6a8 yb-tserver`yb::docdb::ObjectLockManagerImpl::DoLock(this=0x00003342bfa0d400, transaction_entry=std::__1::shared_ptr<yb::docdb::(anonymous namespace)::TrackedTransactionLockEntry>::element_type @
0x00003342bfa94a38, data=0x00003342b9a6a830, is_retry=(value_ = true), resume_it_offset=<unavailable>, resume_with_status=Status @ 0x0000ffefaa036658) at object_lock_manager.cc:552:27
    frame #5: 0x0000aaaac73cbcb4 yb-tserver`yb::docdb::WaiterEntry::Resume(this=0x00003342b9a6a820, lock_manager=0x00003342bfa0d400, resume_with_status=<unavailable>) at object_lock_manager.cc:381:17
    frame #6: 0x0000aaaac85bdd4c yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() at object_lock_manager.cc:752:13
    frame #7: 0x0000aaaac85bda74 yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() [inlined] yb::docdb::ObjectLockManager::Shutdown(this=<unavailable>) at object_lock_manager.cc:1092:10
    frame #8: 0x0000aaaac85bda6c yb-tserver`yb::tserver::TSLocalLockManager::Shutdown() [inlined] yb::tserver::TSLocalLockManager::Impl::Shutdown(this=<unavailable>) at ts_local_lock_manager.cc:411:26
    frame #9: 0x0000aaaac85bd7e8 yb-tserver`yb::tserver::TSLocalLockManager::Shutdown(this=<unavailable>) at ts_local_lock_manager.cc:566:10
    frame #10: 0x0000aaaac8665a34 yb-tserver`yb::tserver::YsqlLeasePoller::Poll() [inlined] yb::tserver::TabletServer::ResetAndGetTSLocalLockManager(this=0x000033423fc1ad80) at tablet_server.cc:797:28
    frame #11: 0x0000aaaac8665a18 yb-tserver`yb::tserver::YsqlLeasePoller::Poll() [inlined] yb::tserver::TabletServer::ProcessLeaseUpdate(this=0x000033423fc1ad80, lease_refresh_info=0x000033423a476b80) at tablet_server.cc:828:22
    frame #12: 0x0000aaaac8665950 yb-tserver`yb::tserver::YsqlLeasePoller::Poll(this=<unavailable>) at ysql_lease_poller.cc:143:18
    frame #13: 0x0000aaaac8438d58 yb-tserver`yb::tserver::MasterLeaderPollScheduler::Impl::Run(this=0x000033423ff5cc80) at master_leader_poller.cc:125:25
    frame #14: 0x0000aaaac89ffd18 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ne190100](this=0x000033423ffc7930) const at function.h:430:12
    frame #15: 0x0000aaaac89ffd04 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000033423ffc7930) const at function.h:989:10
    frame #16: 0x0000aaaac89ffd04 yb-tserver`yb::Thread::SuperviseThread(arg=0x000033423ffc78c0) at thread.cc:937:3
    frame #17: 0x0000ffffac0378b8 libpthread.so.0`start_thread + 392
    frame #18: 0x0000ffffac093afc libc.so.6`thread_start + 12
```
This is due to accessing unique_ptr `thread_pool_token_` after it has been reset.

This revision fixes the issue by not scheduling any tasks on the threadpool once the shutdown flag has been set (hence not accessing `thread_pool_token_`). Since we wait for in-progress requests at the OLM and also for in-progress resume tasks scheduled on the messenger using `waiters_amidst_resumption_on_messenger_`, `thread_pool_token_` is not accessed once it has been reset.
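
The guard described above can be sketched roughly as follows. This is an illustration only, not the code from this revision: `FakeToken`, `LockManagerSketch`, `TrySchedule` and `PendingTasks` are hypothetical names, and a single mutex stands in for the real mechanism of waiting on in-progress requests and resume tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a thread pool token: it only records tasks.
class FakeToken {
 public:
  void Submit(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  std::size_t size() const { return tasks_.size(); }

 private:
  std::vector<std::function<void()>> tasks_;
};

// Hypothetical lock manager that refuses to schedule work after shutdown.
class LockManagerSketch {
 public:
  LockManagerSketch() : token_(std::make_unique<FakeToken>()) {}

  // Returns true if the task was handed to the token, false if shutdown has
  // already started. The token is never dereferenced once the flag is set.
  bool TrySchedule(std::function<void()> task) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (shutting_down_) {
      return false;
    }
    assert(token_ != nullptr);
    token_->Submit(std::move(task));
    return true;
  }

  // Sets the flag first, then resets the token, both under the same mutex.
  void Shutdown() {
    std::lock_guard<std::mutex> guard(mutex_);
    shutting_down_ = true;
    token_.reset();
  }

  // Number of recorded tasks, or 0 once the token has been reset.
  std::size_t PendingTasks() {
    std::lock_guard<std::mutex> guard(mutex_);
    return token_ ? token_->size() : 0;
  }

 private:
  std::mutex mutex_;
  bool shutting_down_ = false;
  std::unique_ptr<FakeToken> token_;
};
```

In this sketch a caller that loses the race with `Shutdown()` gets `false` back and has to finish the request inline or fail it, rather than dereferencing a reset pointer.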
Jira: DB-17121

Test Plan:
Jenkins

./yb_build.sh --cxx-test='TEST_F(PgObjectLocksTestRF1, TestShutdownWithWaiters) {'

Reviewers: rthallam, amitanand, sergei

Reviewed By: amitanand

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D44662
iSignal pushed a commit that referenced this pull request Jul 10, 2025
…ow during index backfill.

Summary:
In the last few weeks we have seen a few instances of the stress test (with various nemeses)
run into a master crash with a stack trace that looks like:

```
 * thread #1, name = 'yb-master', stop reason = signal SIGSEGV: invalid address
   * frame #0: 0x0000aaaad52f5fc4 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::shared_ptr<yb::master::BackfillTablet>::shared_ptr[abi:ue170006]<yb::master::BackfillTablet, void>(this=<unavailable>, __r=std::__1::weak_ptr<yb::master::BackfillTablet>::element_type @ 0x000013e4bf787778) at shared_ptr.h:701:20
     frame #1: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::enable_shared_from_this<yb::master::BackfillTablet>::shared_from_this[abi:ue170006](this=0x000013e4bf787778) at shared_ptr.h:1954:17
     frame #2: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=0x000013e4bf787778) at backfill_index.cc:1300:50
     frame #3: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
     frame #4: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4d458) at backfill_index.cc:1620:5
     frame #5: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:470:3
     frame #6: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:273:5
     frame #7: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4d458) at backfill_index.cc:1463:19
     frame #8: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #9: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
     frame #10: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cd98) at backfill_index.cc:1620:5
     frame #11: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:470:3
     frame #12: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:273:5
     frame #13: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cd98) at backfill_index.cc:1463:19
     frame #14: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #15: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
     frame #16: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cfd8) at backfill_index.cc:1620:5
     frame #17: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:470:3
     frame #18: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:273:5
     frame #19: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cfd8) at backfill_index.cc:1463:19
     frame #20: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #21: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10

...

     frame #2452: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bdc7ed98) at backfill_index.cc:1620:5
     frame #2453: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:470:3
     frame #2454: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:273:5
     frame #2455: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bdc7ed98) at backfill_index.cc:1463:19
     frame #2456: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
     frame #2457: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
     frame #2458: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4ba1ff458) at backfill_index.cc:1620:5
     frame #2459: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:470:3
     frame #2460: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:273:5
     frame #2461: 0x0000aaaad52c0260 yb-master`yb::master::RetryingRpcTask::RunDelayedTask(this=0x000013e4ba1ff458, status=0x0000ffffab2668c0) at async_rpc_tasks.cc:432:14
     frame #2462: 0x0000aaaad5c3f838 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] boost::function1<void, yb::Status const&>::operator()(this=0x000013e4bff63b18, a0=0x0000ffffab2668c0) const at function_template.hpp:763:14
     frame #2463: 0x0000aaaad5c3f81c yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] yb::rpc::DelayedTask::TimerHandler(this=0x000013e4bff63ae8, watcher=<unavailable>, revents=<unavailable>) at delayed_task.cc:155:5
     frame #2464: 0x0000aaaad5c3f284 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(loop=<unavailable>, w=<unavailable>, revents=<unavailable>) at ev++.h:479:7
     frame #2465: 0x0000aaaad4cdf170 yb-master`ev_invoke_pending + 112
     frame #2466: 0x0000aaaad4ce21fc yb-master`ev_run + 2940
     frame #2467: 0x0000aaaad5c725fc yb-master`yb::rpc::Reactor::RunThread() [inlined] ev::loop_ref::run(this=0x000013e4bfcfadf8, flags=0) at ev++.h:211:7
     frame #2468: 0x0000aaaad5c725f4 yb-master`yb::rpc::Reactor::RunThread(this=0x000013e4bfcfadc0) at reactor.cc:735:9
     frame #2469: 0x0000aaaad65c61d8 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000013e4bfeffa80) const at function.h:517:16
     frame #2470: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000013e4bfeffa80) const at function.h:1168:12
     frame #2471: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(arg=0x000013e4bfeffa20) at thread.cc:895:3
```

Essentially, a BackfillChunk is considered done (without sending out an RPC) and launches the next BackfillChunk, which does the same.

This may happen if `BackfillTable::indexes_to_build()` is empty, or if `backfill_jobs()` is empty. However, based on reading the code,
we should only get there **after** marking `BackfillTable::done_` as `true`.

If, for some reason, `indexes_to_build()` is empty while `BackfillTable::done_ == false`, we could get into this infinite recursion.

Since I am unable to explain or recreate how this happens, I'm adding a test flag `TEST_simulate_empty_indexes` to repro this.

Fix: We update `BackfillChunk::SendRequest` to handle the empty `indexes_to_build()` as a failure rather than treating this as a success.
This prevents the infinite recursion.
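
A minimal sketch of why failing the chunk breaks the cycle, under simplifying assumptions: `BackfillSketch` and the free function below are hypothetical stand-ins rather than the real `BackfillTablet` / `BackfillChunk` code, and `depth_cap` exists only so that the old behavior can be demonstrated without actually overflowing the stack.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, stripped-down view of the backfill state.
struct BackfillSketch {
  std::vector<std::string> indexes_to_build;
  bool done = false;
};

// Returns how many chunks were launched on the same call stack.
// empty_is_failure == false models the old behavior (an empty index list is
// a "success" that immediately launches the next chunk); true models the fix.
int LaunchNextChunkOrDone(const BackfillSketch& backfill, bool empty_is_failure,
                          int depth, int depth_cap) {
  assert(depth_cap > 0);
  if (backfill.done || depth >= depth_cap) {
    return depth;
  }
  if (backfill.indexes_to_build.empty()) {
    if (empty_is_failure) {
      // Fix: the chunk fails, so no further chunk is launched.
      return depth + 1;
    }
    // Old behavior: the chunk is "done" without an RPC and the next chunk
    // is launched from inside the current call.
    return LaunchNextChunkOrDone(backfill, empty_is_failure, depth + 1, depth_cap);
  }
  // Normal case: an RPC would be sent and the next chunk launched later from
  // its callback, so the stack does not grow here.
  return depth + 1;
}
```

With an empty `indexes_to_build` and `done == false`, the old path recurses until it hits the cap, while the failure path stops after a single chunk.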

Also, adding a few log lines that may help better understand the scenario if we run into this again.
Jira: DB-17296

Test Plan: yb_build.sh fastdebug  --cxx-test pg_index_backfill-test --gtest_filter *.SimulateEmptyIndexesForStackOverflow*

Reviewers: zdrudi, rthallam, jason

Reviewed By: zdrudi

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D45031
ddhodge pushed a commit that referenced this pull request Aug 20, 2025
…s closed in multi route pooling

Summary:
**Issue Summary**

A core dump was triggered during a ConnectionBurst stress test, with the crash occurring in the od_backend_close_connection function with multi route pooling. The stack trace is as follows:

frame #0: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] mm_tls_free(io=0x0000000000000000) at tls.c:91:10
    frame #1: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] machine_io_free(obj=0x0000000000000000) at io.c:201:2
    frame #2: 0x00005601a627129e odyssey`od_backend_close_connection [inlined] od_io_close(io=0x000031f53e72b8b8) at io.h:77:2
    frame #3: 0x00005601a627128c odyssey`od_backend_close_connection(server=0x000031f53e72b880) at backend.c:56:2
    frame #4: 0x00005601a6250de5 odyssey`od_router_attach(router=0x00007fff00dbeb30, client_for_router=0x000031f53e5df180, wait_for_idle=<unavailable>, external_client=0x000031f53ee30680) at router.c:1010:6
    frame #5: 0x00005601a6258b1b odyssey`od_auth_frontend [inlined] yb_execute_on_control_connection(client=0x000031f53ee30680, function=<unavailable>) at frontend.c:2842:11
    frame #6: 0x00005601a6258b0b odyssey`od_auth_frontend(client=0x000031f53ee30680) at auth.c:677:8
    frame #7: 0x00005601a626782e odyssey`od_frontend(arg=0x000031f53ee30680) at frontend.c:2539:8
    frame #8: 0x00005601a6290912 odyssey`mm_scheduler_main(arg=0x000031f53e390000) at scheduler.c:17:2
    frame #9: 0x00005601a6290b77 odyssey`mm_context_runner at context.c:28:2

**Root Cause**

The crash originated from an improper lock release in the yb_get_idle_server_to_close function, introduced in commit 55beeb0 during multi-route pooling implementation. The function released the lock on the route object, despite a comment explicitly warning against it. After returning to its caller, no lock was held on the route or idle_route. This allowed other coroutines to access and use the same route and its idle server, which the original coroutine intended to close. This race condition led to a crash due to an assertion failure during connection closure.
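
The contract described above (the helper returns with the route lock still held, so the caller can close the server before anyone else can attach to it) can be sketched as follows. This is a C++ analogy under stated assumptions, not the Odyssey C code: `Route`, `LockedIdleServer`, `GetIdleServerToClose` and `CloseIdleServer` are hypothetical names, and `std::mutex` stands in for the connection manager's own locks.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical route with a pool of idle server ids.
struct Route {
  std::mutex lock;
  std::vector<int> idle_servers;  // guarded by lock
};

// Result of the lookup: the route lock travels with the selected server.
struct LockedIdleServer {
  std::unique_lock<std::mutex> route_lock;  // still held when a server is returned
  int server_id;                            // -1 when no idle server exists
};

// Picks an idle server and hands the held lock to the caller instead of
// releasing it before returning.
LockedIdleServer GetIdleServerToClose(Route& route) {
  std::unique_lock<std::mutex> guard(route.lock);
  if (route.idle_servers.empty()) {
    // Nothing to close: guard is released on return, no lock is handed out.
    return LockedIdleServer{std::unique_lock<std::mutex>(), -1};
  }
  return LockedIdleServer{std::move(guard), route.idle_servers.back()};
}

// Closes one idle server, if any, while the route lock is held throughout.
void CloseIdleServer(Route& route) {
  LockedIdleServer locked = GetIdleServerToClose(route);
  if (!locked.route_lock.owns_lock()) {
    return;
  }
  // No other caller can pick this server until the lock is dropped.
  assert(!route.idle_servers.empty());
  route.idle_servers.pop_back();
  // The lock is released when `locked` goes out of scope.
}
```

Tying the lock's lifetime to the returned object makes it harder to reintroduce an early release by accident, which is the failure mode described in the root cause.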

**Note**
If the order of acquiring locks is the same across all threads or processes, differences in the release order alone cannot cause a deadlock. Deadlocks arise from circular dependencies during acquisition, not release.

In the connection manager code base:

- Locks are acquired in the order: router → route. This order must be strictly enforced everywhere to prevent deadlocks.
- Lock release order varies (e.g., router then route in od_router_route and yb_get_idle_server_to_close, versus the reverse elsewhere). This variation does not cause deadlocks, as release order is irrelevant to deadlock prevention.
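
The note above can be illustrated with a small C++ analogy. `PoolLocks` and both functions are hypothetical, and `std::mutex` stands in for the connection manager's locks; the point is only that both functions acquire router before route, while releasing in different orders.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical pair of locks mirroring the router -> route hierarchy.
struct PoolLocks {
  std::mutex router;
  std::mutex route;
  int operations = 0;  // guarded by route
};

// Acquires router -> route, then releases router first.
void ReleaseRouterFirst(PoolLocks& locks) {
  std::unique_lock<std::mutex> router_lock(locks.router);
  std::unique_lock<std::mutex> route_lock(locks.route);
  router_lock.unlock();  // the router lock is no longer needed
  assert(route_lock.owns_lock());
  ++locks.operations;
  // route_lock is released at scope exit.
}

// Acquires router -> route, then releases route first.
void ReleaseRouteFirst(PoolLocks& locks) {
  std::unique_lock<std::mutex> router_lock(locks.router);
  std::unique_lock<std::mutex> route_lock(locks.route);
  assert(route_lock.owns_lock());
  ++locks.operations;
  route_lock.unlock();
  router_lock.unlock();
}
```

Because neither function ever waits for the router lock while holding the route lock, no circular wait can form between them, whichever lock each one drops first.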
Jira: DB-17501

Test Plan: Jenkins: all tests

Reviewers: skumar, vikram.damle, asrinivasan, arpit.saxena

Reviewed By: skumar

Subscribers: svc_phabricator, yql

Differential Revision: https://phorge.dev.yugabyte.com/D45641
iSignal pushed a commit that referenced this pull request Nov 26, 2025
Summary:
The stacktrace of the core dump:
```
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000aaaac59fb720 postgres`FreeTupleDesc [inlined] GetMemoryChunkContext(pointer=0x0000000000000000) at memutils.h:141:12
    frame #1: 0x0000aaaac59fb710 postgres`FreeTupleDesc [inlined] pfree(pointer=0x0000000000000000) at mcxt.c:1500:26
    frame #2: 0x0000aaaac59fb710 postgres`FreeTupleDesc(tupdesc=0x000013d7fd8dccc8) at tupdesc.c:326:5
    frame #3: 0x0000aaaac61c7204 postgres`RelationDestroyRelation(relation=0x000013d7fd8dc9a8, remember_tupdesc=false) at relcache.c:4577:4
    frame #4: 0x0000aaaac5febab8 postgres`YBRefreshCache at relcache.c:5216:3
    frame #5: 0x0000aaaac5feba94 postgres`YBRefreshCache at postgres.c:4442:2
    frame #6: 0x0000aaaac5feb50c postgres`YBRefreshCacheWrapperImpl(catalog_master_version=0, is_retry=false, full_refresh_allowed=true) at postgres.c:4570:3
    frame #7: 0x0000aaaac5feea34 postgres`PostgresMain [inlined] YBRefreshCacheWrapper(catalog_master_version=0, is_retry=false) at postgres.c:4586:9
    frame #8: 0x0000aaaac5feea2c postgres`PostgresMain [inlined] YBCheckSharedCatalogCacheVersion at postgres.c:4951:3
    frame #9: 0x0000aaaac5fee984 postgres`PostgresMain(dbname=<unavailable>, username=<unavailable>) at postgres.c:6574:4
    frame #10: 0x0000aaaac5efe5b4 postgres`BackendRun(port=0x000013d7ffc06400) at postmaster.c:4995:2
    frame #11: 0x0000aaaac5efdd08 postgres`ServerLoop [inlined] BackendStartup(port=0x000013d7ffc06400) at postmaster.c:4701:3
    frame #12: 0x0000aaaac5efdc70 postgres`ServerLoop at postmaster.c:1908:7
    frame #13: 0x0000aaaac5ef8ef8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) at postmaster.c:1562:11
    frame #14: 0x0000aaaac5ddae1c postgres`PostgresServerProcessMain(argc=25, argv=0x000013d7ffe068f0) at main.c:213:3
    frame #15: 0x0000aaaac59dee38 postgres`main + 36
    frame #16: 0x0000ffff9f606340 libc.so.6`__libc_start_call_main + 112
    frame #17: 0x0000ffff9f606418 libc.so.6`__libc_start_main@@GLIBC_2.34 + 152
    frame #18: 0x0000aaaac59ded34 postgres`_start + 52
```
It is related to invalidation messages. The test involves concurrent DDL execution without object
locking.

I added a few logs to help debug this issue.

Test Plan:
(1)
Append to the end of file ./build/latest/postgres/share/postgresql.conf.sample:

```
yb_debug_log_catcache_events=1
log_min_messages=DEBUG1
```

(2) Create a RF-1 cluster
```
./bin/yb-ctl create --rf 1
```

(3) Run the following example via ysqlsh:
```
-- === 1. SETUP ===
DROP TABLE IF EXISTS accounts_timetravel;
CREATE TABLE accounts_timetravel (
  id INT PRIMARY KEY,
  balance INT,
  last_updated TIMESTAMPTZ
);

INSERT INTO accounts_timetravel VALUES (1, 1000, now());

\echo '--- 1. Initial Data (The Past) ---'
SELECT * FROM accounts_timetravel;

-- Wait 2 seconds
SELECT pg_sleep(2);

-- === 2. CAPTURE THE "PAST" HLC TIMESTAMP ===
--
--    *** THIS IS THE FIX ***
--    Get the current time as seconds from the Unix epoch,
--    multiply by 1,000,000 to get microseconds,
--    and cast to a big integer.
--
SELECT (EXTRACT(EPOCH FROM now())*1000000)::bigint AS snapshot_hlc \gset
SELECT :snapshot_hlc;
\echo '--- (Snapshot HLC captured) ---'

SELECT * FROM pg_yb_catalog_version;

-- Wait 2 more seconds
SELECT pg_sleep(2);

-- === 3. UPDATE THE DATA ===
UPDATE accounts_timetravel SET balance = 500, last_updated = now() WHERE id = 1;

\echo '--- 2. New Data (The Present) ---'
SELECT * FROM accounts_timetravel;

CREATE TABLE foo(id int);
-- increment the catalog version
ALTER TABLE foo ADD COLUMN val TEXT;

SELECT * FROM pg_yb_catalog_version;
-- === 4. PERFORM THE TIME-TRAVEL QUERY ===
--
-- Set our 'read_time_guc' variable to the HLC value
--
\set read_time_guc :snapshot_hlc

\echo '--- 3. Time-Travel Read (Querying the Past) ---'
\echo 'Setting yb_read_time to HLC (microseconds):' :read_time_guc

-- This will now be interpolated correctly and will succeed.
SET yb_read_time = :read_time_guc;

-- This query will now correctly read the historical data
SELECT * FROM accounts_timetravel;
SELECT * FROM pg_yb_catalog_version;

-- === 5. CLEANUP ===
RESET yb_read_time;
\echo '--- 4. Back to the Present ---'
SELECT * FROM accounts_timetravel;

DROP TABLE accounts_timetravel;
```

(4) Look at the postgres log for the following samples:

```
2025-11-07 18:31:06.223 UTC [3321231] LOG:  Preloading relcache for database 13524, session user id: 10, yb_read_time: 0
```

```
2025-11-07 18:31:06.303 UTC [3321231] LOG:  Building relcache entry for pg_index (oid 2610) took 785 us
```

```
2025-11-07 18:31:09.265 UTC [3321221] LOG:  Rebuild relcache entry for accounts_timetravel (oid 16384)
```

```
2025-11-07 18:31:09.525 UTC [3321221] LOG:  Delete relcache entry for accounts_timetravel (oid 16384)
```

```
2025-11-07 18:31:14.035 UTC [3321221] DEBUG:  Setting yb_read_time to 1762540271568993
```

```
2025-11-07 18:31:14.037 UTC [3321221] LOG:  Preloading relcache for database 13524, session user id: 13523, yb_read_time: 1762540271568993
```

```
2025-11-07 18:31:14.183 UTC [3321221] DEBUG:  Setting yb_read_time to 0
```
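
When reading these logs, it can help to convert a `yb_read_time` value back to wall-clock time. Step (3) builds the value as microseconds since the Unix epoch, so a small helper along the following lines can decode it. `FormatReadTimeUtc` is a hypothetical name, not part of YugabyteDB, and the sketch assumes a non-negative input.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper: formats a microsecond Unix timestamp (as used for
// yb_read_time in the test plan) as "YYYY-MM-DD HH:MM:SS.uuuuuu" in UTC.
std::string FormatReadTimeUtc(std::int64_t micros_since_epoch) {
  assert(micros_since_epoch >= 0);
  const std::time_t seconds =
      static_cast<std::time_t>(micros_since_epoch / 1000000);
  const int micros = static_cast<int>(micros_since_epoch % 1000000);

  // std::gmtime returns a pointer to static storage, so copy the result.
  const std::tm* utc_ptr = std::gmtime(&seconds);
  if (utc_ptr == nullptr) {
    return std::string();
  }
  const std::tm utc = *utc_ptr;

  char date[32];
  std::strftime(date, sizeof(date), "%Y-%m-%d %H:%M:%S", &utc);

  char out[48];
  std::snprintf(out, sizeof(out), "%s.%06d", date, micros);
  return std::string(out);
}
```

For the sample value 1762540271568993 this yields 2025-11-07 18:31:11.568993, which falls between the rebuild of the relcache entry at 18:31:09 and the `SET yb_read_time` at 18:31:14 in the log lines above.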

Reviewers: kfranz, #db-approvers

Reviewed By: kfranz, #db-approvers

Subscribers: jason, yql

Differential Revision: https://phorge.dev.yugabyte.com/D48114