Commit 5d402b5
[yugabyte#27697] Docdb: Handle Coredump in master due to stack-overflow during index backfill.
Summary:
In the last few weeks we have seen a few instances of the stress test (with various nemeses)
run into a master crash whose stack trace looks like:
```
* thread #1, name = 'yb-master', stop reason = signal SIGSEGV: invalid address
* frame #0: 0x0000aaaad52f5fc4 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::shared_ptr<yb::master::BackfillTablet>::shared_ptr[abi:ue170006]<yb::master::BackfillTablet, void>(this=<unavailable>, __r=std::__1::weak_ptr<yb::master::BackfillTablet>::element_type @ 0x000013e4bf787778) at shared_ptr.h:701:20
frame #1: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] std::__1::enable_shared_from_this<yb::master::BackfillTablet>::shared_from_this[abi:ue170006](this=0x000013e4bf787778) at shared_ptr.h:1954:17
frame #2: 0x0000aaaad52f5fbc yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=0x000013e4bf787778) at backfill_index.cc:1300:50
frame #3: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
frame #4: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4d458) at backfill_index.cc:1620:5
frame #5: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:470:3
frame #6: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4d458) at async_rpc_tasks.cc:273:5
frame #7: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4d458) at backfill_index.cc:1463:19
frame #8: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
frame #9: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
frame #10: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cd98) at backfill_index.cc:1620:5
frame #11: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:470:3
frame #12: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cd98) at async_rpc_tasks.cc:273:5
frame #13: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cd98) at backfill_index.cc:1463:19
frame #14: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
frame #15: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
frame #16: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bbd4cfd8) at backfill_index.cc:1620:5
frame #17: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:470:3
frame #18: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bbd4cfd8) at async_rpc_tasks.cc:273:5
frame #19: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bbd4cfd8) at backfill_index.cc:1463:19
frame #20: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
frame #21: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
...
frame #2452: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4bdc7ed98) at backfill_index.cc:1620:5
frame #2453: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:470:3
frame #2454: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4bdc7ed98) at async_rpc_tasks.cc:273:5
frame #2455: 0x0000aaaad52f63f0 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone() [inlined] yb::master::BackfillChunk::Launch(this=0x000013e4bdc7ed98) at backfill_index.cc:1463:19
frame #2456: 0x0000aaaad52f6324 yb-master`yb::master::BackfillTablet::LaunchNextChunkOrDone(this=<unavailable>) at backfill_index.cc:1303:19
frame #2457: 0x0000aaaad52fb0d4 yb-master`yb::master::BackfillTablet::Done(this=0x000013e4bf787778, status=<unavailable>, backfilled_until=<unavailable>, number_rows_processed=<unavailable>, failed_indexes=<unavailable>) at backfill_index.cc:1323:10
frame #2458: 0x0000aaaad52f9dd8 yb-master`yb::master::BackfillChunk::UnregisterAsyncTaskCallback(this=0x000013e4ba1ff458) at backfill_index.cc:1620:5
frame #2459: 0x0000aaaad52be9e0 yb-master`yb::master::RetryingRpcTask::UnregisterAsyncTask(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:470:3
frame #2460: 0x0000aaaad52bd4d8 yb-master`yb::master::RetryingRpcTask::Run(this=0x000013e4ba1ff458) at async_rpc_tasks.cc:273:5
frame #2461: 0x0000aaaad52c0260 yb-master`yb::master::RetryingRpcTask::RunDelayedTask(this=0x000013e4ba1ff458, status=0x0000ffffab2668c0) at async_rpc_tasks.cc:432:14
frame #2462: 0x0000aaaad5c3f838 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] boost::function1<void, yb::Status const&>::operator()(this=0x000013e4bff63b18, a0=0x0000ffffab2668c0) const at function_template.hpp:763:14
frame #2463: 0x0000aaaad5c3f81c yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(ev_loop*, ev_timer*, int) [inlined] yb::rpc::DelayedTask::TimerHandler(this=0x000013e4bff63ae8, watcher=<unavailable>, revents=<unavailable>) at delayed_task.cc:155:5
frame #2464: 0x0000aaaad5c3f284 yb-master`void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::DelayedTask, &yb::rpc::DelayedTask::TimerHandler(ev::timer&, int)>(loop=<unavailable>, w=<unavailable>, revents=<unavailable>) at ev++.h:479:7
frame #2465: 0x0000aaaad4cdf170 yb-master`ev_invoke_pending + 112
frame #2466: 0x0000aaaad4ce21fc yb-master`ev_run + 2940
frame #2467: 0x0000aaaad5c725fc yb-master`yb::rpc::Reactor::RunThread() [inlined] ev::loop_ref::run(this=0x000013e4bfcfadf8, flags=0) at ev++.h:211:7
frame #2468: 0x0000aaaad5c725f4 yb-master`yb::rpc::Reactor::RunThread(this=0x000013e4bfcfadc0) at reactor.cc:735:9
frame #2469: 0x0000aaaad65c61d8 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000013e4bfeffa80) const at function.h:517:16
frame #2470: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000013e4bfeffa80) const at function.h:1168:12
frame #2471: 0x0000aaaad65c61c4 yb-master`yb::Thread::SuperviseThread(arg=0x000013e4bfeffa20) at thread.cc:895:3
```
Essentially, a BackfillChunk is considered done without sending out an RPC and launches the next BackfillChunk, which does the same, so each chunk adds frames to the same stack until it overflows.
This may happen if `BackfillTable::indexes_to_build()` is empty, or if `backfill_jobs()` is empty. However, from reading the code,
we should only get there **after** marking `BackfillTable::done_` as `true`.
If for some reason `indexes_to_build()` is empty while `BackfillTable::done_ == false`, we can get into this infinite recursion.
Since I am unable to explain or recreate how this happens, I'm adding a test flag `TEST_simulate_empty_indexes` to reproduce it.
Fix: Update `BackfillChunk::SendRequest` to treat an empty `indexes_to_build()` as a failure rather than a success.
This prevents the infinite recursion.
Also add a few log lines that may help us better understand the scenario if we run into this again.
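The recursion and the fix described above can be illustrated with a small toy model. This is only a sketch and not the actual YugabyteDB code: `ToyBackfillTablet`, `ChunkResult`, `treat_empty_as_failure`, `chunks_remaining` and `MaxDepth` are hypothetical names, and the toy caps the number of chunks so that the old behavior terminates instead of overflowing the stack.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only; every name here is hypothetical and not taken from backfill_index.cc.
enum class ChunkResult { kSent, kCompletedInline, kFailed };

struct ToyBackfillTablet {
  std::vector<std::string> indexes_to_build;  // left empty to model the bad state
  bool treat_empty_as_failure = false;        // false: old behavior, true: fixed behavior
  std::size_t chunks_remaining = 0;           // artificial cap so the toy always terminates
  std::size_t depth = 0;                      // current nesting of LaunchNextChunkOrDone
  std::size_t max_depth = 0;                  // deepest nesting observed
  bool failed = false;

  // Stands in for the SendRequest step: nothing to build means no RPC goes out.
  ChunkResult SendRequest() const {
    if (indexes_to_build.empty()) {
      return treat_empty_as_failure ? ChunkResult::kFailed
                                    : ChunkResult::kCompletedInline;
    }
    return ChunkResult::kSent;
  }

  void LaunchNextChunkOrDone() {
    if (failed || chunks_remaining == 0) return;
    --chunks_remaining;
    ++depth;
    if (depth > max_depth) max_depth = depth;
    switch (SendRequest()) {
      case ChunkResult::kSent:
        break;  // a real task would return here and continue from the RPC callback
      case ChunkResult::kCompletedInline:
        LaunchNextChunkOrDone();  // "done" on the same stack, so the next chunk nests deeper
        break;
      case ChunkResult::kFailed:
        failed = true;  // stop launching further chunks
        break;
    }
    --depth;
  }
};

// Returns the deepest nesting reached for a tablet with no indexes to build.
std::size_t MaxDepth(bool treat_empty_as_failure, std::size_t chunks) {
  ToyBackfillTablet tablet;
  tablet.treat_empty_as_failure = treat_empty_as_failure;
  tablet.chunks_remaining = chunks;
  tablet.LaunchNextChunkOrDone();
  return tablet.max_depth;
}
```

In this toy, `MaxDepth(false, 500)` nests 500 calls deep (one per chunk), while `MaxDepth(true, 500)` stops at depth 1 because the first chunk is reported as failed. In the real crash there was no such cap, which is why the stack kept growing.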
Jira: DB-17296
Test Plan: yb_build.sh fastdebug --cxx-test pg_index_backfill-test --gtest_filter *.SimulateEmptyIndexesForStackOverflow*
Reviewers: zdrudi, rthallam, jason
Reviewed By: zdrudi
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D45031
1 parent e2f360a commit 5d402b5
2 files changed (+41, -3) under src/yb/master and src/yb/yql/pgwrapper.