Improve logging of rebalancer and recovery #586
base: master
Conversation
Force-pushed 6b1057d to 64cc837.
Serpentian
left a comment
These are the comments for the first two commits; more comments are coming later. Thank you for working on this, good logging is crucial and lets us investigate what happened during incidents.
Force-pushed 5a8b3f8 to f5c25f7.
Serpentian
left a comment
Oh, shit. I forgot to send the last message of the review, I'm very sorry.
Force-pushed 04c506f to ccff54f.
Force-pushed 46add65 to a1c095b.
Force-pushed 2a9cea8 to 69ca01a.
Force-pushed 7f9a3f5 to dfcbdb3.
Force-pushed a16c606 to 16b39a2.
Serpentian
left a comment
Thank you a lot! We've done extensive work on making the logs about rebalancing informative. They will greatly assist with incident analysis.
Gerold103
left a comment
Thanks for the patch! Great that you have taken this task. These logs really needed some polishing.
```lua
local function start_bucket_move(src_storage, dest_storage, bucket_id)
    src_storage:exec(function(bucket_id, replicaset_id)
        ivshard.storage.bucket_send(bucket_id, replicaset_id)
    end, {bucket_id, dest_storage:replicaset_uuid()})

    dest_storage:exec(function(bucket_id)
        t.helpers.retrying({timeout = 10}, function()
            t.assert(box.space._bucket:select(bucket_id))
        end)
    end, {bucket_id})
end
```
`bucket_send()` is a blocking function, so `start_bucket_move()` apparently does the entire move or fails, which suggests the name is a bit unfortunate. You also need to check the result of `bucket_send()` to make sure it succeeded.
Also, `t.assert(box.space._bucket:select(bucket_id))` will always pass, because `select()` returns an array, and even an empty array is treated as true by assertions.
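To illustrate why the assertion always passes, here is a minimal standalone sketch; the stricter `t.assert_equals` alternative in the comment assumes luatest's assertion semantics:

```lua
-- In Lua, any table is truthy, including an empty one, so asserting
-- on select()'s return value can never fail:
local rows = {}      -- what select() returns when no tuple matches
assert(rows)         -- passes even though nothing was found

-- A stricter check should assert on the number of returned tuples:
-- t.assert_equals(#box.space._bucket:select(bucket_id), 1)
```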
In the first test we simulate a situation in which the recovery service can't restore any buckets. To reproduce it, we set `ERRINJ_RECEIVE_PARTIALLY` and override the `recovery_bucket_stat` service on the destination storage, `g.replica_2_a`.
Since `ERRINJ_RECEIVE_PARTIALLY` is set on `g.replica_2_a`, the `bucket_send` function on `g.replica_1_a` will always return the error "the bucket is received partially". I propose to split `start_bucket_move` into two separate functions: `start_partial_bucket_move` and `move_bucket`.
`start_partial_bucket_move` will send a bucket from `src_storage` to `dest_storage` and check that the operation ends with the error "the bucket is received partially"; it will also check that the bucket hangs in the "sending" state on `src_storage`. `move_bucket` is a wrapper around `bucket_send` which additionally checks that the call succeeds and that the bucket is transferred to `dest_storage`.
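A rough sketch of the proposed split; the helper names come from the comment above, while the exact assertions and the bucket-state check are assumptions, not the final implementation:

```lua
-- Successful path: bucket_send must succeed and the bucket must
-- appear on the destination storage.
local function move_bucket(src_storage, dest_storage, bucket_id)
    local ok, err = src_storage:exec(function(bucket_id, replicaset_id)
        return ivshard.storage.bucket_send(bucket_id, replicaset_id)
    end, {bucket_id, dest_storage:replicaset_uuid()})
    t.assert_equals(err, nil)
    t.assert(ok)
    dest_storage:exec(function(bucket_id)
        t.helpers.retrying({timeout = 10}, function()
            t.assert(box.space._bucket:get(bucket_id))
        end)
    end, {bucket_id})
end

-- Error-injection path: bucket_send must fail and the bucket must
-- hang in the 'sending' state on the source storage.
local function start_partial_bucket_move(src_storage, dest_storage, bucket_id)
    local ok, err = src_storage:exec(function(bucket_id, replicaset_id)
        return ivshard.storage.bucket_send(bucket_id, replicaset_id)
    end, {bucket_id, dest_storage:replicaset_uuid()})
    t.assert_equals(ok, nil)
    t.assert(err)
    src_storage:exec(function(bucket_id)
        t.assert_equals(box.space._bucket:get(bucket_id).status, 'sending')
    end, {bucket_id})
end
```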
```lua
rebalancer_recovery_group.before_all(function(g)
    global_cfg = vtest.config_new(cfg_template)
    vtest.cluster_new(g, global_cfg)
    g.router = vtest.router_new(g, 'router', global_cfg)
    vtest.cluster_bootstrap(g, global_cfg)
    vtest.cluster_wait_vclock_all(g)

    vtest.cluster_exec_each_master(g, function()
        box.schema.create_space('test_space')
        box.space.test_space:format({
            {name = 'pk', type = 'unsigned'},
            {name = 'bucket_id', type = 'unsigned'},
        })
        box.space.test_space:create_index('primary', {parts = {'pk'}})
        box.space.test_space:create_index(
            'bucket_id', {parts = {'bucket_id'}, unique = false})
    end)
end)
```
The point of having tests like storage_or_router_x_y_z_test.lua is to reuse the replicasets X, Y, and Z without restarting them between test cases. If you need a fresh cluster, it doesn't make much sense to put the test into these reusable files.
But why do you even need a new group? Wasn't the existing cluster enough to test this commit (the first one)?
fixed
```lua
--
-- Improve logging of rebalancer and recovery (gh-212).
--
```
I suggest being more specific. "Improve" is an extremely generic term: it can mean reformatting of the messages, removal, or new details being added. What improvement does this test case check?
fixed
vshard/storage/init.lua (outdated)
```lua
        table.insert(recovered_buckets['SENT'], bucket_id)
    elseif recovery_local_bucket_is_garbage(bucket, remote_bucket) then
        _bucket:update({bucket_id}, {{'=', 2, BGARBAGE}})
        recovered = recovered + 1
        table.insert(recovered_buckets['SENT'], bucket_id)
```
You have SENT twice. Isn't the second supposed to be GARBAGE? Also, in Lua you should prefer `table.key` over `table['key']`; the former is shorter and quite easier to read, would you agree?
fixed
```diff
-    log.info('Finish bucket recovery step, %d %s buckets are recovered '..
-             'among %d', recovered, type, total)
+    log.info('Finish bucket recovery step, %d %s buckets are recovered '..
+             'among %d. Recovered buckets: %s', recovered, type, total,
+             json_encode(recovered_buckets))
```
JSON will encode even empty arrays, right? So the output will also include things like `... "SENT": [] ...`, and my guess is that we don't need it. It might be more robust against copy-paste mistakes, and could make the output even nicer, if you didn't hardcode the status names in this table and instead did something like:
```lua
local function save_recovered(dict, id, status)
    local ids = dict[status] or {}
    table.insert(ids, id)
    dict[status] = ids
end
...
if recovery_local_bucket_is_sent(bucket, remote_bucket) then
    local status = BSENT
    _bucket:update({bucket_id}, {{'=', 2, status}})
    recovered = recovered + 1
    save_recovered(recovered_buckets, bucket_id, status)
elseif recovery_local_bucket_is_garbage(bucket, remote_bucket) then
    local status = BGARBAGE
    _bucket:update({bucket_id}, {{'=', 2, status}})
    recovered = recovered + 1
    save_recovered(recovered_buckets, bucket_id, status)
...
```
fixed
```diff
 if not M.is_rebalancer_active or rebalancing_is_in_progress() then
-    return
+    return nil, lerror.make('Rebalancer is not active or is in progress')
 end
 local _bucket = box.space._bucket
 local status_index = _bucket.index.status
 local repl_id = M.this_replica and M.this_replica.id or '_'
 if #status_index:select({BSENDING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'sending')
 end
 if #status_index:select({BRECEIVING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'receiving')
 end
 if #status_index:select({BGARBAGE}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'garbage')
 end
```
The prefix BUCKET_ suggests that this is a bucket-centric error, as if one specific bucket is bad, isn't found, or the like. But this is actually a storage-centric or rebalancer-centric error: it signals that something can't be done because of the storage's state.
Also, we already have a generic code-less error above: 'Rebalancer is not active or is in progress'.
I suggest not inventing anything new here and just returning `lerror.make()` again with a formatted string. You never use this new error code for anything except tests anyway.
> rebalancer-centric error

I renamed the error to REBALANCER_INVALID_STATE: "Replica %s has %s buckets during rebalancing".

> I suggest not to invent anything new here

If we use `lerror.make` to log errors about a non-active rebalancer's buckets, we can run into an issue. I have already discussed this issue with @Serpentian, and we decided to keep the new errcode. Maybe in this thread we will come to a different decision.
```lua
end
local _bucket = box.space._bucket
local status_index = _bucket.index.status
local repl_id = M.this_replica and M.this_replica.id or '_'
```
How can M.this_replica or its .id field be nil?
`M.this_replica` or its `id` can be nil in unit tests (e.g. unit/rebalancer) where we don't initialize the replica object.
vshard/storage/init.lua (outdated)
```diff
 if #status_index:select({BSENDING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'sending')
 end
 if #status_index:select({BRECEIVING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'receiving')
 end
 if #status_index:select({BGARBAGE}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'garbage')
 end
```
The amount of duplicated code is growing dangerously. This whole thing could be turned into a loop over an array of these three states. Let's do that.
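A minimal sketch of the loop this comment suggests; the table of state pairs is illustrative, not the final implementation:

```lua
-- Check all three non-active bucket states in one loop instead of
-- three copy-pasted if-blocks. Each entry pairs a status constant
-- with the name used in the error message.
local NON_ACTIVE_STATES = {
    {BSENDING, 'sending'},
    {BRECEIVING, 'receiving'},
    {BGARBAGE, 'garbage'},
}
for _, state in ipairs(NON_ACTIVE_STATES) do
    local status, name = state[1], state[2]
    if #status_index:select({status}, {limit = 1}) > 0 then
        return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
                                  repl_id, name)
    end
end
```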
fixed
Force-pushed 16b39a2 to 9623c34.
Before this patch, "Finish bucket recovery step ..." logs were printed at the end of recovery even if no buckets were successfully recovered, which led to unnecessary log records. This patch fixes the issue by adding an additional check for the number of recovered buckets. Part of tarantool#212 NO_DOC=bugfix
This patch introduces logging of the ids of buckets that were recovered during the storage's recovery stage. Part of tarantool#212 NO_DOC=bugfix
This patch adds logging of rebalancer routes. The log now includes the source storage, the number of buckets, and the destination storage to which the buckets will be moved. Since the rebalancer service changed how sent routes are logged, we update the `rebalancer/rebalancer.test.lua` and `rebalancer/stress_add_remove_several_rs.test.lua` tests. Part of tarantool#212 NO_DOC=bugfix
Before this patch, the function `rebalancer_request_state` returned only nil in case of errors (e.g. the presence of SENDING, RECEIVING, or GARBAGE buckets, or a non-active rebalancer), which made it inconsistent with other rebalancer functions. Now, in case of errors, we return `(nil, err)` instead of `nil`, which helps propagate a meaningful error in the rebalancer service. In addition, we introduce a new vshard error, `BUCKET_INVALID_STATE`, for the case when the storage has non-active buckets during rebalancing. We also remove the "Some buckets are not active" log from `rebalancer_service_f` because a meaningful error about non-active buckets is already returned from `rebalancer_download_states`. NO_TEST=refactoring NO_DOC=refactoring
Before this patch, the function `rebalancer_download_states` didn't return information about the replicaset from which the states could not be downloaded. As a result, the errors returned from `rebalancer_download_states` lacked valuable information about the unhealthy replicaset. Now we add replicaset.id to the error returned from `rebalancer_download_states`, which helps propagate replicaset.id to the rebalancer service. Closes tarantool#212 NO_DOC=bugfix
Force-pushed 9623c34 to 62c138e.
Before this patch, "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered, which
led to unnecessary log entries. This patch fixes the issue by adding an
additional check for the number of recovered buckets.
Closes #212
NO_DOC=bugfix