Improve logging of rebalancer and recovery #586
base: master
Conversation
Force-pushed 6b1057d to 64cc837.
Serpentian
left a comment
These are the comments for the first two commits; more comments are coming later. Thank you for working on this, good logging is crucial and lets us investigate what happened during incidents.
Force-pushed 5a8b3f8 to f5c25f7.
Serpentian
left a comment
Oh, shit. I forgot to send the last message of the review, I'm very sorry.
Force-pushed 04c506f to ccff54f.
Force-pushed 46add65 to a1c095b.
Force-pushed 2a9cea8 to 69ca01a.
Force-pushed 7f9a3f5 to dfcbdb3.
Force-pushed a16c606 to 16b39a2.
Serpentian
left a comment
Thank you a lot! We've done extensive work on making the logs about rebalancing informative. They will greatly assist with incident analysis.
Gerold103
left a comment
Thanks for the patch! Great that you have taken this task. These logs really needed some polishing.
```lua
local function start_bucket_move(src_storage, dest_storage, bucket_id)
    src_storage:exec(function(bucket_id, replicaset_id)
        ivshard.storage.bucket_send(bucket_id, replicaset_id)
    end, {bucket_id, dest_storage:replicaset_uuid()})

    dest_storage:exec(function(bucket_id)
        t.helpers.retrying({timeout = 10}, function()
            t.assert(box.space._bucket:select(bucket_id))
        end)
    end, {bucket_id})
end
```
`bucket_send()` is a blocking function, so `start_bucket_move()` apparently does the entire move or fails, which suggests the name is a bit unfortunate. You also need to check the result of `bucket_send()` to make sure it succeeded.
Also, `t.assert(box.space._bucket:select(bucket_id))` will always pass, because `select()` returns an array, and even an empty array is treated as true by assertions.
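To illustrate why the assertion always passes, here is a minimal standalone sketch; the stricter `t.assert_equals` alternative in the comment assumes luatest's assertion semantics:

```lua
-- In Lua, any table is truthy, including an empty one, so asserting
-- on select()'s return value can never fail:
local rows = {}      -- what select() returns when no tuple matches
assert(rows)         -- passes even though nothing was found

-- A stricter check should assert on the number of returned tuples:
-- t.assert_equals(#box.space._bucket:select(bucket_id), 1)
```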
In the first test we simulate a situation in which the recovery service can't restore any buckets. To reproduce it, we set `ERRINJ_RECEIVE_PARTIALLY` and override the `recovery_bucket_stat` service on the destination storage, `g.replica_2_a`.
Since `ERRINJ_RECEIVE_PARTIALLY` is set on `g.replica_2_a`, the `bucket_send` function on `g.replica_1_a` will always return the error "the bucket is received partially". I propose to split `start_bucket_move` into two separate functions: `start_partial_bucket_move` and `move_bucket`.
`start_partial_bucket_move` will send a bucket from `src_storage` to `dest_storage` and check that the operation ends with the error "the bucket is received partially"; it will also check that the bucket hangs in the "sending" state on `src_storage`. `move_bucket` is a wrapper around `bucket_send` which additionally checks that the call succeeds and that the bucket is transferred to `dest_storage`.
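A rough sketch of the proposed split; the helper names come from the comment above, while the exact assertions and the bucket-state check are assumptions, not the final implementation:

```lua
-- Successful path: bucket_send must succeed and the bucket must
-- appear on the destination storage.
local function move_bucket(src_storage, dest_storage, bucket_id)
    local ok, err = src_storage:exec(function(bucket_id, replicaset_id)
        return ivshard.storage.bucket_send(bucket_id, replicaset_id)
    end, {bucket_id, dest_storage:replicaset_uuid()})
    t.assert_equals(err, nil)
    t.assert(ok)
    dest_storage:exec(function(bucket_id)
        t.helpers.retrying({timeout = 10}, function()
            t.assert(box.space._bucket:get(bucket_id))
        end)
    end, {bucket_id})
end

-- Error-injection path: bucket_send must fail and the bucket must
-- hang in the 'sending' state on the source storage.
local function start_partial_bucket_move(src_storage, dest_storage, bucket_id)
    local ok, err = src_storage:exec(function(bucket_id, replicaset_id)
        return ivshard.storage.bucket_send(bucket_id, replicaset_id)
    end, {bucket_id, dest_storage:replicaset_uuid()})
    t.assert_equals(ok, nil)
    t.assert(err)
    src_storage:exec(function(bucket_id)
        t.assert_equals(box.space._bucket:get(bucket_id).status, 'sending')
    end, {bucket_id})
end
```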
```lua
rebalancer_recovery_group.before_all(function(g)
    global_cfg = vtest.config_new(cfg_template)
    vtest.cluster_new(g, global_cfg)
    g.router = vtest.router_new(g, 'router', global_cfg)
    vtest.cluster_bootstrap(g, global_cfg)
    vtest.cluster_wait_vclock_all(g)

    vtest.cluster_exec_each_master(g, function()
        box.schema.create_space('test_space')
        box.space.test_space:format({
            {name = 'pk', type = 'unsigned'},
            {name = 'bucket_id', type = 'unsigned'},
        })
        box.space.test_space:create_index('primary', {parts = {'pk'}})
        box.space.test_space:create_index(
            'bucket_id', {parts = {'bucket_id'}, unique = false})
    end)
end)
```
The point of having tests like storage_or_router_x_y_z_test.lua is to reuse the replicasets X, Y, and Z without restarting them between test cases. If you need a fresh cluster, it doesn't make much sense to put the test into these reusable files.
But why do you even need a new group? Wasn't the existing cluster enough to test this commit (the first one)?
fixed
```lua
--
-- Improve logging of rebalancer and recovery (gh-212).
--
```
I suggest being more specific. "Improve" is an extremely generic term: it can mean reformatting of the messages, removal, or new details being added. What improvement does this test case check?
fixed
vshard/storage/init.lua (outdated)
```lua
        table.insert(recovered_buckets['SENT'], bucket_id)
    elseif recovery_local_bucket_is_garbage(bucket, remote_bucket) then
        _bucket:update({bucket_id}, {{'=', 2, BGARBAGE}})
        recovered = recovered + 1
        table.insert(recovered_buckets['SENT'], bucket_id)
```
You have SENT twice. Isn't the second supposed to be GARBAGE? Also, in Lua you should prefer `table.key` over `table['key']`; the former is shorter and quite easier to read, would you agree?
fixed
```diff
-    log.info('Finish bucket recovery step, %d %s buckets are recovered '..
-             'among %d', recovered, type, total)
+    log.info('Finish bucket recovery step, %d %s buckets are recovered '..
+             'among %d. Recovered buckets: %s', recovered, type, total,
+             json_encode(recovered_buckets))
```
JSON will encode even empty arrays, right? So the output will also include things like `... "SENT": [] ...`, and my guess is that we don't need it. It might be more robust against copy-paste mistakes, and could make the output even nicer, if you didn't hardcode the status names in this table and instead did something like:
```lua
local function save_recovered(dict, id, status)
    local ids = dict[status] or {}
    table.insert(ids, id)
    dict[status] = ids
end
...
if recovery_local_bucket_is_sent(bucket, remote_bucket) then
    local status = BSENT
    _bucket:update({bucket_id}, {{'=', 2, status}})
    recovered = recovered + 1
    save_recovered(recovered_buckets, bucket_id, status)
elseif recovery_local_bucket_is_garbage(bucket, remote_bucket) then
    local status = BGARBAGE
    _bucket:update({bucket_id}, {{'=', 2, status}})
    recovered = recovered + 1
    save_recovered(recovered_buckets, bucket_id, status)
...
```
fixed
```diff
 if not M.is_rebalancer_active or rebalancing_is_in_progress() then
-    return
+    return nil, lerror.make('Rebalancer is not active or is in progress')
 end
 local _bucket = box.space._bucket
 local status_index = _bucket.index.status
 local repl_id = M.this_replica and M.this_replica.id or '_'
 if #status_index:select({BSENDING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'sending')
 end
 if #status_index:select({BRECEIVING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'receiving')
 end
 if #status_index:select({BGARBAGE}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'garbage')
 end
```
The prefix BUCKET_ suggests that this is a bucket-centric error, as if one specific bucket is bad, isn't found, or the like. But this is actually a storage-centric or rebalancer-centric error: it signals that something can't be done because of the storage's state.
Also, we already have a generic code-less error above: 'Rebalancer is not active or is in progress'.
I suggest not inventing anything new here and just returning `lerror.make()` again with a formatted string. You never use this new error code for anything except tests anyway.
> rebalancer-centric error

I renamed the error to REBALANCER_INVALID_STATE: "Replica %s has %s buckets during rebalancing".

> I suggest not to invent anything new here

If we use `lerror.make` to log errors about a non-active rebalancer's buckets, we can run into an issue. I have already discussed this issue with @Serpentian, and we decided to keep the new errcode. Maybe in this thread we will come to a different decision.
```lua
end
local _bucket = box.space._bucket
local status_index = _bucket.index.status
local repl_id = M.this_replica and M.this_replica.id or '_'
```
How can M.this_replica or its .id field be nil?
`M.this_replica` or its `id` can be nil in unit tests (e.g. unit/rebalancer) where we don't initialize the replica object.
vshard/storage/init.lua (outdated)
```diff
 if #status_index:select({BSENDING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'sending')
 end
 if #status_index:select({BRECEIVING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'receiving')
 end
 if #status_index:select({BGARBAGE}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'garbage')
 end
```
The amount of duplicated code is growing dangerously. This whole thing could be turned into a loop over an array of these three states. Let's do that.
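A minimal sketch of the loop this comment suggests; the table of state pairs is illustrative, not the final implementation:

```lua
-- Check all three non-active bucket states in one loop instead of
-- three copy-pasted if-blocks. Each entry pairs a status constant
-- with the name used in the error message.
local NON_ACTIVE_STATES = {
    {BSENDING, 'sending'},
    {BRECEIVING, 'receiving'},
    {BGARBAGE, 'garbage'},
}
for _, state in ipairs(NON_ACTIVE_STATES) do
    local status, name = state[1], state[2]
    if #status_index:select({status}, {limit = 1}) > 0 then
        return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
                                  repl_id, name)
    end
end
```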
fixed
Force-pushed 16b39a2 to 9623c34.
Before this patch, "Finish bucket recovery step ..." logs were printed at the end of recovery even if no buckets were successfully recovered, which led to unnecessary log records. This patch fixes the issue by adding an additional check for the number of recovered buckets. Part of tarantool#212 NO_DOC=bugfix
This patch introduces logging of the ids of buckets that were recovered during the storage's recovery stage. Part of tarantool#212 NO_DOC=bugfix
This patch adds logging of rebalancer routes. The log now includes the source storage, the number of buckets, and the destination storage to which the buckets will be moved. Since the rebalancer service changed how sent routes are logged, we update the `rebalancer/rebalancer.test.lua` and `rebalancer/stress_add_remove_several_rs.test.lua` tests. Part of tarantool#212 NO_DOC=bugfix
Before this patch, the function `rebalancer_request_state` returned only nil in case of errors (e.g. the presence of SENDING, RECEIVING, or GARBAGE buckets, or a non-active rebalancer), which made it inconsistent with other rebalancer functions. Now, in case of errors, we return `(nil, err)` instead of `nil`, which helps propagate a meaningful error in the rebalancer service. In addition, we introduce a new vshard error, `BUCKET_INVALID_STATE`, for the case when the storage has non-active buckets during rebalancing. We also remove the "Some buckets are not active" log from `rebalancer_service_f` because a meaningful error about non-active buckets is already returned from `rebalancer_download_states`. NO_TEST=refactoring NO_DOC=refactoring
Before this patch, the function `rebalancer_download_states` didn't return information about the replicaset from which the states could not be downloaded. As a result, the errors returned from `rebalancer_download_states` lacked valuable information about the unhealthy replicaset. Now we add replicaset.id to the error returned from `rebalancer_download_states`, which helps propagate replicaset.id to the rebalancer service. Closes tarantool#212 NO_DOC=bugfix
Force-pushed 9623c34 to 62c138e.
Before this patch, "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered, which
led to unnecessary log entries. This patch fixes the issue by adding an
additional check for the number of recovered buckets.
Closes #212
NO_DOC=bugfix