flux-fsck: support --job-aware #7194
Problem: Some code breaks function parameters across multiple lines where that is not necessary, which does not conform to current coding patterns. If a line of code is clearly < 80 chars, do not break up function parameters onto multiple lines.
Problem: The function put_valref_lost_and_found() could be used far more generally, but is currently isolated to repaired valref treeobjs. Generalize the function and rename it to put_lost_and_found().
Problem: A single corrupted entry in a job directory will effectively make all data in the job directory unusable (i.e. if one piece of data is corrupted, other uncorrupted data may not be usable). The --repair option only moves the corrupted data to the lost+found, leaving the uncorrupted data in place. This can affect several job-related modules that expect specific job data to always be available. Support a new --job-aware option to flux-fsck. In concert with the --repair option, if any data in a job directory is corrupted, move all contents of the job directory to the lost+found. Fixes flux-framework#7121
Problem: The new flux-fsck --job-aware option is not documented. Add documentation to flux-fsck(1).
Problem: There is no test coverage for the new --job-aware option in flux-fsck. Add coverage in t2816-fsck-cmd.t.
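To make the --job-aware behavior described above concrete, here is a minimal sketch of how a repair pass might decide which directory to move once it finds a corrupted key. This is not the code from this PR. The helper name `job_dir_of_key()` is hypothetical, and the key layout it checks (`job.` followed by four dot-separated groups of four hex digits) is an assumption about how job directories are named in the KVS.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: a job directory key looks like "job.XXXX.XXXX.XXXX.XXXX",
 * where each X is a hex digit.  That is 4 + 16 + 3 = 23 characters.
 */
#define JOB_DIR_LEN 23

/* Hypothetical helper, not part of flux-fsck.
 * If 'key' names a job directory, or any entry beneath one, copy the
 * job directory key into 'buf' and return true.  Otherwise return false
 * and leave 'buf' untouched.
 */
bool job_dir_of_key (const char *key, char *buf, size_t bufsz)
{
    const char *p;

    if (!key || !buf || bufsz < JOB_DIR_LEN + 1)
        return false;
    if (strncmp (key, "job.", 4) != 0)
        return false;
    p = key + 4;
    for (int group = 0; group < 4; group++) {
        for (int i = 0; i < 4; i++) {
            /* isxdigit ('\0') is false, so short keys stop here safely */
            if (!isxdigit ((unsigned char)*p))
                return false;
            p++;
        }
        if (group < 3) {
            if (*p != '.')
                return false;
            p++;
        }
    }
    /* The key must either end here (the directory itself) or continue
     * with ".name" (an entry inside the directory).
     */
    if (*p != '\0' && *p != '.')
        return false;
    memcpy (buf, key, JOB_DIR_LEN);
    buf[JOB_DIR_LEN] = '\0';
    return true;
}
```

Under these assumptions, a repair pass that hits a corrupted key such as `job.0000.0004.b200.0000.eventlog` would get back `job.0000.0004.b200.0000` and could move that whole directory to the lost+found. A key outside any job directory would fall back to the existing per-key --repair handling.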
Any thoughts about abstracting some of these functions out into a private library (or private portion of libkvs) that could be used by multiple offline KVS tools or the KVS itself? It'd be nice to shrink the volume of code in fsck.c and have unit tests for some of the functions in here. (I'm just asking - maybe that's not practical)
In an earlier iteration I did ponder this. I can't remember the specific reasons why, but the Edit: and as you mention, splicing things out into a lib just for unit testing does seem like a good idea
So I began to look into splicing out some of the "offline KVS" activities into a utility lib, but as I began working on it, I think its benefit shrank and I no longer think it's worth it. A lot of that code is quite "fsck" specific ... it can be generalized some, but there's a limit to the benefit by generalizing it too much. But the bigger issue is b/c we have to test with the content store, "unit" testing will involve
Suddenly, the "unit testing" of this convenience library is basically looking exactly like If there comes a time this could be useful between two different tools, I think we can revisit.