flux-fsck: support --job-aware #7194

chu11 · 2025-11-10T20:30:17Z

Problem: A single corrupted entry in a job directory will effectively make all data in the job directory unusable (i.e. if one piece of data is corrupted, other uncorrupted data may not be usable). The --repair option only moves the corrupted data to the lost+found, leaving the uncorrupted data in place. This can affect several job-related modules that expect specific job data to always be available.

Support a new --job-aware option to flux-fsck. In concert with the --repair option, if any data in a job directory is corrupted, move all contents of the job directory to the lost+found.

Fixes #7121
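The difference between plain --repair and --repair with --job-aware can be sketched roughly as follows. This is only an illustrative model, not the code in fsck.c: the helper names `job_dir()` and `plan_repair()` are hypothetical, the flat dict of keys stands in for the real treeobj walk over the content store, and the `job.XXXX.XXXX.XXXX.XXXX` key layout is an assumption made for the example.

```python
def job_dir(key):
    """Return the job directory a key lives under, or None.

    Assumption for this sketch: job data is stored under keys shaped
    like job.XXXX.XXXX.XXXX.XXXX.<name>.
    """
    parts = key.split(".")
    if len(parts) > 5 and parts[0] == "job":
        return ".".join(parts[:5])
    return None


def plan_repair(entries, job_aware=False):
    """Decide which keys would be moved to lost+found.

    entries maps each key to True if it was found corrupted.
    Without job_aware only the corrupted keys are moved. With
    job_aware, every key that shares a job directory with a
    corrupted key is moved as well.
    """
    moved = {key for key, corrupted in entries.items() if corrupted}
    if job_aware:
        # Job directories that contain at least one corrupted entry.
        bad_dirs = {job_dir(key) for key in moved} - {None}
        moved |= {key for key in entries if job_dir(key) in bad_dirs}
    return sorted(moved)


entries = {
    "job.0000.0001.a000.0000.eventlog": True,   # corrupted
    "job.0000.0001.a000.0000.jobspec": False,   # same job, intact
    "job.0000.0002.b000.0000.eventlog": False,  # different job
    "resource.eventlog": False,                 # not job data
}
print(plan_repair(entries))
print(plan_repair(entries, job_aware=True))
```

In this model the intact jobspec of the damaged job is relocated only when `job_aware` is set, while other jobs and non-job keys are left alone in both modes.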

Problem: Some code unnecessarily breaks up function parameters onto multiple lines, which does not conform to current coding patterns. If a line of code is clearly < 80 chars, do not break up function parameters onto multiple lines.

Problem: The function put_valref_lost_and_found() could be used far more generally, but is currently isolated to repaired valref treeobjs. Generalize the function and rename it to put_lost_and_found().

Problem: A single corrupted entry in a job directory will effectively make all data in the job directory unusable (i.e. if one piece of data is corrupted, other uncorrupted data may not be usable). The --repair option only moves the corrupted data to the lost+found, leaving the uncorrupted data in place. This can affect several job-related modules that expect specific job data to always be available. Support a new --job-aware option to flux-fsck. In concert with the --repair option, if any data in a job directory is corrupted, move all contents of the job directory to the lost+found. Fixes flux-framework#7121

Problem: The new flux-fsck --job-aware option is not documented. Add documentation to flux-fsck(1).

Problem: There is no test coverage for the new --job-aware option in flux-fsck. Add coverage in t2816-fsck-cmd.t.

garlick · 2025-11-12T14:43:39Z

Any thoughts about abstracting some of these functions out into a private library (or private portion of libkvs) that could be used by multiple offline KVS tools or the KVS itself?

It'd be nice to shrink the volume of code in fsck.c and have unit tests for some of the functions in here.

(I'm just asking - maybe that's not practical)

chu11 · 2025-11-12T18:21:33Z

> Any thoughts about abstracting some of these functions out into a private library (or private portion of libkvs) that could be used by multiple offline KVS tools or the KVS itself?

In an earlier iteration I did ponder this. I can't remember the specific reasons why, but the flux-fsck needs were (unsurprisingly) simpler than the KVS module needs, and things didn't seem to line up. But as flux-fsck grows and advances, it certainly is something worth re-visiting. I'll put it in a TODO item.

Edit: and as you mention, splicing things out into a lib just for unit testing does seem like a good idea

chu11 · 2025-12-02T21:01:48Z

> splicing things out into a lib just for unit testing does seem like a good idea

So I began to look into splicing out some of the "offline KVS" activities into a utility lib, but as I began working on it, I think its benefit shrank and I no longer think it's worth it. A lot of that code is quite "fsck" specific ... it can be generalized some, but there's a limit to the benefit by generalizing it too much.

But the bigger issue is that, b/c we have to test with the content store, "unit" testing will involve:

- writing some tool that uses this convenience library
- loading up a flux broker w/ content-sqlite loaded
- temporarily loading the KVS, so we can put some test data into the KVS
- then running tests

Suddenly, the "unit testing" of this convenience library is basically looking exactly like flux-fsck testing.

If there comes a time this could be useful between two different tools, I think we can revisit.

chu11 added 5 commits November 10, 2025 10:13

flux-fsck: do not break up function parameters

9c67088

Problem: Some code unnecessarily breaks up function parameters onto multiple lines, which does not conform to current coding patterns. If a line of code is clearly < 80 chars, do not break up function parameters onto multiple lines.

flux-fsck: generalize put_valref_lost_and_found()

bcb6a2b

Problem: The function put_valref_lost_and_found() could be used far more generally, but is currently isolated to repaired valref treeobjs. Generalize the function and rename it to put_lost_and_found().

doc: document flux-fsck --job-aware

359d677

Problem: The new flux-fsck --job-aware option is not documented. Add documentation to flux-fsck(1).

t: cover flux-fsck --job-aware

5496bed

Problem: There is no test coverage for the new --job-aware option in flux-fsck. Add coverage in t2816-fsck-cmd.t.

chu11 mentioned this pull request Nov 10, 2025

rc1: use flux-fsck --job-aware option #7196

Open
