-
Notifications
You must be signed in to change notification settings - Fork 125
feat: Add CheckpointManifestReader to process sidecar files #1500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| )) | ||
| ); | ||
|
|
||
| let sidecars: Vec<_> = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotation shouldn't be necessary here because AfterManifest::Sidecars forces it
| let sidecars: Vec<_> = self | |
| let sidecars = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly the sidecars.is_empty() somehow doesn't let the type inference work :(
698da9b to
5db45f5
Compare
e54ba02 to
58604ff
Compare
|
Note: Seems that arrow stuff is causing errors in the CI. Will fix in a separate Pr. |
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks great. I do wonder if there could be some way to avoid the pattern of "read the iter -> then call a final method to get the sidecars". I think it would be nice if it could just be that the final element of the iterator is the sidecars, but I realize the types don't quite line up. Maybe worth thinking about if we could use Either or something else to make that flow work, and if you think that's better.
kernel/src/log_reader/manifest.rs
Outdated
| /// Phase that processes single-part checkpoint. This also treats the checkpoint as a manifest file | ||
| /// and extracts the sidecar actions during iteration. | ||
| #[allow(unused)] | ||
| pub(crate) struct ManifestPhase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the name ManifestPhase, because it doesn't tell you what it's a phase of. Could this just be a CheckpointManifestProcessor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with CheckpointManifestReader
kernel/src/log_reader/manifest.rs
Outdated
| /// - `Sidecars`: Extracted sidecar files | ||
| /// - `Done`: No sidecars found | ||
| #[allow(unused)] | ||
| pub(crate) fn finalize(self) -> DeltaResult<AfterManifest> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, it's not clear what we're "finalizing" here. I'd maybe prefer to just call this get_extracted_sidecars or something like that, and it can return Option<Vec<FileMeta>>, unless we think this will return something beyond sidecars or "done" in the future.
Regardless let's make very clear in the doc comment that the iterator needs to be exhausted before you can call this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, simplified to just Vec<FileMeta>. Added docs as well
a6e206c to
563a6d2
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1500 +/- ##
==========================================
+ Coverage 84.55% 84.56% +0.01%
==========================================
Files 123 125 +2
Lines 33366 33584 +218
Branches 33366 33584 +218
==========================================
+ Hits 28213 28401 +188
- Misses 3768 3775 +7
- Partials 1385 1408 +23 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
77c7f93 to
a534df5
Compare
kernel/src/log_reader/commit.rs
Outdated
| use std::sync::Arc; | ||
| use url::Url; | ||
|
|
||
| fn load_test_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: this was moved to utils, but also the semantics changed a bit. Now it tries treating the test file as zipped. If that fails, it tries reading as if it isn't zipped.
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks!
| let log_root = log_segment.log_root.clone(); | ||
| assert_eq!(log_segment.checkpoint_parts.len(), 1); | ||
| let checkpoint_file = &log_segment.checkpoint_parts[0]; | ||
| let mut manifest_phase = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: variable name doesn't make sense anymore :)
| let mut manifest_phase = | |
| let mut checkpoint_manifest_reader = |
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
This PR introduces the CheckpointManifestReader. This is a LogReader that processes manifest files and extracts sidecar paths for later processing. Upon completion, the CheckpointManifestReader returns either:
How was this change tested?
The following cases are checked: