Skip to content

Conversation

@OussamaSaoudi
Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi commented Nov 19, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

This PR introduces the CheckpointManifestReader. This is a LogReader that processes manifest files and extracts sidecar paths for later processing. Upon completion, the CheckpointManifestReader returns either:

  • Done: The entire checkpoint has been processed.
  • Sidecars: Contains paths to the remaining sidecar checkpoint files that must be processed.

How was this change tested?

The following cases are checked:

  1. Parquet manifest checkpoint produces the correct sidecars
  2. Json manifest checkpoint produces the correct sidecars
  3. Single-part checkpoint with no sidecars
  4. Attempting to finalize the CheckpointManifestReader without reaching completion results in an error.

))
);

let sidecars: Vec<_> = self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotation shouldn't be necessary here because AfterManifest::Sidecars forces it

Suggested change
let sidecars: Vec<_> = self
let sidecars = self

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly the sidecars.is_empty() somehow doesn't let the type inference work :(

@OussamaSaoudi OussamaSaoudi force-pushed the stack/dlr_manifest branch 2 times, most recently from 698da9b to 5db45f5 Compare November 24, 2025 23:07
@OussamaSaoudi OussamaSaoudi marked this pull request as ready for review November 25, 2025 00:01
@OussamaSaoudi
Copy link
Collaborator Author

Note: Seems that arrow stuff is causing errors in the CI. Will fix in a separate Pr.

@OussamaSaoudi OussamaSaoudi changed the title feat: Manifest File Reader feat: Add CheckpointManifestReader to process sidecar files Nov 25, 2025
Copy link
Member

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks great. I do wonder if there could be some way to avoid the pattern of "read the iter -> then call a final method to get the sidecars". I think it would be nice if it could just be that the final element of the iterator is the sidecars, but I realize the types don't quite line up. Maybe worth thinking about if we could use Either or something else to make that flow work, and if you think that's better.

/// Phase that processes single-part checkpoint. This also treats the checkpoint as a manifest file
/// and extracts the sidecar actions during iteration.
#[allow(unused)]
pub(crate) struct ManifestPhase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the name ManifestPhase, because it doesn't tell you what it's a phase of. Could this just be a CheckpointManifestProcessor?

Copy link
Collaborator Author

@OussamaSaoudi OussamaSaoudi Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with CheckpointManifestReader

/// - `Sidecars`: Extracted sidecar files
/// - `Done`: No sidecars found
#[allow(unused)]
pub(crate) fn finalize(self) -> DeltaResult<AfterManifest> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, it's not clear what we're "finalizing" here. I'd maybe prefer to just call this get_extracted_sidecars or something like that, and it can return Option<Vec<FileMeta>>, unless we think this will return something beyond sidecars or "done" in the future.

Regardless let's make very clear in the doc comment that the iterator needs to be exhausted before you can call this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, simplified to just Vec<FileMeta>. Added docs as well

@codecov
Copy link

codecov bot commented Nov 25, 2025

Codecov Report

❌ Patch coverage is 85.65217% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.56%. Comparing base (6972292) to head (a534df5).

Files with missing lines Patch % Lines
kernel/src/log_reader/checkpoint_manifest.rs 85.25% 8 Missing and 15 partials ⚠️
kernel/src/log_reader/commit.rs 87.75% 1 Missing and 5 partials ⚠️
kernel/src/utils.rs 85.71% 0 Missing and 3 partials ⚠️
kernel/src/log_segment.rs 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1500      +/-   ##
==========================================
+ Coverage   84.55%   84.56%   +0.01%     
==========================================
  Files         123      125       +2     
  Lines       33366    33584     +218     
  Branches    33366    33584     +218     
==========================================
+ Hits        28213    28401     +188     
- Misses       3768     3775       +7     
- Partials     1385     1408      +23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

use std::sync::Arc;
use url::Url;

fn load_test_table(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: this was moved to utils, but also the semantics changed a bit. Now it tries treating the test file as zipped. If that fails, it tries reading as if it isn't zipped.

Copy link
Member

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

let log_root = log_segment.log_root.clone();
assert_eq!(log_segment.checkpoint_parts.len(), 1);
let checkpoint_file = &log_segment.checkpoint_parts[0];
let mut manifest_phase =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: variable name doesn't make sense anymore :)

Suggested change
let mut manifest_phase =
let mut checkpoint_manifest_reader =

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants