Skip to content

Conversation

@OussamaSaoudi
Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi commented Dec 3, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

This PR introduces the Deduplicator trait which will unify Commit and Checkpoint deduplication under a single API. This is done to move the &mut HashSet<..> out of AddRemoveDedup and behind the Deduplicator trait. That way, we can reuse AddRemoveDedupVisitor in an immutable scenario.

How was this change tested?

@codecov
Copy link

codecov bot commented Dec 3, 2025

Codecov Report

❌ Patch coverage is 90.65693% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.89%. Comparing base (72cb27c) to head (9312426).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/log_reader/checkpoint_manifest.rs 85.25% 8 Missing and 15 partials ⚠️
kernel/src/parallel/sequential_phase.rs 87.79% 10 Missing and 11 partials ⚠️
kernel/src/scan/log_replay.rs 97.29% 3 Missing and 4 partials ⚠️
kernel/src/log_reader/commit.rs 87.75% 1 Missing and 5 partials ⚠️
kernel/src/log_replay/deduplicator.rs 83.33% 0 Missing and 3 partials ⚠️
kernel/src/utils.rs 85.71% 0 Missing and 3 partials ⚠️
kernel/src/log_segment.rs 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1537      +/-   ##
==========================================
+ Coverage   84.77%   84.89%   +0.12%     
==========================================
  Files         126      130       +4     
  Lines       35679    36372     +693     
  Branches    35679    36372     +693     
==========================================
+ Hits        30246    30878     +632     
- Misses       3965     3987      +22     
- Partials     1468     1507      +39     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Dec 3, 2025
@OussamaSaoudi OussamaSaoudi changed the title extract file deduplicator feat: Introduce Deduplicator trait to unify mutable and immutable deduplication Dec 4, 2025
Comment on lines -233 to -109
// These index positions correspond to the order of columns defined in
// `selected_column_names_and_types()`
const ADD_PATH_INDEX: usize = 0; // Position of "add.path" in getters
const ADD_PARTITION_VALUES_INDEX: usize = 1; // Position of "add.partitionValues" in getters
const ADD_DV_START_INDEX: usize = 2; // Start position of add deletion vector columns
const BASE_ROW_ID_INDEX: usize = 5; // Position of add.baseRowId in getters
const REMOVE_PATH_INDEX: usize = 6; // Position of "remove.path" in getters
const REMOVE_DV_START_INDEX: usize = 7; // Start position of remove deletion vector columns
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: these are moved out because (annoyingly), rust expects a proper type AddRemoveDedupVisitor::<FileActionDeduplicator>::ADD_PATH_INDEX

}
}

impl<'seen> Deduplicator for FileActionDeduplicator<'seen> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: this just moves it to the Deduplicator impl block.

Comment on lines +85 to +91
const ADD_PATH_INDEX: usize = 0; // Position of "add.path" in getters
const ADD_PARTITION_VALUES_INDEX: usize = 1; // Position of "add.partitionValues" in getters
const ADD_DV_START_INDEX: usize = 2; // Start position of add deletion vector columns
const BASE_ROW_ID_INDEX: usize = 5; // Position of add.baseRowId in getters
const REMOVE_PATH_INDEX: usize = 6; // Position of "remove.path" in getters
const REMOVE_DV_START_INDEX: usize = 7; // Start position of remove deletion vector columns
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: these are moved here because (annoyingly), rust expects a proper type AddRemoveDedupVisitor::::ADD_PATH_INDEX

@OussamaSaoudi OussamaSaoudi marked this pull request as ready for review December 4, 2025 01:15
/// - `Ok(None)`: When no file action is found
/// - `Err(...)`: On any error during extraction
pub(crate) fn extract_file_action<'a>(
fn extract_file_action<'a>(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here because of unnecessary visibility specifiers from trait Deduplicator.

//! actions selected
//!
use crate::engine_data::{FilteredEngineData, GetData, RowVisitor, TypedGetData as _};
use crate::log_replay::deduplicator::Deduplicator;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imported to have access to the trait's check_record_and_seen

@OussamaSaoudi OussamaSaoudi mentioned this pull request Dec 5, 2025
@OussamaSaoudi OussamaSaoudi force-pushed the stack/dlr_add_rm_dedup branch from 433c0aa to 36651ea Compare December 8, 2025 17:56
@OussamaSaoudi OussamaSaoudi mentioned this pull request Dec 8, 2025
@OussamaSaoudi OussamaSaoudi force-pushed the stack/dlr_add_rm_dedup branch from 36651ea to 8c284ae Compare December 8, 2025 19:13
/// - `dv_start_index` retrieves the storage type (`deletionVector.storageType`).
/// - `dv_start_index + 1` retrieves the path or inline deletion vector (`deletionVector.pathOrInlineDv`).
/// - `dv_start_index + 2` retrieves the optional offset (`deletionVector.offset`).
fn extract_dv_unique_id<'a>(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this directly from FileActionDeduplicator.

@OussamaSaoudi OussamaSaoudi force-pushed the stack/dlr_add_rm_dedup branch from d71a38d to 9312426 Compare December 8, 2025 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant