-
Notifications
You must be signed in to change notification settings - Fork 127
feat: Introduce Deduplicator trait to unify mutable and immutable deduplication #1537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Introduce Deduplicator trait to unify mutable and immutable deduplication #1537
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1537 +/- ##
==========================================
+ Coverage 84.77% 84.89% +0.12%
==========================================
Files 126 130 +4
Lines 35679 36372 +693
Branches 35679 36372 +693
==========================================
+ Hits 30246 30878 +632
- Misses 3965 3987 +22
- Partials 1468 1507 +39 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // These index positions correspond to the order of columns defined in | ||
| // `selected_column_names_and_types()` | ||
| const ADD_PATH_INDEX: usize = 0; // Position of "add.path" in getters | ||
| const ADD_PARTITION_VALUES_INDEX: usize = 1; // Position of "add.partitionValues" in getters | ||
| const ADD_DV_START_INDEX: usize = 2; // Start position of add deletion vector columns | ||
| const BASE_ROW_ID_INDEX: usize = 5; // Position of add.baseRowId in getters | ||
| const REMOVE_PATH_INDEX: usize = 6; // Position of "remove.path" in getters | ||
| const REMOVE_DV_START_INDEX: usize = 7; // Start position of remove deletion vector columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: these are moved out because (annoyingly), rust expects a proper type AddRemoveDedupVisitor::<FileActionDeduplicator>::ADD_PATH_INDEX
| } | ||
| } | ||
|
|
||
| impl<'seen> Deduplicator for FileActionDeduplicator<'seen> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: this just moves it to the Deduplicator impl block.
| const ADD_PATH_INDEX: usize = 0; // Position of "add.path" in getters | ||
| const ADD_PARTITION_VALUES_INDEX: usize = 1; // Position of "add.partitionValues" in getters | ||
| const ADD_DV_START_INDEX: usize = 2; // Start position of add deletion vector columns | ||
| const BASE_ROW_ID_INDEX: usize = 5; // Position of add.baseRowId in getters | ||
| const REMOVE_PATH_INDEX: usize = 6; // Position of "remove.path" in getters | ||
| const REMOVE_DV_START_INDEX: usize = 7; // Start position of remove deletion vector columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: these are moved here because (annoyingly), rust expects a proper type AddRemoveDedupVisitor::::ADD_PATH_INDEX
| /// - `Ok(None)`: When no file action is found | ||
| /// - `Err(...)`: On any error during extraction | ||
| pub(crate) fn extract_file_action<'a>( | ||
| fn extract_file_action<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here because of unnecessary visibility specifiers from trait Deduplicator.
| //! actions selected | ||
| //! | ||
| use crate::engine_data::{FilteredEngineData, GetData, RowVisitor, TypedGetData as _}; | ||
| use crate::log_replay::deduplicator::Deduplicator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imported to have access to the trait's check_record_and_seen
433c0aa to
36651ea
Compare
36651ea to
8c284ae
Compare
| /// - `dv_start_index` retrieves the storage type (`deletionVector.storageType`). | ||
| /// - `dv_start_index + 1` retrieves the path or inline deletion vector (`deletionVector.pathOrInlineDv`). | ||
| /// - `dv_start_index + 2` retrieves the optional offset (`deletionVector.offset`). | ||
| fn extract_dv_unique_id<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this directly from FileActionDeduplicator.
d71a38d to
9312426
Compare
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
This PR introduces the
Deduplicatortrait which will unify Commit and Checkpoint deduplication under a single API. This is done to move the&mut HashSet<..>out ofAddRemoveDedupand behind theDeduplicatortrait. That way, we can reuseAddRemoveDedupVisitorin an immutable scenario.How was this change tested?