feat(core): Add support for _file column
#1824
base: main
Conversation
_file column
…erg-rust into feature/gb/file-column
liurenjie1024 left a comment:
Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's somewhat trivial to compute _file, it's non-trivial to compute _pos efficiently, since by the time we read parquet files we have already filtered out some row groups. I think the best way is to push reading these two columns down to arrow-rs.
crates/iceberg/src/arrow/reader.rs
Outdated
```rust
pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

/// Column name for the file path metadata column per Iceberg spec
pub(crate) const RESERVED_COL_NAME_FILE: &str = "_file";
```
I think we should add a metadata_columns module similar to what java did: https://github.com/apache/iceberg/blob/bb32b90c4ad63f037f0bda197cc53fb08c886934/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L2
@liurenjie1024 I agree. How do we go forward with rethinking this; what would be the action items for us?
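The `metadata_columns` module suggested above (modeled on Java's `MetadataColumns` class) could be sketched roughly as follows. This is a minimal illustration, not the actual iceberg-rust code; the function names follow the ones mentioned later in this review, and the module layout is an assumption.

```rust
// Sketch of a possible `metadata_columns` module. The constants mirror the
// reserved values discussed in this PR; everything else is illustrative.

/// Reserved field ID for the `_file` metadata column (per the Iceberg spec,
/// reserved IDs count down from i32::MAX - 1).
pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

/// Column name for the file path metadata column.
pub const RESERVED_COL_NAME_FILE: &str = "_file";

/// Returns true if `name` refers to a reserved metadata column.
pub fn is_metadata_column_name(name: &str) -> bool {
    name == RESERVED_COL_NAME_FILE
}

/// Maps a metadata column name to its reserved field ID, if any.
pub fn get_metadata_field_id(name: &str) -> Option<i32> {
    match name {
        RESERVED_COL_NAME_FILE => Some(RESERVED_FIELD_ID_FILE),
        _ => None,
    }
}

fn main() {
    // Callers can branch on metadata columns without hardcoded string checks.
    assert!(is_metadata_column_name("_file"));
    assert_eq!(get_metadata_field_id("_file"), Some(RESERVED_FIELD_ID_FILE));
    assert_eq!(get_metadata_field_id("id"), None);
    println!("ok");
}
```

New metadata columns (e.g. `_pos`) would then only touch this module, not every call site.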
…erg-rust into feature/gb/file-column
liurenjie1024 left a comment:
Thanks @gbrgr for this PR, I left some comments with suggested improvements.
crates/iceberg/src/scan/mod.rs
Outdated
````rust
/// # Ok(())
/// # }
/// ```
pub const RESERVED_COL_NAME_FILE: &str = RESERVED_COL_NAME_FILE_INTERNAL;
````
We will have more metadata columns, so I would prefer to put these definitions in something like a metadata_columns module.
Added a new module
crates/iceberg/src/scan/mod.rs
Outdated
```rust
if let Some(column_names) = self.column_names.as_ref() {
    for column_name in column_names {
        // Skip reserved columns that don't exist in the schema
        if column_name == RESERVED_COL_NAME_FILE_INTERNAL {
```
We should have something like is_metadata_column_name() in the metadata_columns module, and use is_metadata_column_name() here, so that we can avoid such changes when we add more metadata columns.
Done in the new module
crates/iceberg/src/arrow/reader.rs
Outdated
```rust
/// Helper function to add a `_file` column to a RecordBatch at a specific position.
/// Takes the array, field to add, and position where to insert.
fn create_file_field_at_position(
```
I think this approach is not extensible. I'd prefer something similar in this PR:
- Add a `constant_map` for `ArrowReader`
- Add another variant of `RecordBatchTransformer` to handle constant fields like `_file`
I sketched an approach in the record batch transformer: I took the path of the transformer storing a constant_map which can be populated by the reader.
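The "constant map" idea described above can be illustrated with a small, self-contained sketch. The types here are simplified stand-ins for the real iceberg/arrow types (assumptions for illustration); the actual `RecordBatchTransformer` carries much more state.

```rust
use std::collections::HashMap;

// Simplified stand-in for iceberg-rust's PrimitiveLiteral (assumption).
#[derive(Clone, Debug)]
enum PrimitiveLiteral {
    String(String),
}

/// Sketch of a transformer carrying a pre-computed constant map, populated
/// by the reader via `with_constant` before any batch is processed.
struct RecordBatchTransformer {
    constant_fields: HashMap<i32, PrimitiveLiteral>,
}

impl RecordBatchTransformer {
    fn new() -> Self {
        Self { constant_fields: HashMap::new() }
    }

    /// Registers a constant value (e.g. the `_file` path) for a field ID.
    fn with_constant(mut self, field_id: i32, value: PrimitiveLiteral) -> Self {
        self.constant_fields.insert(field_id, value);
        self
    }

    /// During batch processing, a registered constant is materialized as a
    /// repeated value instead of being read from the data file.
    fn constant_for(&self, field_id: i32) -> Option<&PrimitiveLiteral> {
        self.constant_fields.get(&field_id)
    }
}

fn main() {
    const RESERVED_FIELD_ID_FILE: i32 = 2147483646; // `_file`, per spec
    let t = RecordBatchTransformer::new().with_constant(
        RESERVED_FIELD_ID_FILE,
        PrimitiveLiteral::String("s3://bucket/data/f1.parquet".into()),
    );
    assert!(t.constant_for(RESERVED_FIELD_ID_FILE).is_some());
    println!("ok");
}
```

The benefit over post-processing is that the reader registers each constant once per file, and the transformer handles it uniformly with all other projected fields.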
Hi @vustef, I also agree that we should put …
@liurenjie1024 I now resolved the merge conflicts that stem from PR #1821:
…erg-rust into feature/gb/file-column
```rust
// Helper to create REE type with the given values type
// Note: values field is nullable as Arrow expects this when building the
// final Arrow schema with `RunArray::try_new`.
let make_ree = |values_type: DataType| -> DataType {
```
I'd limit REEs to this method only; the others are not really related to the PR.
Done
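For context on why run-end encoding (REE) suits constant columns like `_file`: a column of identical values is stored as a single run, i.e. one run-end equal to the row count plus one stored value. The std-only sketch below only illustrates that layout; the real code would build a `DataType::RunEndEncoded` array via arrow-rs (`RunArray::try_new`), and the function name here is hypothetical.

```rust
/// Minimal illustration of the REE layout for a constant string column:
/// one run covering all rows, one stored value, regardless of row count.
fn constant_column_ree(value: &str, len: i32) -> (Vec<i32>, Vec<String>) {
    // run_ends = [len] (the single run ends at the last row),
    // values   = [value] (the one repeated value).
    (vec![len], vec![value.to_string()])
}

fn main() {
    let (run_ends, values) = constant_column_ree("s3://bucket/data/f1.parquet", 1024);
    assert_eq!(run_ends, vec![1024]);
    // Memory cost is O(1) in the row count instead of O(n).
    assert_eq!(values.len(), 1);
    println!("ok");
}
```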
We should move this method to the arrow module. For example, you could have a new `ToArrowSchemaConverter`.
I moved it to schema.rs as a single helper method.
@vustef I changed the PR so that default values are no longer REE-encoded; only constant fields that come from added metadata fields plus partition data are. @liurenjie1024 let us know whether that is OK. If REE is desired for all constant columns in general, I think it is better to do that in a follow-up PR to keep changesets smaller.
@liurenjie1024 just a friendly ping for the next round of your feedback, if you have time.
```rust
projected_iceberg_field_ids: Vec<i32>,
// Pre-computed constant field information: field_id -> (arrow_type, value)
// Avoids duplicate lookups and type conversions during batch processing
constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>,
```
What's the benefit of using arrow's DataType here?
```rust
Arc::new(NestedField::required(
    RESERVED_FIELD_ID_FILE,
    RESERVED_COL_NAME_FILE,
    Type::Primitive(PrimitiveType::String),
```
Added the doc field
```rust
///
/// # Returns
/// The PrimitiveType of the field, or an error if the field is not a primitive type
pub fn metadata_field_primitive_type(field: &NestedFieldRef) -> Result<PrimitiveType> {
```
I don't think we should add this method for metadata columns only. If you think it is necessary, we could add an `as_primitive_type_result` method on `Type`.
Removed this method as not needed anymore
```rust
// Only identity transforms should use constant values from partition metadata
if matches!(field.transform, Transform::Identity) {
    // Get the partition value for this field
    if let Some(Literal::Primitive(value)) = &partition_data[pos] {
```
- We should not ignore the `None` case. `None` means the value is null, which is still valid.
- We should check the source field's type and value together to ensure that both of them are primitives. If not, we should throw an error.
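The two points above can be sketched in one helper. The enums are simplified stand-ins for iceberg-rust's `Literal`/`PrimitiveLiteral`, and the function name is hypothetical; the point is only the shape of the match: null is accepted, non-primitive is an error.

```rust
// Simplified stand-ins (assumptions) for iceberg-rust's literal types.
#[derive(Clone, Debug, PartialEq)]
enum PrimitiveLiteral {
    Long(i64),
}

#[derive(Clone, Debug, PartialEq)]
enum Literal {
    Primitive(PrimitiveLiteral),
    Struct(Vec<Literal>),
}

/// A `None` partition value is a valid null constant, a primitive value is
/// used as-is, and anything else (e.g. a struct) is rejected as an error.
fn constant_from_partition(
    value: &Option<Literal>,
) -> Result<Option<PrimitiveLiteral>, String> {
    match value {
        None => Ok(None), // null partition value: still a valid constant
        Some(Literal::Primitive(p)) => Ok(Some(p.clone())),
        Some(other) => Err(format!("expected primitive partition value, got {other:?}")),
    }
}

fn main() {
    assert_eq!(constant_from_partition(&None), Ok(None));
    assert_eq!(
        constant_from_partition(&Some(Literal::Primitive(PrimitiveLiteral::Long(7)))),
        Ok(Some(PrimitiveLiteral::Long(7)))
    );
    assert!(constant_from_partition(&Some(Literal::Struct(vec![]))).is_err());
    println!("ok");
}
```

(A later comment in this thread concludes that the null case should error for now until `Datum` supports null values; that would change the `None` arm.)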
liurenjie1024 left a comment:
Thanks @gbrgr, I think it mostly looks good; it just requires some minor fixes.
````rust
/// Arrow DataType with Run-End Encoding applied
///
/// # Example
/// ```ignore
````
I think it's safe to run this test?
```rust
constants.insert(field.source_id, value.clone());
// Handle both None (null) and Some(Literal::Primitive) cases
match &partition_data[pos] {
    None => {
```
I think this is incorrect: in this case we should return None, and in the other case we should return Some(Datum).
It seems we need to change Datum to support null values. We don't need to do it now; please file an issue to track it and add a TODO here.
Should we error then in the None case?
Yes, please
```rust
.get(field_id)
.ok_or(Error::new(ErrorKind::Unexpected, "field not found"))?
.0;
let datum = constant_fields.get(field_id).unwrap();
```
nit: To avoid the unwrap here, we could use constant_fields.get().map().
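One fallible-lookup shape for that nit, as a std-only sketch (the function name and error type are illustrative, not the real iceberg-rust `Error`):

```rust
use std::collections::HashMap;

/// Replaces `.get(..).unwrap()` with a lookup that surfaces a proper error
/// instead of panicking when the field ID is missing.
fn lookup(constant_fields: &HashMap<i32, String>, field_id: i32) -> Result<&String, String> {
    constant_fields
        .get(&field_id)
        .ok_or_else(|| format!("constant for field {field_id} not found"))
}

fn main() {
    let mut m = HashMap::new();
    m.insert(1, "_file".to_string());
    assert!(lookup(&m, 1).is_ok());
    assert!(lookup(&m, 2).is_err());
    println!("ok");
}
```

The suggested `.get().map()` form works the same way when the caller already handles the `Option`; `ok_or_else` is the variant for `?`-propagation.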
```rust
// Create the values array based on the literal value
let values_array: ArrayRef = match (values_field.data_type(), prim_lit) {
    (DataType::Boolean, Some(PrimitiveLiteral::Boolean(v))) => {
```
nit: Do we need to duplicate so much in another branch? Can we move this to `value.rs` under the arrow module?
Which issue does this PR close?

Closes `_file` metadata column #1766.

What changes are included in this PR?

Integrates virtual field handling for the `_file` metadata column into `RecordBatchTransformer` using a pre-computed constants map, eliminating post-processing and duplicate lookups.

Key Changes

- New `metadata_columns.rs` module: centralized utilities for metadata columns, including `RESERVED_FIELD_ID_FILE`, `RESERVED_COL_NAME_FILE`, `get_metadata_column_name()`, `get_metadata_field_id()`, `is_metadata_field()`, `is_metadata_column_name()`
- Enhanced `RecordBatchTransformer`: `constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>` pre-computed during initialization; `with_constant()` method computes the Arrow type once during setup; `DataType::RunEndEncoded` for constant strings (memory efficient)
- Simplified `reader.rs`: passes `project_field_ids` (including virtual fields) to `RecordBatchTransformer`; `with_constant()` call registers the `_file` column
- Updated `scan/mod.rs`: uses `is_metadata_column_name()` and `get_metadata_field_id()` instead of hardcoded checks

Are these changes tested?

Yes, comprehensive tests have been added to verify the functionality.

New Tests (7 tests added)

Table Scan API Tests (7 tests):

- `test_select_with_file_column`: verifies basic functionality of selecting `_file` with regular columns
- `test_select_file_column_position`: verifies column ordering is preserved
- `test_select_file_column_only`: tests selecting only the `_file` column
- `test_file_column_with_multiple_files`: tests the multiple data files scenario
- `test_file_column_at_start`: tests `_file` at position 0
- `test_file_column_at_end`: tests `_file` at the last position
- `test_select_with_repeated_column_names`: tests repeated column selection
test_select_with_file_column- Verifies basic functionality of selecting_filewith regular columnstest_select_file_column_position- Verifies column ordering is preservedtest_select_file_column_only- Tests selecting only the_filecolumntest_file_column_with_multiple_files- Tests multiple data files scenariotest_file_column_at_start- Tests_fileat position 0test_file_column_at_end- Tests_fileat the last positiontest_select_with_repeated_column_names- Tests repeated column selection