@gbrgr gbrgr commented Nov 4, 2025

Which issue does this PR close?

What changes are included in this PR?

Integrates virtual field handling for the _file metadata column into RecordBatchTransformer using a pre-computed constants map, eliminating post-processing and duplicate lookups.

Key Changes

New metadata_columns.rs module: Centralized utilities for metadata columns

  • Constants: RESERVED_FIELD_ID_FILE, RESERVED_COL_NAME_FILE
  • Helper functions: get_metadata_column_name(), get_metadata_field_id(), is_metadata_field(), is_metadata_column_name()
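For illustration, a minimal sketch of what such a module might look like. The constant values are the ones quoted in the diff further down; the helper signatures (returning `Option`/`bool`) are assumptions for this sketch and may differ from the actual module, which could for example return `Result`.

```rust
/// Reserved field ID for the `_file` column (value as quoted in the PR diff).
pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
/// Column name for the file path metadata column.
pub const RESERVED_COL_NAME_FILE: &str = "_file";

/// Returns the metadata column name for a reserved field ID, if there is one.
/// (Assumed signature.)
pub fn get_metadata_column_name(field_id: i32) -> Option<&'static str> {
    match field_id {
        RESERVED_FIELD_ID_FILE => Some(RESERVED_COL_NAME_FILE),
        _ => None,
    }
}

/// Returns the reserved field ID for a metadata column name, if there is one.
/// (Assumed signature.)
pub fn get_metadata_field_id(column_name: &str) -> Option<i32> {
    match column_name {
        RESERVED_COL_NAME_FILE => Some(RESERVED_FIELD_ID_FILE),
        _ => None,
    }
}

/// True if the field ID belongs to a metadata column.
pub fn is_metadata_field(field_id: i32) -> bool {
    get_metadata_column_name(field_id).is_some()
}

/// True if the name belongs to a metadata column.
pub fn is_metadata_column_name(column_name: &str) -> bool {
    get_metadata_field_id(column_name).is_some()
}
```

With this shape, supporting another metadata column would only need one more arm in each `match`, and callers would stay unchanged.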

Enhanced RecordBatchTransformer:

  • Added constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)> - pre-computed during initialization
  • New with_constant() method - computes Arrow type once during setup
  • Updated to use pre-computed types and values (avoids duplicate lookups)
  • Handles DataType::RunEndEncoded for constant strings (memory efficient)
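The idea behind the pre-computed constants map can be sketched as follows. This is not the PR's code: `DataType`, `PrimitiveLiteral`, `ColumnSource`, and `TransformerSketch` are simplified stand-ins defined here, and only the `constant_fields` shape and the `with_constant()` name come from the description above.

```rust
use std::collections::HashMap;

/// Stand-in for Arrow's `DataType` (only the variants this sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Int64,
}

/// Stand-in for Iceberg's `PrimitiveLiteral`.
#[derive(Debug, Clone, PartialEq)]
pub enum PrimitiveLiteral {
    String(String),
    Long(i64),
}

/// Where the values of a projected column come from (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnSource {
    /// Read from the data file by field ID.
    FromFile(i32),
    /// Filled with a single constant value for every row.
    Constant(DataType, PrimitiveLiteral),
}

pub struct TransformerSketch {
    projected_field_ids: Vec<i32>,
    // field_id -> (arrow_type, value), computed once during setup
    constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>,
}

impl TransformerSketch {
    pub fn new(projected_field_ids: Vec<i32>) -> Self {
        Self { projected_field_ids, constant_fields: HashMap::new() }
    }

    /// Registers a constant; the type is derived here rather than per batch.
    pub fn with_constant(mut self, field_id: i32, value: PrimitiveLiteral) -> Self {
        let data_type = match &value {
            PrimitiveLiteral::String(_) => DataType::Utf8,
            PrimitiveLiteral::Long(_) => DataType::Int64,
        };
        self.constant_fields.insert(field_id, (data_type, value));
        self
    }

    /// Resolves every projected field ID, preserving the requested order.
    pub fn column_sources(&self) -> Vec<ColumnSource> {
        self.projected_field_ids
            .iter()
            .map(|id| match self.constant_fields.get(id) {
                Some((ty, v)) => ColumnSource::Constant(ty.clone(), v.clone()),
                None => ColumnSource::FromFile(*id),
            })
            .collect()
    }
}
```

Because the projection list includes the virtual field ID, the constant column ends up at whatever position the caller requested, with no post-processing step.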

Simplified reader.rs:

  • Pass full project_field_ids (including virtual) to RecordBatchTransformer
  • Single with_constant() call to register _file column
  • Removed post-processing loop

Updated scan/mod.rs:

  • Use is_metadata_column_name() and get_metadata_field_id() instead of hardcoded checks

Are these changes tested?

Yes, comprehensive tests have been added to verify the functionality:

New Tests (7 tests added)

Table Scan API Tests (7 tests)

  1. test_select_with_file_column - Verifies basic functionality of selecting _file with regular columns
  2. test_select_file_column_position - Verifies column ordering is preserved
  3. test_select_file_column_only - Tests selecting only the _file column
  4. test_file_column_with_multiple_files - Tests multiple data files scenario
  5. test_file_column_at_start - Tests _file at position 0
  6. test_file_column_at_end - Tests _file at the last position
  7. test_select_with_repeated_column_names - Tests repeated column selection

@gbrgr gbrgr changed the title from "Add support for _file column" to "feat(core): Add support for _file column" Nov 4, 2025
@gbrgr gbrgr marked this pull request as ready for review November 4, 2025 14:32
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR. But I think we need to rethink how we compute the _file and _pos metadata columns. While computing _file is fairly trivial, computing _pos efficiently is not, since some row groups have already been filtered out by the time we read the Parquet files. I think the best way is to push reading these two columns down to arrow-rs.
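To make the _pos concern concrete, here is a small sketch (hypothetical helper, not part of the PR or arrow-rs) showing why a simple running counter over the rows actually read gives wrong positions once row groups are skipped: the skipped groups still occupy positions in the file.

```rust
/// Returns, for each selected row group, the file-level position of its first row.
/// `row_counts[i]` is the number of rows in row group `i`;
/// `selected[i]` says whether the reader kept that row group.
pub fn first_positions(row_counts: &[u64], selected: &[bool]) -> Vec<u64> {
    let mut offset = 0u64;
    let mut starts = Vec::new();
    for (count, keep) in row_counts.iter().zip(selected) {
        if *keep {
            starts.push(offset);
        }
        // Skipped row groups still advance the file-level position.
        offset += count;
    }
    starts
}
```

For row groups of 100, 50, and 200 rows with the first one filtered out, the first row actually read has position 100, not 0. Knowing that offset requires row-group metadata, which is the reason given for handling _pos inside the Parquet reader.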

pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

/// Column name for the file path metadata column per Iceberg spec
pub(crate) const RESERVED_COL_NAME_FILE: &str = "_file";
Contributor

vustef commented Nov 6, 2025

> Thanks @gbrgr for this PR. But I think we need to rethink how we compute the _file and _pos metadata columns. While computing _file is fairly trivial, computing _pos efficiently is not, since some row groups have already been filtered out by the time we read the Parquet files. I think the best way is to push reading these two columns down to arrow-rs.

@liurenjie1024 I agree for _pos, and we have a PR there: apache/arrow-rs#8715
But _file seems like something that arrow-rs doesn't need to know about. Similarly, in the future, for _row_id from the V3 spec, we cannot expect arrow-rs to be responsible for computing that one.

How do we move forward with rethinking this? What would the action items be for us?

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR. I left some comments on things to improve.

/// # Ok(())
/// # }
/// ```
pub const RESERVED_COL_NAME_FILE: &str = RESERVED_COL_NAME_FILE_INTERNAL;
Contributor

We will have more metadata columns, so I would prefer to put these definitions in something like a metadata_columns module.

Author

Added a new module

if let Some(column_names) = self.column_names.as_ref() {
    for column_name in column_names {
        // Skip reserved columns that don't exist in the schema
        if column_name == RESERVED_COL_NAME_FILE_INTERNAL {
Contributor

We should have something like is_metadata_column_name() in the metadata_columns module and use it here, so that we can avoid changes like this when we add more metadata columns.

Author

Done in the new module


/// Helper function to add a `_file` column to a RecordBatch at a specific position.
/// Takes the array, field to add, and position where to insert.
fn create_file_field_at_position(
Contributor

I think this approach is not extensible. I prefer something similar to what's in this PR:

  1. Add constant_map for ArrowReader
  2. Add another variant of RecordBatchTransformer to handle constant field like _file

Author

I sketched an approach in the record batch transformer. I took the path of having the transformer just store a constant_map, which can be populated by the reader.

@liurenjie1024
Contributor

> > Thanks @gbrgr for this PR. But I think we need to rethink how we compute the _file and _pos metadata columns. While computing _file is fairly trivial, computing _pos efficiently is not, since some row groups have already been filtered out by the time we read the Parquet files. I think the best way is to push reading these two columns down to arrow-rs.

> @liurenjie1024 I agree for _pos, and we have a PR there: apache/arrow-rs#8715 But _file seems like something that arrow-rs doesn't need to know about. Similarly, in the future, for _row_id from the V3 spec, we cannot expect arrow-rs to be responsible for computing that one.

> How do we move forward with rethinking this? What would the action items be for us?

Hi @vustef, I also agree that we should put _file in iceberg-rust, and I left some comments about how to proceed.

Author

gbrgr commented Nov 14, 2025

@liurenjie1024 I have now resolved the merge conflicts that stem from PR #1821:

  • I removed the partition information stored in the RecordBatchTransformer; instead, it now generically stores constant fields, which can be used to add column sources.
  • I changed all added columns to be run-end encoded (REE), as they all hold constant values.
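As a rough illustration of why run-end encoding suits constant columns, the sketch below models an REE column with plain vectors. `RunEndEncoded` here is a hypothetical stand-in, not Arrow's `RunArray`; it only shows the layout idea that a constant column of any length needs one run end and one value.

```rust
/// Minimal run-end encoded column: `run_ends[i]` is the exclusive end index
/// of run `i`, and `values[i]` is the value repeated throughout that run.
#[derive(Debug, PartialEq)]
pub struct RunEndEncoded<T> {
    pub run_ends: Vec<i32>,
    pub values: Vec<T>,
}

impl<T> RunEndEncoded<T> {
    /// Encodes `num_rows` copies of `value` as a single run.
    /// (The `as i32` cast assumes batch sizes fit in an i32, which is fine for a sketch.)
    pub fn constant(value: T, num_rows: usize) -> Self {
        if num_rows == 0 {
            return Self { run_ends: vec![], values: vec![] };
        }
        Self { run_ends: vec![num_rows as i32], values: vec![value] }
    }

    /// Number of logical rows.
    pub fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Looks up the logical value at `row` by binary search over the run ends.
    pub fn get(&self, row: usize) -> Option<&T> {
        let idx = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values.get(idx)
    }
}
```

A 1024-row batch with a constant file path therefore stores the path string once instead of 1024 times.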

@gbrgr gbrgr requested a review from liurenjie1024 November 17, 2025 08:18
// Helper to create REE type with the given values type
// Note: values field is nullable as Arrow expects this when building the
// final Arrow schema with `RunArray::try_new`.
let make_ree = |values_type: DataType| -> DataType {

I'd limit REE to this method only; the others are not really related to this PR.

Author

Done

Contributor

We should move this method to the arrow module. For example, you could have a new ToArrowSchemaConverter.

Author

I moved it to schema.rs as a single helper method.

Author

gbrgr commented Nov 17, 2025

@vustef I changed the PR so that default values are no longer REE-encoded; only constant fields that come from added metadata fields and partition data are.

@liurenjie1024 let us know whether that is OK. If REE is desired in general for all constant columns, I guess it is better to do that in a follow-up PR to keep changesets smaller.

vustef commented Nov 24, 2025

@liurenjie1024 just a friendly ping on this for the new round of your feedback, if you have time.

projected_iceberg_field_ids: Vec<i32>,
// Pre-computed constant field information: field_id -> (arrow_type, value)
// Avoids duplicate lookups and type conversions during batch processing
constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>,
Contributor

What's the benefit of using arrow's DataType here?

Arc::new(NestedField::required(
RESERVED_FIELD_ID_FILE,
RESERVED_COL_NAME_FILE,
Type::Primitive(PrimitiveType::String),
Contributor

Author

Added the doc field

///
/// # Returns
/// The PrimitiveType of the field, or an error if the field is not a primitive type
pub fn metadata_field_primitive_type(field: &NestedFieldRef) -> Result<PrimitiveType> {
Contributor

I don't think we should add this method for metadata columns only. If you think it is necessary, we could add an as_primitive_type_result method to Type.
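A sketch of what the suggested general-purpose method could look like. `Type` and `PrimitiveType` below are simplified stand-ins, and both the `String` error type and the exact signature are assumptions; the method name is the one proposed in the comment.

```rust
/// Stand-in for Iceberg's `PrimitiveType`.
#[derive(Debug, Clone, PartialEq)]
pub enum PrimitiveType {
    String,
    Long,
}

/// Stand-in for Iceberg's `Type` with one non-primitive variant.
#[derive(Debug, Clone, PartialEq)]
pub enum Type {
    Primitive(PrimitiveType),
    Struct(Vec<Type>),
}

impl Type {
    /// Returns the primitive type, or an error describing what was found instead.
    pub fn as_primitive_type_result(&self) -> Result<&PrimitiveType, String> {
        match self {
            Type::Primitive(p) => Ok(p),
            other => Err(format!("expected a primitive type, found {other:?}")),
        }
    }
}
```

Placing the method on the type itself means any caller can use it, not only the metadata column code.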

Author

@gbrgr gbrgr Dec 2, 2025

Removed this method, as it is no longer needed.

// Only identity transforms should use constant values from partition metadata
if matches!(field.transform, Transform::Identity) {
    // Get the partition value for this field
    if let Some(Literal::Primitive(value)) = &partition_data[pos] {
Contributor

  1. We should not ignore the None case. None means the value is null, which is still valid.
  2. We should check the source field's type and value together to ensure that both of them are primitives. If not, we should return an error.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr, I think this mostly looks good; it just needs some minor fixes.

/// Arrow DataType with Run-End Encoding applied
///
/// # Example
/// ```ignore
Contributor

I think it's safe to run this test?

constants.insert(field.source_id, value.clone());
// Handle both None (null) and Some(Literal::Primitive) cases
match &partition_data[pos] {
    None => {
Contributor

I think this is incorrect. In this case we should return None, and in the other case we should return Some(Datum).

Contributor

It seems we need to change Datum to support null value. We don't need to do it now, please file an issue to track it and add a TODO here.

Author

Should we error then in the None case?

Contributor

Yes, please
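The behaviour agreed on in this thread (use identity-transformed primitive values, and return an error for null or non-primitive values until null constants are supported) might look roughly like the sketch below. `Literal`, the `(source_id, is_identity)` tuple representation, and the `String` errors are stand-ins chosen for this example, not the PR's actual types.

```rust
use std::collections::HashMap;

/// Stand-in for Iceberg's `Literal`, with an `i64` as the only primitive.
#[derive(Debug, Clone, PartialEq)]
pub enum Literal {
    Primitive(i64),
    Struct(Vec<Literal>),
}

/// Collects constants for identity-partitioned fields.
/// Each entry of `fields` is `(source_id, is_identity_transform)`, and
/// `partition_data[pos]` is the partition value for `fields[pos]`.
pub fn identity_constants(
    fields: &[(i32, bool)],
    partition_data: &[Option<Literal>],
) -> Result<HashMap<i32, i64>, String> {
    let mut constants = HashMap::new();
    for (pos, &(source_id, is_identity)) in fields.iter().enumerate() {
        // Only identity transforms carry the source column's value unchanged.
        if !is_identity {
            continue;
        }
        match partition_data.get(pos) {
            Some(Some(Literal::Primitive(v))) => {
                constants.insert(source_id, *v);
            }
            // TODO: a null partition value is valid; error until nulls are representable.
            Some(None) => {
                return Err(format!("null partition value for field {source_id} is not supported yet"));
            }
            Some(Some(_)) => {
                return Err(format!("partition value for field {source_id} is not a primitive"));
            }
            None => return Err(format!("missing partition value at position {pos}")),
        }
    }
    Ok(constants)
}
```

Returning an error rather than silently skipping the field avoids producing a column that looks present but holds wrong data.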

    .get(field_id)
    .ok_or(Error::new(ErrorKind::Unexpected, "field not found"))?
    .0;
let datum = constant_fields.get(field_id).unwrap();
Contributor

nit: To avoid unwrap here, we could use constant_fields.get().map()
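One way to read this suggestion, sketched with standard-library types only: do a single lookup, project the tuple with `map`, and turn a missing key into an error instead of panicking. The `(String, String)` tuple and the function name `constant_for` are placeholders for this example.

```rust
use std::collections::HashMap;

/// Looks up both parts of a constant field in one pass, without `unwrap`.
pub fn constant_for<'a>(
    constant_fields: &'a HashMap<i32, (String, String)>,
    field_id: i32,
) -> Result<(&'a str, &'a str), String> {
    constant_fields
        .get(&field_id)
        // Borrow both tuple elements from the single lookup.
        .map(|(data_type, value)| (data_type.as_str(), value.as_str()))
        // Build the error lazily, only when the key is missing.
        .ok_or_else(|| format!("constant field {field_id} not found"))
}
```

This also removes the second `get` on the same key that the original hunk performs.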


// Create the values array based on the literal value
let values_array: ArrayRef = match (values_field.data_type(), prim_lit) {
    (DataType::Boolean, Some(PrimitiveLiteral::Boolean(v))) => {
Contributor

nit: Do we need to duplicate so much of the other branch? Can we move this to value.rs under the arrow module?

Development

Successfully merging this pull request may close these issues.

Support for _file metadata column

3 participants