chore: various refactoring changes for iceberg [iceberg] #2680

parthchandra · 2025-11-03T20:56:21Z

Which issue does this PR close?

Part of the changes needed for #2060
Mostly does cleanup of the native_iceberg_compat APIs so the they do not have Parquet classes. As a plus provides a utility class to allow ParquetMetadata to be serialized and deserialized to/from the Thrift format. This will also be useful in passing ParquetMetadata from JVM to native (for all native scan implementations). Currently the native scans end up reading Parquet metadata again (even though it has already been read in the JVM side) and this can be a costly operation in object stores.

codecov-commenter · 2025-11-03T21:17:53Z

Codecov Report

❌ Patch coverage is 0% with 148 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.31%. Comparing base (f09f8af) to head (25d8a37).
⚠️ Report is 692 commits behind head on main.

Files with missing lines	Patch %	Lines
...va/org/apache/comet/parquet/NativeBatchReader.java	0.00%	112 Missing ⚠️
...e/comet/parquet/IcebergCometNativeBatchReader.java	0.00%	22 Missing ⚠️
...pache/comet/parquet/ParquetMetadataSerializer.java	0.00%	13 Missing ⚠️
...org/apache/comet/parquet/AbstractColumnReader.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2680      +/-   ##
============================================
+ Coverage     56.12%   58.31%   +2.18%     
- Complexity      976     1457     +481     
============================================
  Files           119      166      +47     
  Lines         11743    14130    +2387     
  Branches       2251     2395     +144     
============================================
+ Hits           6591     8240    +1649     
- Misses         4012     4690     +678     
- Partials       1140     1200      +60

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

andygrove · 2025-11-04T01:57:08Z

native/core/Cargo.toml

 hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3"]}
-opendal = { version ="0.54.1", optional = true, features = ["services-hdfs"] }
-uuid = "1.0"
+opendal = { version ="0.54.0", optional = true, features = ["services-hdfs"] }


is there a reason for this change? Comet could still choose to use 0.54.1 since it is semver compatible

Looks like this happened due to rebasing. Reverted.

andygrove

LGTM. Thanks @parthchandra

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java

martin-g · 2025-11-04T13:34:09Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+            filteredSchema = filteredSchema.add(sparkFields[i]);
+          }
+        }
+        sparkSchema = filteredSchema;


Is it possible that the filtering done here may lead to ArrayIndexOutOfBoundsException at https://github.com/parthchandra/datafusion-comet/blob/d73bcbab9f80836d7229207f309283942501e9ab/common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java#L985 ?
Now the sparkSchema may have less fields than before I see no new logic to protect the .fields()[i] call there.

Yes, you're right. This is not entirely correct. Let me fix this.

Yup. Fixed to match the fields by name.

martin-g · 2025-11-04T13:46:34Z

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java

+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A specialized NativeBatchReader for Iceberg that accepts ParquetMetadata as a JSON string. This


accepts ParquetMetadata as a JSON string - actually it accepts byte[] parquetMetadataBytes at https://github.com/apache/datafusion-comet/pull/2680/files#diff-e57878f6cd8036999500de5719f8f4bbe28e1ed5dcb79a02ad7d7eb206f37473R44, i.e. not a String but bytes.

Thank you for catching this. The first version I did used JSON, but this is more efficient.

martin-g · 2025-11-07T06:49:14Z

@parthchandra You said Done but I see no new commits in the PR. Did the push fail ?

parthchandra · 2025-11-07T18:26:30Z

@parthchandra You said Done but I see no new commits in the PR. Did the push fail ?

Oops. I had pushed to the wrong branch :(. Corrected.

martin-g

Does it need unit tests for the new classes ?

martin-g · 2025-11-10T06:37:24Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+      }
+
+      //      String timeZoneId = conf.get("spark.sql.session.timeZone");
+      String timeZoneId = "UTC";


Is this intentional ?
If it is then either move the comment below one line up or add a new comment why timeZoneId should be also always UTC. The commented out conf.get("spark.sql.session.timeZone"); could be removed too.

martin-g · 2025-11-10T06:42:18Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+        DataType dataType = null;
+        int sparkSchemaIndex = -1;
+        for (int j = 0; j < sparkFields.length; j++) {
+          if (sparkFields[j].name().equals(field.getName())) {


Should this equality check take into account spark.sql.caseSensitive ?
If it is sensitive then it could be optimized by storing the sparkFields in a Map<String, Field> and lookup by name here instead of looping over them for each field

martin-g · 2025-11-10T06:42:48Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+              i < preInitializedReaders.length && preInitializedReaders[i] != null;
+          int finalI = i;
+          boolean existsInFileSchema =
+              fileFields.stream().anyMatch(f -> f.getName().equals(sparkFields[finalI].name()));


Should this equality check take into account spark.sql.caseSensitive ?

Yes, it should. Fixed

martin-g · 2025-11-10T06:48:13Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

    Path path = new Path(new URI(filePath));
    try (FileReader fileReader =
        new FileReader(
-            CometInputFile.fromPath(path, conf), footer, readOptions, cometReadOptions, metrics)) {


Why is the footer not passed anymore ? This way it will be re-read at https://github.com/parthchandra/datafusion-comet/blob/d8cd7b78c3509b2ec147d4991e0664d1a63febc1/common/src/main/java/org/apache/comet/parquet/FileReader.java#L201-L203

Got removed accidentally. Thanks for catching this!

martin-g · 2025-11-10T06:50:14Z

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java

+    this.sparkSchema = requiredSchema;
+  }
+
+  /** Initialize the reader using FileInfo instead of PartitionedFile. */


Suggested change

/** Initialize the reader using FileInfo instead of PartitionedFile. */

/** Initialize the reader using FileInfo instead of PartitionedFile. */

@Override

This is not an override. The parent init method has a different signature.

martin-g · 2025-11-10T06:53:23Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

-            ConstantColumnReader reader =
-                new ConstantColumnReader(nonPartitionFields[i], capacity, useDecimal128);
-            columnReaders[i] = reader;
+          if (preInitializedReaders != null && preInitializedReaders[i] != null) {


Suggested change

if (preInitializedReaders != null && preInitializedReaders[i] != null) {

if (preInitializedReaders != null && i < preInitializedReaders.length && preInitializedReaders[i] != null) {

martin-g · 2025-11-10T06:55:12Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+        int columnIndex = getColumnIndexFromParquetColumn(column);
+        if (columnIndex == -1
+            || preInitializedReaders == null
+            || preInitializedReaders[columnIndex] == null) {


This probably needs a check for boundaries before trying to access this index.

martin-g · 2025-11-10T06:56:12Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+      return fileSize;
+    }
+
+    public URI pathUri() throws Exception {


Suggested change

public URI pathUri() throws Exception {

public URI pathUri() throws URISyntaxException {

parthchandra · 2025-11-13T17:30:54Z

Does it need unit tests for the new classes ?

Functionality is mostly covered by Comet tests and running Iceberg tests with Comet enabled.

parthchandra · 2025-11-13T17:31:46Z

Merged. Thanks @martin-g, @andygrove .

chore: various refactoring changes for iceberg

d73bcba

parthchandra marked this pull request as draft November 3, 2025 20:56

parthchandra marked this pull request as ready for review November 3, 2025 23:56

parthchandra requested a review from andygrove November 3, 2025 23:57

andygrove reviewed Nov 4, 2025

View reviewed changes

andygrove approved these changes Nov 4, 2025

View reviewed changes

martin-g reviewed Nov 4, 2025

View reviewed changes

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java Show resolved Hide resolved

martin-g reviewed Nov 4, 2025

View reviewed changes

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java Show resolved Hide resolved

martin-g reviewed Nov 4, 2025

View reviewed changes

andygrove changed the title ~~chore: various refactoring changes for iceberg~~ chore: various refactoring changes for iceberg [iceberg] Nov 6, 2025

parthchandra added 2 commits November 5, 2025 20:08

address comments

d12f4f4

revert change to opendal version

d8cd7b7

martin-g reviewed Nov 10, 2025

View reviewed changes

address review comments

25d8a37

martin-g approved these changes Nov 13, 2025

View reviewed changes

parthchandra merged commit 35a99e0 into apache:main Nov 13, 2025
117 checks passed

	/** Initialize the reader using FileInfo instead of PartitionedFile. */
	/** Initialize the reader using FileInfo instead of PartitionedFile. */
	@Override

	if (preInitializedReaders != null && preInitializedReaders[i] != null) {
	if (preInitializedReaders != null && i < preInitializedReaders.length && preInitializedReaders[i] != null) {

	public URI pathUri() throws Exception {
	public URI pathUri() throws URISyntaxException {

chore: various refactoring changes for iceberg [iceberg] #2680

chore: various refactoring changes for iceberg [iceberg] #2680

Uh oh!

Conversation

parthchandra commented Nov 3, 2025

Which issue does this PR close?

Uh oh!

codecov-commenter commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andygrove left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martin-g commented Nov 7, 2025

Uh oh!

parthchandra commented Nov 7, 2025

Uh oh!

martin-g left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

parthchandra commented Nov 13, 2025

Uh oh!

Uh oh!

parthchandra commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Nov 3, 2025 •

edited

Loading