Add arrow-ipc Array -> Byes codec #41

rabernat · 2025-12-03T19:49:38Z

This adds an Array -> Bytes codec which uses the Arrow IPC protocol for serialization. Further context available in the design doc from the Zarr Summit.

This is a step towards better Arrow interoperability. In Zarr Python, this only works with Numpy arrays whose dtypes map trivially to Arrow dtypes.

Implementation: zarr-developers/zarr-python#3613

LDeakin · 2025-12-03T20:10:41Z

Nice one, Ryan. This looks easy enough to support.

Two thoughts:

Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
Permit a 2D array input as well, so that multiple columns of the same data type could be stored

d-v-b · 2025-12-03T20:16:15Z

codecs/arrow-ipc/README.md

+
+## Configuration parameters
+
+- `column_name`: the name of column used for generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here.


why should implementations use the name of the array here? an array only gets a name when its stored, so IMO it might be better to recommend a default that can be determined purely from information available when creating array metadata.

Yes, deciding on the "name" of the array would depend on various heuristics depending on how it is stored. It might be simpler to eliminate this parameter altogether and always use a fixed name like data.

jbms · 2025-12-03T20:46:20Z

In general the parquet format seems to be more in line with normal usage of zarr (long-term persistent storage) than the Arrow IPC format, and in particular supports various compression strategies.

While you could compose this codec with additional bytes -> bytes compression codecs, parquet might offer some advantages, like permitting random access to individual fields/sub-fields while still supporting compression.

rabernat · 2025-12-03T21:16:41Z

Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used

That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:

Fail at runtime when the Arrow IPC codec is unable to process the ND array
Implement some kind of dependency system between codecs to make sure they are compatible

So I'd prefer to keep flattening as part of the codec itself.

jbms · 2025-12-03T21:52:08Z

Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used

That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:

Fail at runtime when the Arrow IPC codec is unable to process the ND array

The mismatch in number of dimensions could be detected when creating or opening the array, depending on what sort of validation the implementation does. That would certainly be better than waiting until the first read or write operation.

Implement some kind of dependency system between codecs to make sure they are compatible

Yes, optionally the implementation could automatically insert a reshape codec.

So I'd prefer to keep flattening as part of the codec itself.

Nonetheless it seems reasonable to convert to 1d automatically, just like the bytes and packbits codecs.

Note: There are also the arrow Tensor and SparseTensor message types (https://github.com/apache/arrow/blob/main/format/Tensor.fbs) --- they seem to be independent of the Schema and RecordBatch messages. While they provide a direct arrow representation of multi-dimensional arrays, I imagine they are not widely supported by other software, and more importantly, they only support simple, fixed-size data types and it seems that this codec is most useful for variable-length and struct data types.

jbms · 2025-12-03T21:54:57Z

The codec description should say explicitly that the encoded representation should start with a Schema message if that is the intention. The Schema can anyway be determined from the array data type but including it would presumably allow the raw chunk file to be more easily consumed by other software.

add arrow-ipc codec

82e488b

rabernat mentioned this pull request Dec 3, 2025

Arrow ipc Array -> Bytes codec zarr-developers/zarr-python#3613

Open

6 tasks

d-v-b reviewed Dec 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add arrow-ipc Array -> Byes codec #41

Add arrow-ipc Array -> Byes codec #41

rabernat commented Dec 3, 2025 •

edited

Loading

Uh oh!

LDeakin commented Dec 3, 2025

Uh oh!

d-v-b Dec 3, 2025

Uh oh!

jbms Dec 3, 2025

Uh oh!

jbms commented Dec 3, 2025

Uh oh!

rabernat commented Dec 3, 2025

Uh oh!

jbms commented Dec 3, 2025

Uh oh!

jbms commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		## Configuration parameters

		- `column_name`: the name of column used for generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here.

Add arrow-ipc Array -> Byes codec #41

Are you sure you want to change the base?

Add arrow-ipc Array -> Byes codec #41

Conversation

rabernat commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LDeakin commented Dec 3, 2025

Uh oh!

d-v-b Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

jbms Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

jbms commented Dec 3, 2025

Uh oh!

rabernat commented Dec 3, 2025

Uh oh!

jbms commented Dec 3, 2025

Uh oh!

jbms commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rabernat commented Dec 3, 2025 •

edited

Loading