-
Notifications
You must be signed in to change notification settings - Fork 11
Add arrow-ipc Array -> Byes codec #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Nice one, Ryan. This looks easy enough to support. Two thoughts:
|
|
|
||
| ## Configuration parameters | ||
|
|
||
| - `column_name`: the name of column used for generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should implementations use the name of the array here? an array only gets a name when its stored, so IMO it might be better to recommend a default that can be determined purely from information available when creating array metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, deciding on the "name" of the array would depend on various heuristics depending on how it is stored. It might be simpler to eliminate this parameter altogether and always use a fixed name like data.
|
In general the parquet format seems to be more in line with normal usage of zarr (long-term persistent storage) than the Arrow IPC format, and in particular supports various compression strategies. While you could compose this codec with additional bytes -> bytes compression codecs, parquet might offer some advantages, like permitting random access to individual fields/sub-fields while still supporting compression. |
That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:
So I'd prefer to keep flattening as part of the codec itself. |
The mismatch in number of dimensions could be detected when creating or opening the array, depending on what sort of validation the implementation does. That would certainly be better than waiting until the first read or write operation.
Yes, optionally the implementation could automatically insert a
Nonetheless it seems reasonable to convert to 1d automatically, just like the Note: There are also the arrow |
|
The codec description should say explicitly that the encoded representation should start with a |
This adds an Array -> Bytes codec which uses the Arrow IPC protocol for serialization. Further context available in the design doc from the Zarr Summit.
This is a step towards better Arrow interoperability. In Zarr Python, this only works with Numpy arrays whose dtypes map trivially to Arrow dtypes.
Implementation: zarr-developers/zarr-python#3613