rfc: Making Storage a Trait #1885
Conversation
#[async_trait]
pub trait Storage: Debug + Send + Sync {
    // File existence and metadata
    async fn exists(&self, path: &str) -> Result<bool>;
Migrating comments from @c-thiel
I know that we use these results everywhere, but I think introducing more specific error types that we can match on for storage operations makes sense. They can implement Into, of course.
For example, a RateLimited error that we got from the storage service should be treated differently from NotFound or CredentialsExpired.
With Lakekeeper we are currently using our own trait-based IO due to many limitations in iceberg-rust, mainly unsupported signing mechanisms, missing refresh mechanisms, opaque errors, and missing extensibility.
I would gladly switch to iceberg-rust if we get these solved.
Maybe this can serve as some inspiration: https://github.com/lakekeeper/lakekeeper/blob/b8fcf54c627d48a547ef0baf6863949b68579388/crates/io/src/error.rs#L291
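To make the discussion concrete, here is a minimal sketch of what a dedicated storage error could look like. The StorageError and StorageErrorKind names and variants are hypothetical, loosely inspired by the Lakekeeper error type linked above, and the conversion targets the existing iceberg::Error:

```rust
use iceberg::{Error, ErrorKind};

/// Hypothetical storage-specific error kinds that callers can match on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StorageErrorKind {
    NotFound,
    PermissionDenied,
    CredentialsExpired,
    RateLimited,
    Other,
}

/// Hypothetical error type returned by storage operations.
#[derive(Debug)]
pub struct StorageError {
    kind: StorageErrorKind,
    message: String,
}

impl StorageError {
    pub fn kind(&self) -> StorageErrorKind {
        self.kind
    }
}

/// Lossy conversion into the crate-wide error, so storage errors can still
/// flow through APIs that return iceberg::Result<T>.
impl From<StorageError> for Error {
    fn from(err: StorageError) -> Self {
        Error::new(ErrorKind::Unexpected, err.message)
    }
}
```

Retry or refresh logic could then branch on RateLimited or CredentialsExpired before giving up, which is exactly the distinction the comment asks for.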
To address @c-thiel 's comments, we have several approaches:
- Introduce another set of errors for storage.
- Extend the current ErrorKind for storage errors.
- Extend the current ErrorKind, but with another enum, for example:

pub enum IoErrorKind {
    FileNotFound,
    CredentialExpired,
}

pub enum ErrorKind {
    // Existing variants
    ...
    Io(IoErrorKind)
}
I've added this to the Open Questions section of the RFC.
I think this is more of a phase 2 problem that we can think about and discuss more later.
    // File object creation
    fn new_input(&self, path: &str) -> Result<InputFile>;
    fn new_output(&self, path: &str) -> Result<OutputFile>;
}
Migrating comments from @c-thiel
Many object stores have a good way of running batch deletions, for example the DeleteObjects API in AWS S3. How would you feel about including a delete_batch method too?
I think we can provide a function like delete_iter.
+1, though I think this should not be included in the initial cut. We can add the new function when stabilizing the Storage trait. I'll update the implementation plan to mention it.
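For reference, a rough sketch of what such a hook could look like once the trait stabilizes. The method name and signature are illustrative only (the thread mentions both delete_batch and delete_iter), and it assumes the trait already exposes a single-file delete:

```rust
use async_trait::async_trait;
use iceberg::Result;

#[async_trait]
pub trait Storage: std::fmt::Debug + Send + Sync {
    /// Delete a single file.
    async fn delete(&self, path: &str) -> Result<()>;

    /// Delete many files. The default implementation falls back to deleting
    /// one by one; backends with a native batch API (e.g. S3 DeleteObjects)
    /// can override this with a single round trip.
    async fn delete_batch(&self, paths: &[String]) -> Result<()> {
        for path in paths {
            self.delete(path).await?;
        }
        Ok(())
    }
}
```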
I've mentioned this in the phase 2 implementation plan of the RFC.
liurenjie1024 left a comment:
Thanks @CTTY for this PR, generally LGTM! One missing point: I want the StorageBuilderRegistry to have some built-in StorageBuilders registered when a user creates a new catalog instance. I currently don't have a good solution; one approach would be to have a standalone crate which loads the built-in StorageBuilders when the StorageBuilderRegistry is initialized, and then have the catalog crates depend on it.
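One possible shape for this, sketched under the assumption that the registry keys builders by scheme. All of the type, crate, and feature names here (StorageBuilder::scheme, opendal_storage, object_store_storage, the feature flags) are hypothetical placeholders, not the RFC's API:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical builder trait; the real definition is still being designed.
pub trait StorageBuilder: std::fmt::Debug + Send + Sync {
    /// The URI scheme this builder handles, e.g. "s3" or "gcs".
    fn scheme(&self) -> &str;
}

/// Hypothetical registry that can pre-register built-in builders.
#[derive(Debug, Default)]
pub struct StorageBuilderRegistry {
    builders: HashMap<String, Arc<dyn StorageBuilder>>,
}

impl StorageBuilderRegistry {
    /// Creates a registry with whatever builders were compiled in. The glue
    /// crate hosting this constructor would depend on the storage crates,
    /// so catalog crates only need to depend on the glue crate.
    pub fn with_builtin() -> Self {
        #[allow(unused_mut)] // nothing is registered if all features are off
        let mut registry = Self::default();
        // Hypothetical feature flags and builder types:
        #[cfg(feature = "storage-opendal-s3")]
        registry.register(Arc::new(opendal_storage::S3StorageBuilder::default()));
        #[cfg(feature = "storage-object-store-s3")]
        registry.register(Arc::new(object_store_storage::S3StorageBuilder::default()));
        registry
    }

    pub fn register(&mut self, builder: Arc<dyn StorageBuilder>) {
        self.builders.insert(builder.scheme().to_string(), builder);
    }
}
```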
use iceberg::io::FileIOBuilder;

// Basic usage (same as the existing code)
let file_io = FileIOBuilder::new("s3")
We no longer need FileIOBuilder?
I agree that FileIOBuilder seems excessive. I'm keeping it here for now mainly because I'm uncertain where to keep Extensions, due to serde-related concerns.
After StorageBuilder becomes serializable, we should be able to remove it.
I've explored the serialization issue a bit more, and I realized that the existing FileIOBuilder is not serializable because of pub struct Extensions(HashMap<TypeId, Arc<dyn Any + Send + Sync>>);. And since Any cannot be serialized, it won't make sense to move Extensions to either Storage or StorageBuilder.
We should be able to remove Extensions and the FileIOBuilder once the Storage trait is ready to use, so users can register a storage with a credential loader directly. But I think it's better to address this problem later, when we are stabilizing the Storage trait.
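For context, a small sketch of the problem and one possible direction. The Extensions definition is the existing one quoted above; the StorageBuilderConfig below is purely illustrative, showing that a builder keeping only string properties stays serde-friendly, while anything carrying dyn Any cannot derive Serialize:

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

// The existing type-erased extension map: `dyn Any` has no serde support,
// so a builder embedding this cannot derive Serialize/Deserialize.
pub struct Extensions(HashMap<TypeId, Arc<dyn Any + Send + Sync>>);

// Illustrative alternative: keep only string properties in the serialized
// builder state, and attach non-serializable pieces (credential loaders,
// pre-built clients) at registration time instead.
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct StorageBuilderConfig {
    scheme: String,
    props: HashMap<String, String>,
}
```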
I've mentioned this in the Implementation Plan of the RFC and marked the removal of Extensions and FileIOBuilder as a phase 2 item.
docs/rfcs/0002_storage_trait.md
Outdated
↓
FileIOBuilder::build()
↓
StorageBuilderRegistry::new()
- I think we should remove FileIOBuilder then?
- The StorageBuilderRegistry should be an instance inside the catalog instance?
Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
This reverts commit e7d67d8.
docs/rfcs/0002_storage_trait.md
Outdated
#[derive(Debug, Clone)]
pub struct StorageBuilderRegistry {
I'm thinking maybe we also need a trait for StorageBuilderRegistry, and this trait lives in the iceberg crate:

pub trait StorageBuilderRegistry {
}

The motivation is that we may also need a default concrete registry implementation, which by default loads built-in builders. And the default registry implementation should live in a standalone crate, since it needs to depend on the different storages.
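A slightly filled-in sketch of that idea, with hypothetical method names, just to show the split: the trait (and the builder trait it hands out) would sit in the iceberg crate, while a default implementation that pre-loads built-in builders would live in its own crate:

```rust
use std::sync::Arc;

use iceberg::Result;

/// Hypothetical builder trait in the core crate.
pub trait StorageBuilder: std::fmt::Debug + Send + Sync {}

/// Registry trait in the core `iceberg` crate. Concrete registries, such as
/// a default one that pre-loads built-in builders, live in separate crates
/// so the core crate does not depend on any particular storage backend.
pub trait StorageBuilderRegistry: std::fmt::Debug + Send + Sync {
    /// Returns the builder responsible for a URI scheme ("s3", "gcs", ...).
    fn builder_for(&self, scheme: &str) -> Result<Arc<dyn StorageBuilder>>;

    /// Registers (or replaces) the builder for a scheme.
    fn register(&mut self, scheme: &str, builder: Arc<dyn StorageBuilder>);
}
```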
Xuanwo left a comment:
Thank you for working on this; I'm really happy that the RFC process has been adopted.
```
        ┌───────────────┼───────────────┬───────────────┐
        │               │               │               │
┌───────┴────────────┐  │  ┌────────────┴──────┐   ┌────┴──────────┐
│ crates/storage/    │  │  │ crates/storage/   │   │  Third-Party  │
│   opendal/         │  │  │   object_store/   │   │     Crates    │
│ ┌────────────────┐ │  │  │ ┌───────────────┐ │   │ ┌───────────┐ │
│ │  opendal-s3    │ │  │  │ │  objstore-s3  │ │   │ │  custom   │ │
│ │  impl Storage  │ │  │  │ │  impl Storage │ │   │ │  storage  │ │
│ │  impl Builder  │ │  │  │ │  impl Builder │ │   │ │impl traits│ │
│ └────────────────┘ │  │  │ └───────────────┘ │   │ └───────────┘ │
│ ┌────────────────┐ │  │  │ ┌───────────────┐ │   └───────────────┘
│ │  opendal-fs    │ │  │  │ │  objstore-gcs │ │
│ │  impl Storage  │ │  │  │ │  impl Storage │ │
│ │  impl Builder  │ │  │  │ │  impl Builder │ │
│ └────────────────┘ │  │  │ └───────────────┘ │
│ ┌────────────────┐ │  │  │ ┌───────────────┐ │
│ │  opendal-gcs   │ │  │  │ │ objstore-azure│ │
│ │  impl Storage  │ │  │  │ │  impl Storage │ │
│ │  impl Builder  │ │  │  │ │  impl Builder │ │
│ └────────────────┘ │  │  │ └───────────────┘ │
│ ... (oss, azure,   │  │  └───────────────────┘
│      memory)       │  │
└────────────────────┘  │
```
My current feeling is that we don't need to add impls for every service; instead, we add impls for every provider. One provider can provide multiple impls, and that can simplify a lot of our work.
For example, opendal can provide S3, GCS, OSS, Hf, and more, while object_store supports S3 and GCS. We can also have a fs implementation that only provides FS support. And maybe in the future, we can have hdfs which can provide hdfs-s3, hdfs-gcs and more.
This design matches the actual status better.
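Roughly, the provider-level approach could look like the sketch below: one Storage implementation per provider, dispatching internally on the path's scheme. The OpenDalStorage and Backend names are hypothetical placeholders, not OpenDAL's or this RFC's API:

```rust
use std::collections::HashMap;

use iceberg::{Error, ErrorKind, Result};

/// Placeholder for a provider-specific client (e.g. a configured operator).
#[derive(Debug)]
pub struct Backend;

/// Hypothetical provider-level storage: a single Storage impl backed by one
/// provider, serving every scheme that the provider was configured for.
#[derive(Debug, Default)]
pub struct OpenDalStorage {
    /// One configured backend per scheme, e.g. "s3", "gs", "oss", "file".
    backends: HashMap<String, Backend>,
}

impl OpenDalStorage {
    fn scheme_of(path: &str) -> &str {
        // "s3://bucket/key" -> "s3"; paths without a scheme fall back to local fs.
        path.split_once("://").map(|(scheme, _)| scheme).unwrap_or("file")
    }

    /// Resolves the backend that should serve the given path.
    fn backend(&self, path: &str) -> Result<&Backend> {
        let scheme = Self::scheme_of(path);
        self.backends.get(scheme).ok_or_else(|| {
            Error::new(
                ErrorKind::Unexpected,
                format!("no backend configured for scheme `{scheme}`"),
            )
        })
    }
}
```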
I think this thread on the old google design doc is related: https://docs.google.com/document/d/1-CEvRvb52vPTDLnzwJRBx5KLpej7oSlTu_rg0qKEGZ8/edit?disco=AAABrRO9Prk
This design decision definitely deserves another round of discussion. Whether we use a single, general Storage type (e.g., OpenDALStorage) or multiple scheme-specific types (OpenDALS3Storage, OpenDALOSSStorage, etc.) depends on how we define Storage. Is it essentially an Iceberg wrapper around something like object_store::ObjectStore, or is it intended to serve as a higher-level Iceberg abstraction that operates across different storage backends?
If it's more of a thin wrapper, then having multiple scheme-specific types makes more sense. If it's meant to be a higher-level abstraction, a unified multi-scheme implementation feels more appropriate.
> We can also have a fs implementation that only provides FS support
I'm feeling that having a "default" Storage will be a must. Just like we have MemoryCatalog in the iceberg crate, we will need to at least allow users to do basic fs operations in places like the iceberg crate where concrete Storage implementations like ObjectStoreStorage are unavailable.
> I'm feeling that having a "default" Storage will be a must. Just like we have MemoryCatalog in the iceberg crate, we will need to at least allow users to do basic fs operations in places like the iceberg crate where concrete Storage implementations like ObjectStoreStorage are unavailable.
I don't think so. Users should always need to pick a Storage impl to do IO.
docs/rfcs/0002_storage_trait.md
Outdated
- pub trait Storage
- pub trait StorageBuilder
- pub struct StorageBuilderRegistry
As mentioned in the previous comment, we should only have one trait that provides the combined functionality of Storage and StorageBuilderRegistry. The new storage abstraction should handle s3://path/to/file directly.
Just for my understanding: do you mean that we don't need a registry? Instead, we can just configure a general Storage implementation that is able to handle multiple storage schemes like s3/gcs/...?
Yes, users just need to pick a Storage implementation. How many services, and what kinds of services, that impl supports is up to them.
This design makes it easier for users to compose different storage services on their own.
I've listed this in the Open Questions section of the RFC, and marked this as a Phase 1 item.
Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?