Open
Description
I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.
For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:
SELECT * FROM poloclub/diffusiondb
WHERE contains(prompt, 'ceo')
This, combined with huggingface/dataset-viewer#398, would open the door for a lot of cool applications of gradio + datasets: users could interactively explore datasets that don't fit on their machines and create Spaces without having to download or store large datasets.
I see that data can be streamed from duckdb with pyarrow: https://duckdb.org/2021/12/03/duck-arrow.html. I wonder if this can be leveraged for this use case.