- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training)
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it
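The `loop` and `shuffle` semantics can be pictured with a plain Python generator (a conceptual sketch only; the `read` function below is a hypothetical stand-in, not this library's actual reader API):

```python
import random
from itertools import islice

def read(data, loop=False, shuffle=False, seed=0):
    """Yield items from `data`; optionally shuffle each pass and cycle forever."""
    rng = random.Random(seed)
    while True:
        items = list(data)
        if shuffle:
            rng.shuffle(items)  # reshuffled on every pass when looping
        yield from items
        if not loop:
            break

# One finite pass, original order:
assert list(read([1, 2, 3])) == [1, 2, 3]

# With loop=True the stream is infinite; take the first six items:
assert list(islice(read([1, 2, 3], loop=True), 6)) == [1, 2, 3, 1, 2, 3]
```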
- In `multiprocessing` mode, file-based readers now read the data in the workers (previously this was an option)
- We now support two new special batch sizes:
  - `"fragment"`, for parquet datasets: each batch contains the rows of one full parquet file fragment
  - `"dataset"`, which is mostly useful during training, for instance to shuffle the dataset at each epoch. These batch sizes are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
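The `batch_size="fragment"` behavior can be pictured as grouping rows by the file fragment they were read from. This pure-Python sketch stands in for the actual parquet machinery (file names and the `fragment_batches` helper are illustrative, not part of the library):

```python
from itertools import groupby

# Each row is tagged with the parquet fragment it was read from.
rows = [
    ("part-0.parquet", {"id": 1}),
    ("part-0.parquet", {"id": 2}),
    ("part-1.parquet", {"id": 3}),
]

def fragment_batches(tagged_rows):
    """Yield one batch per source fragment (batch_size="fragment" semantics)."""
    for fragment, group in groupby(tagged_rows, key=lambda r: r[0]):
        yield fragment, [row for _, row in group]

batches = list(fragment_batches(rows))
assert [f for f, _ in batches] == ["part-0.parquet", "part-1.parquet"]
assert [len(b) for _, b in batches] == [2, 1]  # one batch per fragment
```

Because each output batch maps one-to-one to an input fragment, a batched writer can emit one matching output fragment per batch.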
- :boom: Breaking change: a `map` function returning a list or a generator is no longer automatically flattened. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users, since most writers (`to_pandas`, `to_polars`, `to_parquet`, ...) still flatten the output.
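The effect of this flattening change can be illustrated with plain lists (a conceptual sketch; `map_items` and `flatten` below are simplified stand-ins for the library's pipeline steps):

```python
def map_items(fn, items):
    # A map step now yields exactly what fn returns -- lists stay nested.
    return [fn(x) for x in items]

def flatten(nested):
    # Explicit flattening, as an opt-in flatten() step would do.
    return [y for xs in nested for y in xs]

pairs = map_items(lambda x: [x, x], [1, 2])
assert pairs == [[1, 1], [2, 2]]          # no implicit flattening anymore
assert flatten(pairs) == [1, 1, 2, 2]     # opt in explicitly
```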
- :boom: Breaking change: the `chunk_size` and `sort_chunks` parameters are now deprecated: to sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`
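The suggested migration, sorting each batch before the transformation, can be sketched as follows (`map_batches` here is a simplified pure-Python stand-in for the library's method):

```python
def map_batches(fn, items, batch_size):
    """Apply fn to consecutive fixed-size batches and concatenate the results."""
    out = []
    for i in range(0, len(items), batch_size):
        out.extend(fn(items[i:i + batch_size]))
    return out

def custom_sort_fn(batch):
    return sorted(batch)  # sort within each batch, as sort_chunks used to

# Each 3-item batch is sorted independently, then results are concatenated:
assert map_batches(custom_sort_fn, [3, 1, 2, 6, 5, 4], batch_size=3) == [1, 2, 3, 4, 5, 6]
```

Note that, as with the old `sort_chunks`, the sort is local to each batch: items are not globally sorted across batch boundaries.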