@Logiquo Logiquo commented Dec 25, 2025

Contributor: Yongda Fan (yongdaf2@illinois.edu)

Contribution Type: Dataset

Description

  • Temporary directories (including dask-worker-space) are now cleaned up correctly after processing.
  • Cached data is cleared if processing crashes, so no corrupted cache files are left behind.
  • Better notebook support: processing always runs in a single process when a notebook is detected.
  • The task transformation result is now cached, so different processors can be used with .set_task without recomputing the entire dataset.
  • SampleDataset now needs to be closed with .close() after use to clean up its underlying data. It also supports use as a context manager.

e.g.

with dataset.set_task(...) as sample_dataset:
    ....

or

sample_dataset = dataset.set_task(...)
...
sample_dataset.close()
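
The two usage forms above follow the usual Python resource-management pattern. Below is a minimal sketch of that pattern, not the actual implementation: `SampleDatasetSketch` is a hypothetical stand-in class, and the only assumption is that the object owns a scratch directory that must be removed when it is closed.

```python
import shutil
import tempfile
from pathlib import Path


class SampleDatasetSketch:
    """Hypothetical stand-in for an object that owns on-disk scratch data."""

    def __init__(self) -> None:
        # Scratch directory owned by this object (assumed stand-in for cached samples).
        self._tmpdir = Path(tempfile.mkdtemp(prefix="sample_dataset_"))
        self._closed = False

    @property
    def path(self) -> Path:
        return self._tmpdir

    def close(self) -> None:
        # Idempotent: calling close() more than once is harmless.
        if not self._closed:
            shutil.rmtree(self._tmpdir, ignore_errors=True)
            self._closed = True

    def __enter__(self) -> "SampleDatasetSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if the with-body raised; returning None re-raises the error.
        self.close()


# Context-manager form: cleanup happens when the block exits.
with SampleDatasetSketch() as ds:
    scratch = ds.path
    (scratch / "samples.bin").write_bytes(b"\x00")
cleaned_up = not scratch.exists()

# Explicit form: try/finally gives the same guarantee as the with-block.
ds2 = SampleDatasetSketch()
try:
    scratch2 = ds2.path
finally:
    ds2.close()
```

With the explicit form, wrapping the work in try/finally is what keeps the scratch data from leaking if an exception is raised before .close() is reached.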

Benchmark
Measured on a Ryzen 5950X with 64GB of memory and an NVMe disk, using MIMIC4 with the mortality prediction task. The first row (41d190a) is the commit before this change.

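| __init__(num_worker) | .set_task(num_worker) | memory usage | total runtime | task transform runtime |
| --- | --- | --- | --- | --- |
| 41d190a | | 16GB | 180min+ | 180min+ |
| 1 | 1 | 16GB | 167min | 141min |
| 2 | 2 | 16GB | 77min | 63min |
| 4 | 4 | 32GB | 41min | 31min |
| 8 | 8 | 64GB | 26min | 18min |
| 8 | 16 | 64GB | 21min | 14min |
| 8 | 32 | 64GB | 16min | 8min |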
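
To make the scaling easier to read, the sketch below computes speedups relative to the single-worker row. The numbers are copied from the benchmark rows above; the variable names are illustrative only, and the 41d190a row is left out because its runtimes are only lower bounds (180min+).

```python
# (init_workers, set_task_workers, total_min, task_transform_min),
# copied from the benchmark rows above.
rows = [
    (1, 1, 167, 141),
    (2, 2, 77, 63),
    (4, 4, 41, 31),
    (8, 8, 26, 18),
    (8, 16, 21, 14),
    (8, 32, 16, 8),
]

# Use the single-worker run as the baseline for both timings.
base_total, base_task = rows[0][2], rows[0][3]

speedups = [
    (init_w, task_w, base_total / total, base_task / task)
    for init_w, task_w, total, task in rows
]

for init_w, task_w, total_x, task_x in speedups:
    print(
        f"__init__={init_w:>2} set_task={task_w:>2} "
        f"total speedup={total_x:.2f}x task transform speedup={task_x:.2f}x"
    )
```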

@Logiquo Logiquo requested a review from jhnwu3 December 25, 2025 12:11
@Logiquo Logiquo marked this pull request as draft December 25, 2025 19:31
@Logiquo Logiquo marked this pull request as ready for review December 26, 2025 01:39
@Logiquo Logiquo added the core (Core functionality: Patient API, BaseDataset, event stream format, etc.) and bug (Something isn't working) labels Dec 26, 2025

@jhnwu3 jhnwu3 left a comment

Tested, and lgtm! Will test more as I benchmark the process on different machines when I can.

@jhnwu3 jhnwu3 merged commit 82a35ea into sunlabuiuc:master Dec 27, 2025
1 check passed
@Logiquo Logiquo deleted the fix-dataset-tmpdir branch December 27, 2025 05:10