Description
When running DataFlow pipelines (e.g., Reasoning Pipeline, Text Pipeline), an unexpected interruption (network failure, crash, etc.) currently forces a full rerun from step 0, even if intermediate JSONL outputs already exist. This wastes time and compute resources.
Expected Behavior
- On startup, scan the output directory for existing `step_*.jsonl` files and determine the highest completed step index.
- Introduce a `--resume` flag (or enable automatic resume) that skips all already finished steps and proceeds from the next one.
- Update the documentation with usage examples for this resume feature.
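
To make the scanning step concrete, here is a minimal sketch of how the resume point could be computed. It is not DataFlow's actual code: the function name `find_resume_step` is hypothetical, and it assumes outputs are named exactly `step_<N>.jsonl` with steps numbered from 0. It also resumes after the longest unbroken run of steps rather than after the single highest index, so a missing intermediate file is not skipped over.

```python
import re
from pathlib import Path

# Assumed naming scheme: step_0.jsonl, step_1.jsonl, ...
STEP_FILE = re.compile(r"^step_(\d+)\.jsonl$")


def find_resume_step(output_dir):
    """Return the index of the first step that still needs to run.

    Hypothetical helper, not part of DataFlow's API.
    """
    out = Path(output_dir)
    if not out.is_dir():
        return 0  # nothing has been written yet, start from scratch

    done = set()
    for path in out.iterdir():
        match = STEP_FILE.match(path.name)
        # Ignore empty files: they suggest a step that died before writing.
        if match and path.is_file() and path.stat().st_size > 0:
            done.add(int(match.group(1)))

    # Walk forward from step 0 and stop at the first missing output.
    step = 0
    while step in done:
        step += 1
    return step
```

With `output/step_0.jsonl` through `output/step_3.jsonl` present, `find_resume_step("output/")` returns 4. One caveat: a file that was only partially written before the crash is non-empty and would still count as finished here, so a real implementation would probably need a completion marker or an atomic rename when each step ends.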
Usage Example
# Assuming output/step_0.jsonl … output/step_3.jsonl already exist
dataflow run reasoning-pipeline \
--input data/questions.jsonl \
--output output/ \
--resume
# Should automatically resume from step_4