Stage 1 training is too slow.

We reproduced Stage 1 with your code, using the Qwen1.5B model on 4× A800 80GB GPUs. It took about three days to reach the checkpoint reported in your paper. How did you deal with the slow training speed?