Demo video: dLLM.decoding.demo.mov
- [2025-10-14] Dream integration coming soon!
- Extremely Greedy Parallel strategy: compares the predicted tokens with the reference answer and remasks only the tokens that do not match (a minimal sketch follows this list).
- Use a trained filter
- Upon detection of an
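For concreteness, here is a minimal sketch of the comparison-and-remask rule described above. The function name, mask ID, and tensor layout are illustrative assumptions, not the repository's actual API.

```python
# Sketch of the Extremely Greedy Parallel comparison: keep predictions that
# match the reference answer, remask the ones that do not. Names are illustrative.
import torch

def egp_remask(predicted_ids: torch.Tensor,
               reference_ids: torch.Tensor,
               mask_id: int) -> torch.Tensor:
    """Keep tokens that match the reference; remask the ones that do not."""
    keep = predicted_ids.eq(reference_ids)      # True where the prediction is correct
    return torch.where(keep, predicted_ids, torch.full_like(predicted_ids, mask_id))

# Toy example: positions 1 and 3 disagree with the reference and get remasked.
pred = torch.tensor([11, 22, 33, 44])
ref = torch.tensor([11, 99, 33, 77])
print(egp_remask(pred, ref, mask_id=0))         # tensor([11,  0, 33,  0])
```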
Experiments on GSM8K, MATH, HumanEval, and MBPP show that our approach significantly improves throughput (up to 22.58× faster) while maintaining model accuracy, demonstrating strong generalization and practicality. Each method was evaluated at two generation lengths (256 and 1024) across the four datasets. Performance is reported with three metrics: TPS (tokens/sec), speedup, and accuracy. The highest throughput and speedup values for each configuration are highlighted in bold.
- Install dependencies
pip install -r requirements.txt
- Run the program
- Test single questions (a conceptual sketch of the decoding loop follows this list)
python generate.py
- Run evaluations
./eval_llada.sh
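Conceptually, each decoding step predicts every masked position in parallel and the trained filter decides which of those predictions to commit; the remaining positions stay masked for the next step. The sketch below illustrates that loop under assumed interfaces: `parallel_decode`, `model_step`, and `filter_keep` are hypothetical names standing in for the diffusion LM and the filter, not the repository's actual functions (see generate.py for the real implementation).

```python
# Conceptual sketch of filter-guided parallel decoding over a 1-D token sequence.
import torch

def parallel_decode(model_step, filter_keep, x: torch.Tensor,
                    mask_id: int, max_steps: int = 64) -> torch.Tensor:
    """Iteratively fill masked positions, committing only filter-approved tokens."""
    for _ in range(max_steps):
        masked = x.eq(mask_id)
        if not masked.any():                      # every position is decoded
            break
        pred_ids, features = model_step(x)        # predict all positions in parallel
        keep = filter_keep(features) & masked     # filter picks tokens to commit now
        if not keep.any():                        # avoid stalling: commit one token
            keep = torch.zeros_like(masked)
            keep[int(masked.nonzero()[0])] = True
        x = torch.where(keep, pred_ids, x)
    return x

# Toy demonstration with dummy callables ("always predict 7", keep everything).
x0 = torch.tensor([5, 0, 0, 9])                   # 0 plays the role of [MASK]
out = parallel_decode(lambda t: (torch.full_like(t, 7), None),
                      lambda feats: torch.ones(4, dtype=torch.bool),
                      x0, mask_id=0)
print(out)                                        # tensor([5, 7, 7, 9])
```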
- Download the FLAN dataset to small_model_train/flan
- Run the following script
./generate_training_data.sh
You can use training.ipynb directly to train new filter models on your own datasets.
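As a rough illustration only, a filter of this kind can be as small as a per-token binary classifier. The sketch below trains one on synthetic features with PyTorch; the feature dimension, architecture, and labels are assumptions for the example, not the contents of training.ipynb.

```python
# Minimal sketch: train a small binary "keep this token?" filter with PyTorch.
# Feature dimension, architecture, and synthetic data are assumptions.
import torch
import torch.nn as nn

feat_dim = 16                                   # per-token feature size (assumed)
filter_model = nn.Sequential(                   # tiny MLP filter
    nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Synthetic training data: per-token features and EGP-style 0/1 labels
# (1 = the prediction matched the reference, i.e. safe to decode in parallel).
features = torch.randn(1024, feat_dim)
labels = torch.randint(0, 2, (1024, 1)).float()

opt = torch.optim.Adam(filter_model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(filter_model(features), labels)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```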
We would like to thank the authors of LLaDA and Fast-dLLM for their excellent work and open-source contributions.
If you find our work useful, please consider citing our paper.
@misc{bao2025learningparallelacceleratingdiffusion,
title={Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding},
author={Wenrui Bao and Zhiben Chen and Dan Xu and Yuzhang Shang},
year={2025},
eprint={2509.25188},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.25188},
}
