@Ningzhi-Wang pointed out that using a different initialization algorithm might affect the autoregressive baseline.