@@ -86,7 +86,7 @@ <h1><strong>ROBOSE</strong>: A Simple yet Effective Dual System for Robot Learni
<div class="col-md-8 col-md-offset-2">
<h3>Abstract</h3>
<p class="text-justify">
- ROBOSE proposes a simple yet effective dual system for robot learning by hierarchically integrating a high-level Multimodal Large Language Model (MLLM) with a low-level policy model. Through pre-alignment, prompt tuning, and multimodal reasoning learning, ROBOSE significantly enhances generalization, reduces training cost, and achieves state-of-the-art results on challenging robot manipulation benchmarks such as CALVIN and CALVIN-D.
+ Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but too little open-source work exists to support further performance analysis and optimization. To address this, the paper summarizes and compares the structural designs of existing dual-system architectures and systematically evaluates their core design elements empirically. It also provides a low-cost open-source model for further exploration. The project will continue to be updated with more experimental conclusions and improved open-source models.
ROBOSE bridges the high-level MLLM and the low-level policy using a learned <ACT> token and linear projection. Prompt tuning allows efficient training without altering MLLM parameters. An auxiliary task ensures the MLLM performs multimodal reasoning by predicting actions directly from its latent embeddings. The policy model adopts a diffusion-based learning mechanism conditioned on visual, proprioceptive, and goal features.
+ OpenHelix bridges the high-level MLLM and the low-level policy using a learned <ACT> token and linear projection. Prompt tuning allows efficient training without altering MLLM parameters. An auxiliary task ensures the MLLM performs multimodal reasoning by predicting actions directly from its latent embeddings. The policy model adopts a diffusion-based learning mechanism conditioned on visual, proprioceptive, and goal features.
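The bridging step described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the toy dimensions, the token id, and the hand-written weights are all hypothetical, and the MLLM is replaced by fixed hidden states. It only shows the idea of reading the hidden state at the `<ACT>` position and passing it through a linear projection to obtain a goal feature for the low-level policy.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one vector x (weight has shape out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def act_token_goal(hidden_states, token_ids, act_id, weight, bias):
    """Project the hidden state at the <ACT> position into a goal feature.

    Assumes (hypothetically) that the prompt contains exactly one <ACT> token.
    """
    pos = token_ids.index(act_id)
    return linear(hidden_states[pos], weight, bias)


# Toy values standing in for a frozen MLLM's output; all are assumptions.
ACT_ID = 99                      # hypothetical id of the learned <ACT> token
token_ids = [5, 17, ACT_ID]      # prompt tokens, with <ACT> appended last
hidden_states = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [1.0, 2.0, 3.0, 4.0],        # hidden state at the <ACT> position
]
weight = [[1.0, 0.0, 0.0, 0.0],  # 2 x 4 projection: hidden dim 4 -> goal dim 2
          [0.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.5]

goal = act_token_goal(hidden_states, token_ids, ACT_ID, weight, bias)
```

In a real system the projection weights and the `<ACT>` embedding would be learned, and `goal` would be one of the conditioning inputs to the diffusion policy alongside visual and proprioceptive features.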
ROBOSE outperforms previous methods across all metrics in CALVIN, CALVIN-E (language generalization), and CALVIN-D (dynamic vision generalization). It achieves better success rates with fewer parameters and less training data, validating its data efficiency and robustness.
+ OpenHelix outperforms previous methods across all metrics in CALVIN, CALVIN-E (language generalization), and CALVIN-D (dynamic vision generalization). It achieves better success rates with fewer parameters and less training data, validating its data efficiency and robustness.
title={ROBOSE: A Simple yet Effective Dual System for Robot Learning},
+ <pre><code>@article{Cui2024OPENHELIX,
+ title={OpenHelix: A Simple yet Effective Dual System for Robot Learning},
author={Can Cui and Pengxiang Ding and Wenxuan Song and Hangyu Liu and Yang Liu and Bofang Jia and Han Zhao and Siteng Huang and Zhaoxin Fan and Donglin Wang},