This repository contains a clean codebase for controlling bimanual UR5e robots for dexterous manipulation tasks. It supports:
- Quest Teleoperation - Using Meta Quest 2 controllers for real-time control
- BC Policy Deployment - Running trained behavior cloning policies
Clone this repository, then clone the required sub-repositories into it.
```bash
git clone https://github.com/x-robotics-lab/skill-teleop.git
cd skill-teleop
git clone https://github.com/x-robotics-lab/minbc.git
```

Create a conda environment and install the requirements:
```bash
conda create -n screw_driver python=3.10
conda activate screw_driver
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements.txt
```

For teleoperation, install oculus_reader:

```bash
pip install oculus_reader
```

All hardware-specific settings are centralized in `config.py`. This includes:
- Network Configuration - IP addresses and ports
- Robot Configuration - Joint limits, IPs, control parameters
- Camera Configuration - Serial numbers and image settings
- Reset Positions - Default joint positions for different tasks
- XHand Configuration - Gripper settings and policy paths
- Data Collection - Save directories and formats
Edit `config.py` to change hardware settings:

```python
# Example: Change robot IP addresses
ROBOT_CONFIG.left_robot_ip = "192.168.1.10"
ROBOT_CONFIG.right_robot_ip = "192.168.2.10"
# Example: Change camera
CAMERA_CONFIG.wrist_camera_id = "your_camera_serial_number"
# Example: Change data directory
DATA_CONFIG.default_data_dir = "/your/custom/path"
```

Edit `config.py` to change checkpoints:

```python
# Example: Change RL policy from dexscrew.
self.screw_task_policy_path: str = ("/your/rl/policy/path")
# Example: Change BC checkpoint from minbc.
default_checkpoint_path: str = ("/your/bc/policy/path")
```

The system uses a client-server architecture via ZMQ for communication between hardware components.
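As a rough illustration of this pattern (not the repository's actual node code; the socket types, port, and message schema below are assumptions), a hardware node can expose observations over a ZMQ reply socket while a client polls it:

```python
import zmq

# Illustrative client-server sketch; the real nodes in launch_nodes.py /
# run_env.py may use different socket types and message formats.
def server(port: int = 5000):
    sock = zmq.Context().socket(zmq.REP)
    sock.bind(f"tcp://*:{port}")
    while True:
        request = sock.recv_pyobj()  # e.g., {"cmd": "get_obs"}
        sock.send_pyobj({"joint_positions": [0.0] * 12})  # placeholder observation

def client(port: int = 5000):
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect(f"tcp://localhost:{port}")
    sock.send_pyobj({"cmd": "get_obs"})
    return sock.recv_pyobj()
```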
Start the camera and robot servers:
```bash
python launch_nodes.py --robot_type bimanual_ur --use_faster_camera
```

Arguments:
- `--robot_type`: Robot type (default: `bimanual_ur`)
- `--use_faster_camera`: Use buffered camera mode for better performance (default: `True`)
- `--hostname`: Server hostname (default: from `config.py`)
- `--image_size`: Image size tuple (default: from `config.py`)
Configuration: Robot IPs are set in `config.py`:
- Left robot: `ROBOT_CONFIG.left_robot_ip` (default: `192.168.1.3`)
- Right robot: `ROBOT_CONFIG.right_robot_ip` (default: `192.168.2.3`)
Run Quest teleoperation with data saving:

```bash
python run_env.py --agent quest --hz 20 --save_data
```

For BC training, please refer to minbc. To convert your collected data into the BC dataset format, use the script `./scripts/convert_to_bc_dataset.py`. Make sure to update the `SRC_DIR` and `DST_DIR` variables inside the script to match your data locations, as sketched below.
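The path values below are placeholders for illustration; only the variable names `SRC_DIR` and `DST_DIR` come from the script:

```python
# scripts/convert_to_bc_dataset.py -- example edit (paths are placeholders)
SRC_DIR = "/path/to/collected/teleop/data"    # episodes recorded via run_env.py --save_data
DST_DIR = "/path/to/minbc/data/your_dataset"  # destination in the BC dataset format
```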
Run a trained BC policy:

```bash
python run_env.py --agent bc --hz 20 --bc_checkpoint_path /path/to/model.ckpt
```

Common Arguments:
- `--agent`: Agent type (`quest` or `bc`)
- `--hz`: Control frequency in Hz (default: 20)
- `--robot_port`: Robot ZMQ port (default: from `config.py`)
- `--base_camera_port`: Camera ZMQ port (default: from `config.py`)
- `--hostname`: Server hostname (default: from `config.py`)
Data Collection Arguments:
- `--save_data`: Enable data saving
- `--save_depth`: Save depth images (default: `True`)
- `--save_png`: Save RGB images as PNG (default: `False`)
- `--data_dir`: Data save directory (default: from `config.py`)
BC Agent Arguments:
- `--bc_checkpoint_path`: Path to BC model checkpoint (default: from `config.py`)
- `--bc_use_async`: Use asynchronous policy execution (default: `False`)
- `--bc_num_diffusion_iters`: Number of diffusion iterations (default: 5)
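For intuition, asynchronous execution typically runs the (slow) policy inference in a background thread while the control loop keeps using the most recent prediction. A minimal sketch of that pattern, not this repository's actual `--bc_use_async` implementation:

```python
import queue
import threading

class AsyncPolicy:
    """Sketch of async policy execution (illustrative; the real code may differ)."""

    def __init__(self, policy):
        self.policy = policy                # callable: obs -> action
        self.obs_q = queue.Queue(maxsize=1)
        self.latest_action = None
        self.lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            obs = self.obs_q.get()          # wait for a fresh observation
            action = self.policy(obs)       # slow inference, off the control thread
            with self.lock:
                self.latest_action = action

    def act(self, obs):
        try:
            self.obs_q.put_nowait(obs)      # request a new prediction if the worker is idle
        except queue.Full:
            pass                            # worker still busy; reuse the last action
        with self.lock:
            return self.latest_action       # may be None until the first prediction completes
```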
During execution, the following keyboard controls are available:
| Key | Action |
|---|---|
| `L` | Stop execution and exit |
| `R` | Start/trigger data saving (when `--save_data` is enabled) |
| `C` | Mark switch event in saved data |
Quest Controller Buttons:
- Left Trigger: Activate/deactivate arm control
- Left Joystick: Control XHand gripper (BC agent)
- Left Grip: Control gripper (Quest agent)
- Right Joystick: Fine-tune vertical position
- Button A/B: Move up/down
- Button X: Start XHand policy (Quest agent)
- Button Y: Stop XHand policy (Quest agent)
- LJ Button: Grasp (Quest agent)
- LG Button: Release (Quest agent)
To stop all hardware nodes:
```bash
pkill -9 -f launch_nodes.py
```

When `--save_data` is enabled, each frame is saved as a pickle file containing:

```python
{
'base_rgb': np.ndarray, # RGB image(s) from camera(s)
'base_depth': np.ndarray, # Depth image(s) (if enabled)
'joint_positions': np.ndarray, # Joint angles for both arms (12D)
'joint_velocities': np.ndarray, # Joint velocities
'eef_speed': np.ndarray, # End-effector speeds
'ee_pos_quat': np.ndarray, # End-effector poses
'tcp_force': np.ndarray, # TCP force/torque readings
'xhand_pos': np.ndarray, # XHand joint positions (12D)
'xhand_act': np.ndarray, # XHand joint commands (12D)
'xhand_tactile': np.ndarray, # XHand tactile sensor data
'control': np.ndarray, # Action taken at this step
'activated': bool, # Whether control was active
'xhand_rl_flag': bool, # Whether XHand policy was running
'switch': bool, # Switch event flag
}
```

A `freq.txt` file is also saved with timing statistics:

```text
Average FPS: 19.85
Max FPS: 20.12
Min FPS: 19.23
Std FPS: 0.15
```
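To sanity-check a recorded episode, the frames can be loaded back with pickle (the directory path below is an example; the keys match the format above):

```python
import pickle
from pathlib import Path

# Load every frame of one recorded episode (path is a placeholder).
episode_dir = Path("/your/data/dir/episode_000")
for pkl_path in sorted(episode_dir.glob("*.pkl")):
    with open(pkl_path, "rb") as f:
        frame = pickle.load(f)
    print(pkl_path.name, frame["joint_positions"].shape, frame["activated"])
```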
- Create a new robot class in `robots/`:

```python
from robots.robot import Robot

class MyRobot(Robot):
    def num_dofs(self) -> int:
        return 6
    # Implement other required methods...
```

- Add configuration in `config.py`:

```python
@dataclass
class MyRobotConfig:
    robot_ip: str = "192.168.1.100"
```

- Update `launch_nodes.py` to support the new robot type (a possible sketch follows).
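What that update looks like depends on how `launch_nodes.py` dispatches on `--robot_type`; a hypothetical branch might be:

```python
# launch_nodes.py (hypothetical sketch -- adapt to the file's actual dispatch logic)
from robots.my_robot import MyRobot  # the class created above

if args.robot_type == "my_robot":
    robot = MyRobot()
```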
- Create a new agent in `agents/`:

```python
from typing import Any, Dict

import numpy as np

from agents.agent import Agent

class MyAgent(Agent):
    def act(self, obs: Dict[str, Any]) -> np.ndarray:
        action = np.zeros(12)  # Your control logic here
        return action
```

- Update `run_env.py` to add an agent creation function (sketched below).
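A hypothetical creation function, mirroring however the existing agents are constructed in `run_env.py` (names are illustrative):

```python
# run_env.py (hypothetical sketch; follow the file's existing agent-creation pattern)
def create_my_agent(args):
    from agents.my_agent import MyAgent
    return MyAgent()
```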
Edit `config.py`:

```python
@dataclass
class ResetJointPositions:
    left_arm_deg: np.ndarray = np.array([...])   # Your positions
    right_arm_deg: np.ndarray = np.array([...])
```

- Use Faster Camera Mode: `--use_faster_camera` buffers frames for a consistent frame rate
- Adjust Control Frequency: lower `--hz` for more reliable execution
- Disable Camera View: run without `--show_camera_view` for better performance
- Async BC Agent: use `--bc_use_async` for lower latency with BC policies
Camera issues:
- Check the camera connection: `rs-enumerate-devices`
- Verify the serial number in `config.py`: `CAMERA_CONFIG.wrist_camera_id`
- Ensure the RealSense SDK is installed
Robot connection issues:
- Verify the robot IPs in `config.py`
- Check network connectivity: `ping 192.168.1.3`
- Ensure the robots are powered on and in remote-control mode
- Check firewall settings
ZMQ communication issues:
- Ensure `launch_nodes.py` is running before `run_env.py`
- Check that the ports in `config.py` are not in use: `lsof -i :5000`
- Verify the hostname settings
XHand issues:
- Check the EtherCAT connection
- Verify the policy path in `config.py`: `XHAND_CONFIG.screw_task_policy_path`
- Ensure the `xhand_controller` package is installed
Performance issues:
- Reduce the `--hz` value
- Enable `--use_faster_camera`
- Disable `--show_camera_view`
- Check the system load: `htop`
This codebase is based on HATO: Learning Visuotactile Skills with Two Multifingered Hands.
Key dependencies:
- oculus_reader - Quest controller interface
- ur_rtde - UR robot control
- pyrealsense2 - RealSense camera
- xhand_controller - Robot Era's XHand SDK (see its documentation)
A simple and efficient implementation for robot behavior cloning with support for both Vanilla BC and Diffusion Policy.
- 🚀 Two Policy Types: Vanilla BC and Diffusion Policy
- 🖼️ Flexible Vision Encoders: Support for DINOv3, DINO, CLIP, or train from scratch
- 🎯 Multi-Modal Input: RGB images, joint positions, velocities, tactile sensors, etc.
- ⚡ Multi-GPU Training: Efficient distributed training support
- 📊 TensorBoard Logging: Real-time training monitoring
- Python 3.8+
- CUDA-compatible GPU
```bash
pip install -r requirements.txt
```

Required packages:
- PyTorch >= 2.0.0
- torchvision >= 0.15.0
- diffusers >= 0.21.0
- tyro >= 0.5.0
- tensorboard >= 2.13.0
If your task doesn't require vision, use only proprioceptive data:
```bash
# Single GPU
python train.py train \
--gpu 0 \
--data.data-key joint_positions joint_velocities eef_speed xhand_pos \
--optim.batch-size 128 \
--optim.num-epoch 300
```

Benefits: no vision encoders to configure, faster training, and lower GPU memory use.
Please refer to the dinov3 repository for the model and its available checkpoints.
```bash
# Single GPU with DINOv3
python train.py train \
--gpu 0 \
--data.im-encoder DINOv3 \
--data.dinov3-model-dir /path/to/dinov3 \
--data.dinov3-weights-path /path/to/dinov3/dinov3.ckpt \
--optim.batch-size 64 \
--optim.num-epoch 300

# Or use DINO (auto-downloads from PyTorch Hub)
python train.py train \
--gpu 0 \
--data.im-encoder DINO \
--optim.batch-size 64
```

```bash
# Edit train.sh to set your GPU IDs
vim train.sh
# Run multi-GPU training
bash train.sh
```

Or directly:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=2 \
train.py train \
--gpu 0,1 \
--multi-gpu \
--optim.batch-size 256 \
--optim.num-epoch 300
```

To see all available options:

```bash
python train.py train --help
```

MinBC uses command-line arguments to configure training. There is only one configuration file, configs/base.py, which defines the default values.
Priority: Command-line arguments > Default values in configs/base.py
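For intuition, this is the usual pattern when a frozen dataclass config is exposed through tyro (which is in the requirements); the class and field names below are illustrative, not the real ones from configs/base.py, and the real train.py additionally wraps this in a `train` subcommand:

```python
# Minimal sketch of the CLI-over-defaults pattern with tyro.
from dataclasses import dataclass, field

import tyro

@dataclass(frozen=True)
class OptimConfig:
    batch_size: int = 128   # default, overridden by --optim.batch-size
    num_epoch: int = 30     # default, overridden by --optim.num-epoch

@dataclass(frozen=True)
class Config:
    gpu: str = "0"
    optim: OptimConfig = field(default_factory=OptimConfig)

if __name__ == "__main__":
    cfg = tyro.cli(Config)  # command-line flags take priority over the defaults
    print(cfg)
```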
```text
--gpu STR                    # GPU IDs (e.g., "0" or "0,1,2,3")
--multi-gpu                  # Enable multi-GPU training
--seed INT                   # Random seed (default: 0)
--optim.batch-size INT       # Batch size (default: 128)
--optim.num-epoch INT        # Number of epochs (default: 30)
--optim.learning-rate FLOAT  # Learning rate (default: 0.0002)
--output_name STR            # Experiment name
```

```text
--data.data-key [KEYS...]       # Data modalities to use
                                # Options: img, joint_positions, joint_velocities,
                                #   eef_speed, ee_pos_quat, xhand_pos, xhand_tactile
--data.im-encoder STR           # Vision encoder (only if using 'img')
                                # Options: DINOv3, DINO, CLIP, scratch
--data.dinov3-model-dir STR     # DINOv3 model directory (if using DINOv3)
--data.dinov3-weights-path STR  # DINOv3 weights path (if using DINOv3)
```

```text
--policy-type STR          # Policy type: "bc" (Vanilla BC) or "dp" (Diffusion Policy)
--dp.diffusion-iters INT   # Number of diffusion iterations (default: 100)
--dp.obs-horizon INT       # Observation horizon (default: 1)
--dp.act-horizon INT       # Action horizon (default: 8)
--dp.pre-horizon INT       # Prediction horizon (default: 16)
```

As in standard Diffusion Policy, the model conditions on the last obs-horizon observations, predicts pre-horizon future actions, and executes the first act-horizon of them before replanning.

Override any parameter directly in the command:

```bash
python train.py train \
--gpu 2 \
--optim.batch-size 64 \
--optim.learning-rate 0.0005 \
--data.dinov3-model-dir /your/custom/path
```

Modify configs/base.py to change default values:
```python
# configs/base.py
@dataclass(frozen=True)
class MinBCConfig:
    seed: int = 0
    gpu: str = '0'           # Change default GPU
    data_dir: str = 'data/'  # Change default data path
    ...

@dataclass(frozen=True)
class DataConfig:
    dinov3_model_dir: str = '/your/path/to/dinov3'  # Change default DINOv3 path
    ...
```

Create or modify training scripts like train.sh:

```bash
#!/bin/bash
timestamp=$(date +%Y%m%d_%H%M%S)
python train.py train \
--gpu 0 \
--optim.batch-size 128 \
--optim.num-epoch 300 \
--data.dinov3-model-dir /your/path \
--output_name "exp-${timestamp}"
```

Expected dataset directory structure:

```text
data/
└── your_dataset/
├── train/
│ ├── episode_000/
│ │ ├── step_000.pkl
│ │ ├── step_001.pkl
│ │ └── ...
│ ├── episode_001/
│ └── ...
└── test/
├── episode_000/
    └── ...
```
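A quick way to sanity-check this layout (the split path is an example):

```python
from pathlib import Path

# Count episodes and steps in one dataset split (layout as shown above).
split = Path("data/your_dataset/train")
for episode in sorted(split.glob("episode_*")):
    steps = sorted(episode.glob("step_*.pkl"))
    print(f"{episode.name}: {len(steps)} steps")
```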
Each .pkl file should contain a dictionary with the following keys:

- `action`: numpy array of shape `(action_dim,)` - Robot action at this timestep

Proprioceptive Data:

- `joint_positions`: numpy array of shape `(12,)` - Joint positions
- `joint_velocities`: numpy array of shape `(12,)` - Joint velocities
- `eef_speed`: numpy array of shape `(12,)` - End-effector speed
- `ee_pos_quat`: numpy array of shape `(12,)` - End-effector pose (position + quaternion)
- `xhand_pos`: numpy array of shape `(12,)` - Hand position
- `xhand_tactile`: numpy array of shape `(1800,)` - Tactile sensor data

Visual Data (if using images):

- `base_rgb`: numpy array of shape `(H, W, 3)` - RGB image (default: 240x320x3)
- Values should be in the range [0, 255], dtype: uint8 or uint16
```python
# Example pickle file content
import pickle

import numpy as np

data = {
    'action': np.array([...]),            # Shape: (24,)
    'joint_positions': np.array([...]),   # Shape: (12,)
    'joint_velocities': np.array([...]),  # Shape: (12,)
    'base_rgb': np.array([...]),          # Shape: (240, 320, 3), uint8
}
with open('step_000.pkl', 'wb') as f:
    pickle.dump(data, f)
```

Specify which data modalities to use:

```bash
# With images
python train.py train \
--data.data-key img joint_positions xhand_pos
# Without images (only proprioceptive)
python train.py train \
--data.data-key joint_positions joint_velocities eef_speed
```

Set data paths on the command line:

```bash
python train.py train \
--data-dir /path/to/your/data \
--train-data your_dataset/train \
--test-data your_dataset/test
```

Or modify defaults in configs/base.py:
```python
@dataclass(frozen=True)
class MinBCConfig:
    data_dir: str = '/path/to/your/data'
    train_data: str = 'your_dataset/train/'
    test_data: str = 'your_dataset/test/'
```

Minimal example (proprioception only):

```bash
python train.py train \
--gpu 0 \
--data.data-key joint_positions \
--optim.batch-size 128 \
--optim.num-epoch 100
```

Full proprioceptive training:

```bash
python train.py train \
--gpu 0 \
--data.data-key joint_positions joint_velocities eef_speed xhand_pos \
--optim.batch-size 128 \
--optim.num-epoch 300
```

Vision-based training with DINOv3:

```bash
python train.py train \
--gpu 0 \
--data.data-key img joint_positions xhand_pos \
--data.im-encoder DINOv3 \
--data.dinov3-model-dir /path/to/dinov3 \
--data.dinov3-weights-path /path/to/dinov3/dinov3.ckpt \
--optim.batch-size 64 \
--optim.num-epoch 300
```

Diffusion Policy training:

```bash
python train.py train \
--gpu 0 \
--policy-type dp \
--data.data-key joint_positions joint_velocities \
--dp.diffusion-iters 100 \
--optim.batch-size 64 \
--optim.num-epoch 300
```

Multi-GPU training:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=4 \
train.py train \
--gpu 0,1,2,3 \
--multi-gpu \
--data.data-key img joint_positions xhand_pos \
--data.im-encoder DINO \
--optim.batch-size 256 \
--optim.num-epoch 300
```

Training results are saved to outputs/<output_name>/:

```text
outputs/bc-20251125_143022/
├── config.json # Training configuration
├── model_last.ckpt # Latest model checkpoint
├── model_best.ckpt # Best model (lowest test loss)
├── stats.pkl # Data statistics for normalization
├── norm.pkl # Normalization parameters
├── diff_*.patch # Git diff at training time
└── events.out.tfevents.*  # TensorBoard logs
```
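To peek at a finished run, the artifacts can be loaded directly (a sketch, assuming the checkpoint is a standard torch save and the .pkl files are plain pickles; their exact contents depend on the training code):

```python
import pickle

import torch

run_dir = "outputs/bc-20251125_143022"  # example run directory from above
ckpt = torch.load(f"{run_dir}/model_best.ckpt", map_location="cpu")
with open(f"{run_dir}/norm.pkl", "rb") as f:
    norm = pickle.load(f)
print(type(ckpt), type(norm))
```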
Monitor training with TensorBoard:

```bash
tensorboard --logdir outputs/
# Open browser to http://localhost:6006
```

If the DINOv3 model directory cannot be found, either set the correct path or use a different encoder:

```bash
# Set correct path
python train.py train --data.dinov3-model-dir /correct/path
# Or use DINO (auto-downloads)
python train.py train --data.im-encoder DINO
# Or train without images
python train.py train --data.data-key joint_positions joint_velocities
```

If you run out of GPU memory:

- Reduce the batch size: `--optim.batch-size 32`
- Reduce the prediction horizon: `--dp.pre-horizon 8`
- Use fewer workers (modify `num_workers` in `dp/agent.py`)
- Train without images if they are not needed
If multi-GPU training hangs or fails:

- Set `OMP_NUM_THREADS=1` before `torchrun`
- Use `torchrun` instead of direct `python` execution
- Check the NCCL configuration
- Start Simple: Try training without images first to validate your pipeline
- Data Modalities: Only include necessary data modalities for faster training
- Batch Size: Adjust based on your GPU memory (64-128 for single GPU, 128-256 for multi-GPU)
- Vision Encoder: Use DINO for ease (auto-downloads), DINOv3 for best performance (requires setup)
- Policy Type: Use Vanilla BC for faster training, Diffusion Policy for better performance
- Monitoring: Always check TensorBoard logs to ensure training is progressing
MinBC is adapted from the Diffusion Policy component of HATO, which is itself a simplification of the original Diffusion Policy implementation.
Citation:

```bibtex
@article{hsieh2025learning,
  title={Learning Dexterous Manipulation Skills from Imperfect Simulations},
  author={Hsieh, Elvis and Hsieh, Wen-Han and Wang, Yen-Jen and Lin, Toru and Malik, Jitendra and Sreenath, Koushil and Qi, Haozhi},
  journal={arXiv:2512.02011},
  year={2025}
}
```
If you have any questions, please feel free to contact Yen-Jen Wang and Haozhi Qi.
