RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu*,1, Lingxuan Wu*,1, Bangguo Li1, Hengkai Tan1, Huayu Chen1, Zhengyi Wang1, Ke Xu1, Hang Su1, Jun Zhu1
1Tsinghua University
*denotes equal contribution

Overview

Abstract image

We present the Robotics Diffusion Transformer with 1.2B parameters (RDT-1B), the largest diffusion-based foundation model for robotic manipulation to date. It is pre-trained on the largest multi-robot data collection so far: 46 datasets with over 1M episodes. To boost its bimanual capability, we collected 6K+ episodes (among the largest such collections to date) on the ALOHA dual-arm robot for fine-tuning. RDT sets a new benchmark in dexterity, zero-shot generalization, and few-shot learning. It supports almost all modern manipulator configurations (e.g., dual arms, joint-space and EEF control, and even wheeled locomotion) and is ready for the community to fine-tune on their own robots🚀!

RDT Framework


Heterogeneous action spaces of various robots are embedded into a unified action space for cross-robot model training. Inputs: the low-dimensional proprioception z_t, the noisy action chunk ã_{t:t+T_a}, the control frequency c, and the diffusion timestep k serve as denoising inputs, while image and language inputs serve as conditions. Output: the denoised action chunk a_{t:t+T_a}.
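To make the denoising interface concrete, here is a minimal sketch of DDPM-style sampling over an action chunk, conditioned on proprioception, control frequency, and image/language inputs. The noise schedule, chunk length `T_a=64`, action dimension, and the `model` signature are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def denoise_action_chunk(model, z_t, images, lang_emb, c, T_a=64, act_dim=128, K=100):
    """Iteratively denoise a noisy chunk ã_{t:t+T_a} into actions a_{t:t+T_a}.
    `model` is assumed to predict the added noise given (proprioception z_t,
    noisy chunk, control frequency c, diffusion step k), conditioned on
    image and language inputs."""
    # Illustrative linear beta schedule (values are assumptions, not RDT's).
    betas = np.linspace(1e-4, 0.02, K)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = np.random.randn(T_a, act_dim)  # start from pure Gaussian noise
    for k in reversed(range(K)):
        eps_hat = model(z_t, a, c, k, images, lang_emb)  # predicted noise
        coef = betas[k] / np.sqrt(1.0 - alpha_bars[k])
        mean = (a - coef * eps_hat) / np.sqrt(alphas[k])
        noise = np.random.randn(*a.shape) if k > 0 else 0.0  # no noise at k=0
        a = mean + np.sqrt(betas[k]) * noise
    return a  # denoised action chunk a_{t:t+T_a}

# Dummy conditional noise predictor, just to exercise the loop.
dummy_model = lambda z, a, c, k, img, lang: np.zeros_like(a)
chunk = denoise_action_chunk(dummy_model, z_t=np.zeros(128), images=None,
                             lang_emb=None, c=25.0)
print(chunk.shape)  # (64, 128)
```

The key point is that the entire chunk of future actions is sampled jointly in one reverse-diffusion pass, rather than one action at a time.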

RDT Inference Video Samples

Dexterity

The following videos show that RDT can perform dexterous manipulation tasks. In the sample task, it must push the joystick straight enough to drive the robot dog straight forward; even a slight deviation in the push angle can cause the robot dog to veer off course.

Unseen Object

The following videos show that RDT can zero-shot generalize to unseen objects.

Water level pouring with left hand

Instruction Following

The following videos show that RDT follows the provided instructions and zero-shot generalizes to unseen modalities. RDT is required to pour water into the cup until the water level matches the specified level, 1/3 or 2/3. Note that RDT has only seen the instructions for water levels of little, half, and full in the training data; 1/3 and 2/3 are unseen.

Water level pouring with left hand Water level pouring with right hand

Resulting water levels in 8 trials. Left: pouring water with the left hand, with the water level specified as one-third; right: pouring water with the right hand, with the water level specified as two-thirds.

Unseen Scene

The following videos show that RDT can zero-shot generalize to unseen scenes.

Few-Shot Learning

The following videos show that RDT can learn new skills from only 1-5 demonstrations. Here, a skill refers to an action verb such as "wipe" or "open".

Real-Time Inference

In the experiments, we deliberately limited the movement speed of the robot arm to ensure safety and reduce wear. In fact, RDT's inference frequency (6 action chunks/sec, or 381 actions/sec) is high enough for the robot arm to move at speeds comparable to those of human operators. The following inference videos were recorded without the speed limit on the robot arm.
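One way chunked inference supports real-time control is to hide inference latency behind execution: a background thread samples the next action chunk while the current chunk is streamed to the robot at the control frequency. The sketch below illustrates this producer/consumer pattern; the function names, queue design, and parameters are illustrative assumptions, not the paper's implementation.

```python
import threading
import queue
import time

def run_chunked_control(infer_chunk, execute, n_chunks=3, control_hz=50.0):
    """Stream actions to the robot at `control_hz` while the next action
    chunk is inferred in a background thread."""
    chunks = queue.Queue(maxsize=1)  # at most one chunk buffered ahead

    def producer():
        for _ in range(n_chunks):
            chunks.put(infer_chunk())   # e.g., one diffusion sampling pass
        chunks.put(None)                # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()
    period = 1.0 / control_hz
    executed = 0
    while (chunk := chunks.get()) is not None:
        for action in chunk:            # consume actions one by one
            execute(action)
            executed += 1
            time.sleep(period)          # pace to the control frequency
    return executed

# Illustration with dummy 4-action chunks and a no-op actuator.
total = run_chunked_control(lambda: [0, 1, 2, 3], lambda a: None,
                            n_chunks=2, control_hz=1000.0)
print(total)  # 8
```

As long as one chunk takes longer to execute than the next takes to infer, the robot never stalls waiting on the model.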

Comparison with Baselines

We benchmark our pre-trained RDT model (RDT (ours)) against baselines including RDT without pre-training (RDT (scratch)), OpenVLA, ACT, and Octo.

Here are some sample comparative videos:

Dexterity

⚠️ACT
(wrong direction)

❌OpenVLA
(grab failed)

⚠️RDT (scratch)
(not straight)

✅RDT (ours)
(straight route)

Instruction Following (L, 1/3)

⚠️OpenVLA
(pour failed)

❌Octo
(grab failed)

⚠️RDT (scratch)
(wrong water level)

✅RDT (ours)
(correct water level)

Unseen Scene (unseen room 1)

❌ACT
(grab failed)

⚠️Octo
(pour failed)

❌RDT (scratch)
(grab failed)

✅RDT (ours)
(pour success)

Few-Shot Learning (1-shot)

❌ACT
(grab failed)

❌OpenVLA
(grab failed)

❌RDT (scratch)
(fold failed)

✅RDT (ours)
(fold success)

RDT Repeatability

The following videos demonstrate the robustness and repeatability of RDT: it completes various tasks 5-8 times in a row without any failure.

Ablation Study

We conduct ablation studies on various components of RDT to understand their importance. We consider the following variants:

  1. RDT (ours): the original RDT.
  2. RDT (regress): RDT w/o diffusion modeling. It employs the MSE regression training objective.
  3. RDT (small): RDT w/o large parameters. It has only 166M parameters.
  4. RDT (scratch): RDT w/o pre-training. It is trained from scratch during fine-tuning.

Results show that all components are crucial for the success of RDT.

Citation

If you find our work helpful, please cite us:
@article{liu2024rdt,
    title={RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation},
    author={Liu, Songming and Wu, Lingxuan and Li, Bangguo and Tan, Hengkai and Chen, Huayu and Wang, Zhengyi and Xu, Ke and Su, Hang and Zhu, Jun},
    journal={arXiv preprint arXiv:2410.07864},
    year={2024}
}
Thank you!