RDT2, the sequel to RDT-1B [1], is the first foundation model that can achieve zero-shot deployment on unseen embodiments for simple open-vocabulary tasks like picking, placing, pressing, wiping, etc. This milestone was made possible by multifaceted efforts:
Currently, we have open-sourced code and weights for RDT2-VQ and RDT2-FM. Other components, including data, code, and weights for other models, will be released shortly.
The path to embodied superintelligence requires a new paradigm.
Teleoperation, even at the highest quality and with zero embodiment gap, has significant drawbacks: it is expensive and non-portable, which makes it difficult to reach the diverse scenes and tasks needed to collect data for training a universal model.
Our vision is to break free from these constraints. We imagine a future built on wearable systems that seamlessly capture the richness of human activity at a global scale. This approach won't just gather data; it will mirror the very fabric of how we interact with the physical world, providing the essential foundation for embodied superintelligence.
While the ultimate hardware for this vision is on its way, we take a foundational first step by scaling up embodiment-free human data with grippers.
The original UMI [2], manufactured using 3D printing, lacks the requisite strength for long-term, high-frequency data collection. To address this limitation, we redesigned the mechanical structure. The new product utilizes a robust nylon 66 and glass fiber composite material, fabricated using CNC precision machining. We abandoned the original SLAM tracking method since it frequently fails in texture-less indoor environments. Instead, we adopted an infrared light-based positioning system (HTC VIVE Tracker 3.0 [4]) to track the 6DoF pose of the end-effector.
Since our hardware provides a unified end-effector across robots and humans, the embodiment gap is minimized, and models trained using such UMI data can be zero-shot deployed on any robot arm. No tele-operation. No human data collection. No fine-tuning. It is totally plug-and-play. All you need to do is: purchase the specified camera and gripper, use the correct flange and 3D printed camera bracket for mounting, and align the TCP coordinate system.
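The TCP alignment step amounts to composing the robot's flange pose with a fixed flange-to-TCP offset using homogeneous transforms. A minimal sketch follows; the rotation and translation values are placeholders for illustration, not the actual calibration of our gripper:

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Pose of the robot flange in the base frame (identity rotation, placeholder position).
T_base_flange = make_transform(np.eye(3), [0.4, 0.0, 0.3])

# Fixed flange-to-TCP offset: depends on the mounted gripper; values are placeholders.
T_flange_tcp = make_transform(np.eye(3), [0.0, 0.0, 0.17])

# TCP pose in the base frame: actions predicted in the TCP frame are executed here.
T_base_tcp = T_base_flange @ T_flange_tcp
tcp_position = T_base_tcp[:3, 3]
```

Once this composition matches the frame convention used during data collection, end-effector actions transfer across arms without retraining.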
We manufactured nearly 100 UMIs and distributed them to 100+ real-world home and office scenes for data collection. We collected 10,000+ hours of manipulation data, covering the vast majority of common human manipulation tasks*. Thanks to our hardware's portability and low cost, we can collect the same amount of data at about 1/10 the cost and 5× the speed of teleoperation**. Here, we visualize some example clips from the dataset:
*Due to hardware limitations, we excluded tasks involving water contact, heat contact, or requiring five-finger dexterity. We also removed tasks requiring large quantities of consumables, such as cooking.
**Cost estimates include equipment cost and labor cost. Speed estimates include manipulation speed and speed of transfer between different locations.
The training process can be divided into three stages.
Stage 1
In Stage 1, we trained Qwen2.5-VL-7B-Instruct [3], a VLM pretrained on Internet-scale text and image data, on pure UMI data (i.e., our 10,000-hour UMI dataset). The model accepts two wrist-view fisheye images and a language instruction as input, and outputs discrete action tokens. The action tokens were discretized from continuous robot actions (6DoF end-effector pose and gripper width of both hands) by residual vector quantization (RVQ) [5][6][7].
We took several measures to stabilize VQ training and improve codebook utilization, including factorized codes, cosine similarity, EMA updates, and codebook restart [8][9][10]. We also decoupled the discretization of rotation, translation, and gripper width, as we found this helpful for avoiding conflicts among multiple training objectives. As a result, we efficiently compress an action chunk 0.8 seconds long (30 Hz, i.e., 24 frames) into a fixed length of 27 tokens. At the same level of precision, this length is 1/3 that of FAST [11] and 1/8 that of binning [12][13]. Thus, our model has significantly smaller latency because it needs fewer forward passes to generate an action chunk.
The outcome model in this stage is named RDT2-VQ. It needs to generate 27 tokens autoregressively (i.e., 27 forward passes) to obtain an action chunk.
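The core idea of residual vector quantization can be sketched in a few lines of NumPy: each level quantizes the residual left over by the previous level, so a handful of tokens encode a whole action chunk. The codebook count, codebook size, and action dimensions below are illustrative toys, not the model's actual configuration, and the random codebooks stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chunk: 24 frames (0.8 s at 30 Hz) x 14 action dims
# (6-DoF pose + gripper width for each of two hands), flattened.
chunk = rng.normal(size=24 * 14)

# Toy codebooks: 3 RVQ levels, 256 codes each (sizes are illustrative).
num_levels, codebook_size = 3, 256
codebooks = rng.normal(size=(num_levels, codebook_size, chunk.size))

def rvq_encode(x, codebooks):
    """Greedy residual VQ: each level quantizes what the previous levels missed."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        # Nearest code by Euclidean distance (cosine similarity is another option).
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected code from every level."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

tokens, residual = rvq_encode(chunk, codebooks)
recon = rvq_decode(tokens, codebooks)
```

With trained codebooks, each additional level shrinks the reconstruction error, which is what lets a short fixed-length token sequence represent a continuous action chunk.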
Stage 2
In Stage 2, we replaced the RVQ with a 400M RDT model (an improved version of RDT-1B [1]) as an action expert, which attends to the Qwen backbone's KV during denoising, following the best practice in π0 [14] and π0.5 [15]. The model can generate continuous robot actions without discretization errors through five diffusion denoising steps. We copied the weights from the outcome of Stage 1 into the Qwen backbone, froze it, and trained the RDT model with a flow-matching loss.
The outcome model in this stage is named RDT2-FM. We then mixed a tiny amount of real-robot data from UR and Franka arms with the original UMI data for post-training. We call this post-trained model RDT2-FM-Post to distinguish it from the original. Both models are much faster than RDT2-VQ since they only require one forward pass of the Qwen backbone and five forward passes of the 400M RDT model.
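The flow-matching objective and the five-step sampler can be illustrated with a minimal NumPy sketch. The real action expert is a 400M transformer attending to the VLM's KV cache; the linear "network" and the 14-dimensional action below are purely illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
action_dim = 14  # 6-DoF pose + gripper width, both hands (illustrative)

# Stand-in "action expert": a linear map from (noisy action, time) to velocity.
W = rng.normal(scale=0.01, size=(action_dim + 1, action_dim))

def predict_velocity(x_t, t):
    return np.concatenate([x_t, [t]]) @ W

def flow_matching_loss(action, rng):
    """Rectified-flow style target: move noise x0 toward the data x1 on a straight line."""
    x0 = rng.normal(size=action.shape)   # pure noise
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1 - t) * x0 + t * action      # linear interpolation between noise and data
    v_target = action - x0               # constant velocity along the straight line
    v_pred = predict_velocity(x_t, t)
    return ((v_pred - v_target) ** 2).mean()

def sample(rng, steps=5):
    """At inference, integrate the learned ODE with a few Euler steps (five in RDT2-FM)."""
    x = rng.normal(size=action_dim)
    for i in range(steps):
        x = x + (1 / steps) * predict_velocity(x, i / steps)
    return x

loss = flow_matching_loss(rng.normal(size=action_dim), rng)
action = sample(rng)
```

Because the target velocity is constant along each interpolation path, the trained model can traverse from noise to action in very few integration steps.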
Stage 3
In Stage 3, we distilled RDT2-FM into a one-step diffusion policy without a performance drop, keeping the Qwen backbone frozen. The model can map pure noise directly to robot actions through only a single diffusion step, similar to a GAN generator [16].
Thanks to the effective RVQ and the one-step generator, the inference speed of our 7B models is comparable to, and even exceeds, that of 3B baselines.
The outcome model in this stage is named RDT2-UltraFast. The model is the fastest since it only requires one forward pass of Qwen and one forward pass of the 400M RDT model. This is crucial for many tasks that require real-time responses, such as playing table tennis.
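The distillation idea, stripped to its essentials: train a one-step student to reproduce the action the multi-step teacher produces from the same noise. Everything below is a toy stand-in (a fixed linear velocity field as the teacher and a least-squares linear student), not the actual training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
action_dim = 14

# Toy "teacher": a five-step Euler sampler over a fixed linear velocity field.
W_teacher = rng.normal(scale=0.1, size=(action_dim, action_dim))

def teacher_sample(noise, steps=5):
    x = noise.copy()
    for _ in range(steps):
        x = x + (1 / steps) * (x @ W_teacher)
    return x

# One-step student: a single linear map, fit by least squares on
# (noise, teacher output) pairs generated from the same noise seeds.
noise = rng.normal(size=(1024, action_dim))
targets = np.stack([teacher_sample(n) for n in noise])
W_student, *_ = np.linalg.lstsq(noise, targets, rcond=None)

# The student maps pure noise to an action in one forward pass.
z = rng.normal(size=action_dim)
one_step = z @ W_student
five_step = teacher_sample(z)
```

Since the toy teacher is linear, the student matches it exactly; in practice the one-step generator is a neural network trained against the frozen multi-step RDT2-FM, trading five denoising passes for one.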
Model Family
We list the model family of RDT2 as follows:
More models and code, including reinforcement learning, are coming soon. Stay tuned!
Phase Transition Point
We invite you to witness the highlight moment of this project. Fresh from training, RDT2 demonstrates robust zero-shot generalization under the full "4U" conditions — Unseen embodiment, Unseen scene, Unseen object, and Unseen language. We describe this as a phase transition point: behavior shifts from narrow specialist to genuine generalist.
The system accepts everyday, open-ended instructions and grounds abstract language in physical behavior. While not yet perfect, this milestone is decisive: the scaling direction is correct, and the model already shows the first clear signs of embodied superintelligence.
You can issue simple natural-language instructions, and the model analyzes the concepts and maps them to precise control without task-specific scripting. In contrast, traditional pipelines typically require predefined controllers and known configurations for each task.
Across changes in scene layout, lighting, and surface height, behavior remains stable and consistent: it attends to task-relevant cues (object size, pose, reachable workspace) rather than incidental variation.
Capability Boundary
To get a glimpse of the in-distribution performance of RDT2-UltraFast, we curated a series of six challenging downstream tasks. These benchmarks were designed to probe the limits of the model in real-world scenarios demanding exceptional speed, precision, and adaptability.
The archery challenge, in particular, presents an extreme test of reaction time. The task requires intercepting an arrow shot from a 20-pound bow at a distance of 15 meters, a feat that is exceptionally difficult, if not impossible, for a human. Our model accomplished this formidable objective with a reaction time of about 100 milliseconds (inference time + camera latency), approaching the fastest recorded human reaction times. This achievement demonstrates the model's potential in high-stakes, rapid-response scenarios.
The subsequent two tasks necessitated not only reaction but also high-precision control and fine-grained coordination. In the incense extinguishing task, the model was required to swing the burning incense at high velocity, continuously monitoring its state until the flame was fully extinguished. The execution of this task demanded a trajectory of remarkable continuity and smoothness; any pause or hesitation would give the flame enough time to reignite.
In the table tennis task, our model demonstrated outstanding predictive capabilities. This was accomplished despite the robot arm's maximum velocity of 1 m/s, a figure less than one-tenth of a human's arm speed. To compensate for this physical limitation, the robot must accurately predict the ball's trajectory in advance and plan its own swing path — all of which emerges automatically during end-to-end training without any hand-crafted prior.
Further evaluations were conducted to probe RDT2-UltraFast's proficiency in manipulating deformable objects. Such objects, characterized by their near-infinite degrees of freedom, present substantial challenges to previous paradigms that rely on explicit computer graphics models or motion planning. The inherent complexities suggest that an end-to-end training approach, such as the one employed by RDT2, may be uniquely suited to mastering these tasks.
In the fabric folding task, the model exhibited an ability to understand and manipulate the complex dynamics of cloth. The task required not only precise control but also an intuitive grasp of how the fabric would respond to various manipulations. Moreover, we observed a remarkable degree of generalization: the model autonomously adapted its manipulation strategy to successfully fold previously unseen garments of varying textures and sizes.
The final experiment assessed performance on a long-horizon, multi-stage task: setting a table according to human instructions. While long-horizon tasks are often susceptible to the accumulation of errors, RDT2-UltraFast demonstrated the ability to avoid deviations from the intended plan. We hypothesize that this resilience stems from the extensive and diverse scenarios encountered during pre-training, which effectively mitigates the out-of-distribution (OOD) problem.
Core Team
We are proud to note that all core team members contributed equally to the success of this project.
Other Contributors
Hengkai Tan, Xiao Ouyang, Zhengyi Wang, Huayu Chen
Advisors
Hang Su, Jun Zhu
If you find our work helpful, please cite us:
@software{rdt2,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
Thank you!