MimicPlay: Long-Horizon Imitation Learning by Watching Human Play

CoRL 2023 (Oral)
Best Paper/Best Student Paper Awards Finalist
Best Systems Paper Award Finalist

Stanford, NVIDIA, Georgia Tech, UT Austin, Caltech (Equal Advising)


Imitation learning from human demonstrations is a promising paradigm for teaching robots manipulation skills in the real world, but learning complex long-horizon tasks often requires an impractically large number of demonstrations. To reduce this data requirement, we turn to human play data — video sequences of people freely interacting with the environment using their hands. We hypothesize that, despite the difference in morphology, human play data contain rich and salient information about physical interactions that can readily facilitate robot policy learning. Motivated by this, we introduce a hierarchical learning framework named MimicPlay that learns latent plans from human play data to guide low-level visuomotor control trained on a small number of teleoperated demonstrations. Through systematic evaluations on 14 long-horizon manipulation tasks in the real world, we show that MimicPlay dramatically outperforms state-of-the-art imitation learning methods in task success rate, generalization ability, and robustness to disturbances.


A human can complete a long-horizon task much faster than a teleoperated robot. This observation inspires us to develop MimicPlay, a hierarchical imitation learning algorithm that learns a high-level planner from cheap human play data and a low-level control policy from a small amount of multi-task teleoperated robot demonstrations.


Overview of MimicPlay. (a) Training Stage 1: using cheap human play data to train a goal-conditioned trajectory generation model to build a latent plan space that contains high-level guidance for diverse task goals. (b) Training Stage 2: using a small amount of teleoperation data to train a low-level robot controller conditioned on the latent plans generated by the pre-trained (frozen) planner. (c) Testing: Given a single long-horizon task video prompt (either human motion video or robot teleoperation video), MimicPlay generates latent plans and guides the low-level controller to accomplish the task.
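The two-stage hierarchy described above can be sketched as follows. This is a minimal toy sketch in NumPy with random weights: the class names, dimensions, and linear maps are illustrative assumptions, not the paper's actual architecture (which uses learned visual encoders and a trained trajectory-generation model). It only shows the data flow: a goal-conditioned planner produces a latent plan that decodes to a short 3D hand trajectory, and a plan-conditioned controller emits robot actions while the planner stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

class LatentPlanner:
    """Stage 1 (sketch): goal-conditioned planner trained on human play.
    Maps (current image feature, goal image feature) -> latent plan, and
    decodes the plan into a short 3D hand trajectory."""
    def __init__(self, feat_dim=32, plan_dim=8, horizon=10):
        self.horizon = horizon
        self.W_enc = rng.normal(size=(2 * feat_dim, plan_dim)) / np.sqrt(2 * feat_dim)
        self.W_dec = rng.normal(size=(plan_dim, horizon * 3)) / np.sqrt(plan_dim)

    def plan(self, cur_feat, goal_feat):
        # Encode current + goal features into a latent plan vector
        return np.tanh(np.concatenate([cur_feat, goal_feat]) @ self.W_enc)

    def decode_trajectory(self, z):
        # Decode the latent plan into a horizon-step 3D trajectory
        return (z @ self.W_dec).reshape(self.horizon, 3)

class LowLevelController:
    """Stage 2 (sketch): plan-conditioned visuomotor controller trained on
    a small set of teleoperated robot demonstrations; the planner is frozen."""
    def __init__(self, plan_dim=8, proprio_dim=7, action_dim=7):
        self.W = rng.normal(size=(plan_dim + proprio_dim, action_dim)) * 0.1

    def act(self, z, proprio):
        return np.tanh(np.concatenate([z, proprio]) @ self.W)

planner = LatentPlanner()
controller = LowLevelController()
z = planner.plan(rng.normal(size=32), rng.normal(size=32))
traj = planner.decode_trajectory(z)      # 10-step 3D hand trajectory
action = controller.act(z, np.zeros(7))  # robot action conditioned on the plan
print(traj.shape, action.shape)  # (10, 3) (7,)
```

The key design point this sketch mirrors is the separation of concerns: the planner only ever sees cheap human play data, while the small teleoperation dataset is spent solely on grounding latent plans into robot actions.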


Evaluation Results: long-horizon & sample efficiency

Quantitative evaluation results in the Kitchen environment in terms of the number of successful runs out of all trials. Our approach outperforms prior works in terms of sample efficiency and long-horizon task success rate.


Video results: Ours (0min-human) vs. Ours (10min-human).

Evaluation Results: generalization

Video results: training multiple tasks (Tasks 1–4) within one model — Ours (10min-human + 20 demos).

One-shot generalization to unseen temporal compositions




After training on multiple tasks within one model, MimicPlay generalizes one-shot to new tasks with unseen temporal compositions.

Evaluation Results: multi-task learning





Video results: Ours (10min-human + 20 demos).

MimicPlay shows the smallest performance drop, highlighting its capability to handle diverse tasks within a single model.

Qualitative results of the learned latent plans

Qualitative visualization of the learned latent plans. (a) Trajectory predictions decoded from the latent plans learned by different methods; the trajectory color fades from blue to green as the time step advances from 1 to 10. (b) t-SNE visualization of the latent plans; plans for the same task tend to cluster in the latent space.
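The t-SNE view in (b) can be reproduced in spirit with scikit-learn. The sketch below uses synthetic latent plans (three toy "tasks", each a Gaussian cluster in an 8-D latent space); the number of tasks, samples, and dimensions are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Toy latent plans: three "tasks", each a cluster in an 8-D latent space
plans, labels = [], []
for task_id, center in enumerate(rng.normal(scale=5.0, size=(3, 8))):
    plans.append(center + rng.normal(scale=0.5, size=(50, 8)))
    labels += [task_id] * 50
plans = np.vstack(plans)

# Project the 8-D latent plans to 2-D for visualization
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(plans)
print(emb.shape)  # (150, 2)
```

Plotting `emb` colored by `labels` (e.g. with matplotlib's `scatter`) would show the per-task clustering the caption describes.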

Evaluation Results: robustness against disturbances

Qualitative visualization of the latent plans before and after the disturbance and re-planning. Column 1: third-person view. Column 2: the latent plan before the disturbance. Column 3: human disturbance; the red arrow indicates the direction of the disturbance. Column 4: real-time re-planning, demonstrating robustness against the disturbance. Column 5: the robot recovers with the updated task plan. Video results can be found at the beginning of this page.

An interface for prompting robot motion with human videos

This is an uncut video of prompting robot manipulation with human motion (the robot runs the same trained model throughout). MimicPlay integrates human motion and robot skills into a joint latent plan space, which enables an intuitive interface for specifying robot manipulation goals directly with human videos.
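The prompting interface described above amounts to treating each frame of the human video as a goal image and re-planning toward it. The sketch below illustrates that loop with toy random weights; the encoder, dimensions, and the "3 low-level steps per plan" rate are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, PLAN_DIM, ACT_DIM = 16, 8, 7
W_feat = rng.normal(size=(24 * 24, FEAT_DIM)) / 24.0     # stand-in visual encoder
W_plan = rng.normal(size=(2 * FEAT_DIM, PLAN_DIM))       # frozen high-level planner
W_ctrl = rng.normal(size=(PLAN_DIM + ACT_DIM, ACT_DIM))  # low-level controller

def latent_plan(cur_img, goal_img):
    # Encode current and goal images, then map to a latent plan
    feats = np.concatenate([cur_img.ravel() @ W_feat, goal_img.ravel() @ W_feat])
    return np.tanh(feats @ W_plan)

# A human video prompt is a sequence of goal images: the planner re-plans
# toward each prompt frame, and the controller acts under that plan.
prompt_video = rng.normal(size=(5, 24, 24))  # 5 goal frames (toy data)
cur_img, proprio = rng.normal(size=(24, 24)), np.zeros(ACT_DIM)
actions = []
for goal_img in prompt_video:
    z = latent_plan(cur_img, goal_img)
    for _ in range(3):  # a few low-level control steps per high-level plan
        actions.append(np.tanh(np.concatenate([z, proprio]) @ W_ctrl))
print(len(actions))  # 15
```

Because human and robot videos map into the same latent plan space, the same loop works whether `prompt_video` shows a human hand or a teleoperated robot.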


We introduce MimicPlay, a scalable imitation learning algorithm that exploits the complementary strengths of two data sources: cost-effective human play data and small-scale teleoperated robot demonstration data. From human play data, the high-level planner learns goal-conditioned latent plans by predicting future 3D human hand trajectories given the goal image. From robot demonstration data, the low-level controller learns to generate robot actions from the latent plans. With this hierarchical design, MimicPlay outperforms prior state-of-the-art methods by over 50% on 14 challenging long-horizon manipulation tasks. MimicPlay paves the way for future research on scaling up robot imitation learning at affordable human cost.


@article{wang2023mimicplay,
    title={MimicPlay: Long-horizon imitation learning by watching human play},
    author={Wang, Chen and Fan, Linxi and Sun, Jiankai and Zhang, Ruohan and Fei-Fei, Li and Xu, Danfei and Zhu, Yuke and Anandkumar, Anima},
    journal={arXiv preprint arXiv:2302.12422},
    year={2023}
}