How do you teach a robot to do something like watering a plant or ironing a shirt?
RIGVid: Robots Imitating Generated Videos
Computer science professor Svetlana Lazebnik and Ph.D. student Shivansh Patel from The Grainger College of Engineering Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign say that “traditionally, we must perform laborious, task-specific demonstrations to collect data to teach robots. While recent works try to avoid this by imitating from videos, they need to perfectly match the videos to the robot’s environment, and the video collection process itself remains laborious.”
Looking at the rapid progress of state-of-the-art video generators like OpenAI Sora and Kling AI, they asked: “What if, instead of collecting real videos, we could just generate the demonstration video on the fly? The idea was to generate a single video tailored exactly to the robot’s starting environment and the specific task.” Accordingly, they created a method called RIGVid: Robots Imitating Generated Videos. Collaborating on this project with Lazebnik and Patel are Illinois’ Master of Computer Science student Shraddhaa Mohan, Department of Electrical & Computer Engineering Ph.D. student Hanlin Mai, alumnus Unnat Jain (’22, Ph.D., Computer Science), now at UC Irvine, and Columbia University computer science professor Yunzhu Li.
Lazebnik and Patel explain that, in contrast to the time-consuming process of collecting data and manually recording videos that match a robot’s setup, the team “simply gives a text command, like ‘pour water on the plant,’ and an image of the current scene. The video generation model generates the demonstration video. One of the advantages of our method is that it doesn’t actually involve any training – all the ‘training’ is already encapsulated in the pre-trained video generation model. After the desired video demonstration is generated, all we have to do is extract the object trajectory from the video and imitate it with the robot.”
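In rough pseudocode, the flow looks like the sketch below. This is only an illustrative outline, not released RIGVid code: generate_video, extract_object_trajectory, and execute_trajectory are hypothetical placeholder names standing in for the video generator, the trajectory-extraction step, and the robot controller.

# Hypothetical sketch of the pipeline described above; the three helper
# functions are placeholders, not part of any released API.
import numpy as np

def generate_video(prompt: str, scene_image: np.ndarray) -> np.ndarray:
    """Placeholder for an image-conditioned video generation model."""
    raise NotImplementedError("call your video generator here")

def extract_object_trajectory(video: np.ndarray) -> list[np.ndarray]:
    """Placeholder: recover one 6DoF object pose per video frame."""
    raise NotImplementedError("run depth estimation and pose tracking here")

def execute_trajectory(trajectory: list[np.ndarray]) -> None:
    """Placeholder: command the robot to reproduce the object motion."""
    raise NotImplementedError("send poses to your robot controller here")

scene = np.zeros((480, 640, 3), dtype=np.uint8)  # current RGB view of the scene
video = generate_video("pour water on the plant", scene)
trajectory = extract_object_trajectory(video)
execute_trajectory(trajectory)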
They use a monocular depth estimator to predict a corresponding depth map for every frame of the generated video. The ground-truth depth map from the initial RGB-D image of the real scene is used to “ground” the generated motion in real-world units. So, the robot ultimately relies on both the visual content from the generated video and the corresponding estimated depth data to extract the full six degrees of freedom (6DoF) trajectory of the object.
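The grounding step can be illustrated with a short, self-contained example. A common way to put a monocular depth estimate into metric units is to fit a scale and shift that align it with the real depth map of the first frame; the minimal NumPy sketch below (synthetic data) shows that idea and may differ from the authors’ exact procedure.

# Illustrative only: align an arbitrary-unit monocular depth estimate to
# metric units by least-squares fitting gt_depth ~ scale * est_depth + shift
# over the valid pixels of the first frame, then apply the fit to every frame.
import numpy as np

def align_depth(est_depth: np.ndarray, gt_depth: np.ndarray) -> tuple[float, float]:
    """Fit gt_depth ~ scale * est_depth + shift over valid pixels."""
    valid = (gt_depth > 0) & np.isfinite(est_depth)
    x = est_depth[valid]
    y = gt_depth[valid]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(scale), float(shift)

# Synthetic check: the "estimated" depth is a scaled, shifted version of the
# true metric depth, and the fit recovers that mapping.
metric = np.random.uniform(0.5, 2.0, size=(480, 640))  # meters, from the RGB-D camera
estimated = (metric - 0.1) / 0.4                        # arbitrary-unit monocular estimate
scale, shift = align_depth(estimated, metric)
grounded = scale * estimated + shift                    # now in meters
print(f"scale={scale:.3f}, shift={shift:.3f}")          # approximately 0.400 and 0.100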
RIGVid was shown to outperform alternative robot imitation methods such as Stanford’s ReKep, which uses a vision-language model to predict a sequence of keypoints for the object to follow. The rich detail of a generated video played a crucial role: RIGVid achieved an 85% success rate across tasks, while the keypoint-based method succeeded only 50% of the time.
RIGVid’s approach for extracting an object’s motion from a generated video, based on model-based 6DoF object pose tracking, also outperformed alternative trajectory-extraction methods that use optical flow or sparse keypoints. The improvement was especially apparent on challenging tasks, like sweeping dirt or placing a thin spatula into a pan, where the other methods struggled with object occlusion.
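Because pose tracking yields a full rigid-body pose for the object in every frame, turning the tracked poses into motion the robot can replay is straightforward. The snippet below is a hedged sketch of just that conversion, assuming the tracker returns one 4x4 object pose per frame in the camera frame (an assumption about the interface, not released RIGVid code).

# Illustrative: convert per-frame 4x4 object poses into frame-to-frame
# relative transforms that an end-effector could replay.
import numpy as np

def relative_motions(poses: list[np.ndarray]) -> list[np.ndarray]:
    """Return transforms T_k such that poses[k+1] = T_k @ poses[k]."""
    return [curr @ np.linalg.inv(prev) for prev, curr in zip(poses[:-1], poses[1:])]

# Tiny example: an object that slides 1 cm along +x each frame.
step = np.eye(4)
step[0, 3] = 0.01
poses = [np.linalg.matrix_power(step, k) for k in range(5)]
for T in relative_motions(poses):
    print(T[:3, 3])  # prints [0.01 0. 0.] for every step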
Even better, Lazebnik and Patel note that “RIGVid is embodiment-agnostic, meaning it’s not tied to a single robot. We successfully transferred to a different robot platform and even extended it to bimanual (two-armed) tasks, where it placed a pair of shoes into a box simultaneously.”
See the RIGVid robot in action.
Grainger Engineering Affiliations
Svetlana Lazebnik is an Illinois Grainger Engineering professor of computer science and a Willett Faculty Scholar.