{"id":51185,"date":"2025-07-31T23:35:27","date_gmt":"2025-07-31T23:35:27","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/51185\/"},"modified":"2025-07-31T23:35:27","modified_gmt":"2025-07-31T23:35:27","slug":"teaching-robots-with-generated-video-siebel-school-of-computing-and-data-science","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/51185\/","title":{"rendered":"Teaching robots with generated video | Siebel School of Computing and Data Science"},"content":{"rendered":"<p>How do you teach a robot to do something like watering a plant or ironing a shirt?<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/07\/1754004926_200_viewphoto.aspx\" alt=\"A grid of four images, two with generated AI images of pouring water and a pot, and two with a robot arm matching the action.\" width=\"800\" data-fancy-caption=\"&lt;p&gt;RIGVid: Robots Imitating Generated Videos&lt;\/p&gt;\" loading=\"lazy\"\/>RIGVid: Robots Imitating Generated Videos<\/p>\n<p>Computer science professor <a href=\"https:\/\/siebelschool.illinois.edu\/about\/people\/all-faculty\/slazebni\" rel=\"nofollow noopener\" target=\"_blank\">Svetlana Lazebnik<\/a> and Ph.D. student <a href=\"https:\/\/shivanshpatel35.github.io\/\" rel=\"nofollow noopener\" target=\"_blank\">Shivansh Patel<\/a> from The Grainger College of Engineering Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign say that \u201ctraditionally, we must perform laborious, task-specific demonstrations to collect data to teach robots. 
While recent works try to avoid this by imitating from videos, they need to perfectly match the videos to the robot’s environment, and the video collection process itself remains laborious.”</p>
<p>Looking at the rapid progress of state-of-the-art video generators such as OpenAI Sora and Kling AI, they asked: “What if, instead of collecting real videos, we could just generate the demonstration video on the fly? The idea was to generate a single video tailored exactly to the robot’s starting environment and the specific task.” Accordingly, they created a method called <a href="https://rigvid-robot.github.io/">RIGVid: Robots Imitating Generated Videos</a>. Collaborating with Lazebnik and Patel on this project are Illinois Master of Computer Science student <a href="https://www.linkedin.com/in/shraddhaamohan">Shraddhaa Mohan</a>, Department of Electrical &amp; Computer Engineering Ph.D.
student <a href=\"https:\/\/hanlinmai.web.illinois.edu\/\" rel=\"nofollow noopener\" target=\"_blank\">Hanlin Mai<\/a>, alumnus\u00a0<a href=\"https:\/\/unnat.github.io\/\" rel=\"nofollow noopener\" target=\"_blank\">Unnat Jain<\/a> (&#8217;22, Ph.D., Computer Science), now at\u00a0UC Irvine, and Columbia University CS professor <a href=\"https:\/\/www.engineering.columbia.edu\/faculty-staff\/directory\/yunzhu-li\" rel=\"nofollow noopener\" target=\"_blank\">Yunzhu Li<\/a>.\u00a0<\/p>\n<p class=\"text-center\">\n<p><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/07\/1754004926_92_viewphoto.aspx\" alt=\"Associate Professor Svetlana Lazebnik\" width=\"200\" data-fancy-caption=\"&lt;p&gt;Svetlana Lazebnik&lt;\/p&gt;\" loading=\"lazy\"\/>Svetlana Lazebnik<\/p>\n<p>Lazebnik and Patel explain that, as opposed to the time-consuming process of data collection and manual recording of videos that match a robot\u2019s setup, the\u00a0team \u201csimply gives a text command, like \u2018pour water on the plant,\u2019 and an image of the current scene. The video generation model generates the demonstration video. One of the advantages of our method is that it doesn\u2019t actually involve any training \u2013 all the \u2018training\u2019 is already encapsulated in the pre-trained video generation model. After the desired video demonstration is generated, all we have to do is extract the object trajectory from the video and imitate it with the robot.\u201d<\/p>\n<p>They use a monocular depth estimator to predict a corresponding depth map for every frame of the generated video. The ground-truth depth map from the initial RGB-D image of the real scene is used to &#8220;ground&#8221; the generated motion in real-world units. 
The robot thus relies on both the visual content of the generated video and the corresponding estimated depth to extract the object’s full six-degrees-of-freedom (6DoF) trajectory.</p>
<p><img src="https://www.newsbeep.com/us/wp-content/uploads/2025/07/1754004927_485_viewphoto.aspx" alt="Shivansh Patel" width="200" /> Shivansh Patel</p>
<p>RIGVid was shown to outperform alternative robot-imitation methods such as Stanford’s ReKep, which uses a vision-language model to predict a sequence of keypoints for the object to follow. The rich detail of a generated video proved crucial: RIGVid achieved an 85% success rate across tasks, while the keypoint-based method succeeded only 50% of the time.</p>
<p>RIGVid’s approach to extracting an object’s motion from a generated video, based on model-based 6DoF object-pose tracking, also outperformed alternative trajectory-extraction methods that use optical flow or sparse keypoints. The improvement was most apparent on challenging tasks, such as sweeping dirt or placing a thin spatula into a pan, where the other methods struggled with object occlusion.</p>
<p>Better still, Lazebnik and Patel note that “RIGVid is embodiment-agnostic, meaning it’s not tied to a single robot. We successfully transferred to a different robot platform and even extended it to bimanual (two-armed) tasks, where it placed a pair of shoes into a box simultaneously.”</p>
<p><a href="https://www.youtube.com/watch?v=wYiZjmv3mu8">See the RIGVid robot in action</a>.</p>
<p>Grainger Engineering Affiliations</p>
<p><a href="https://siebelschool.illinois.edu/about/people/all-faculty/slazebni">Svetlana Lazebnik</a> is an Illinois Grainger Engineering professor of computer science.
She is also a Willett Faculty Scholar.</p>
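<p>The depth-grounding step described in the article can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the least-squares scale alignment, and the pinhole camera model are stand-ins for whatever RIGVid actually uses. The idea is that a monocular depth estimate for a generated frame is only relative, so the real RGB-D depth of the initial scene pins it to metric units, after which a tracked object pixel can be lifted into 3D:</p>

```python
import numpy as np

# Hypothetical sketch, NOT the RIGVid implementation: align a relative
# monocular depth estimate to metric units using the real RGB-D depth of
# the first frame, then back-project a tracked pixel into 3D.

def metric_scale(est_first, gt_first):
    """Least-squares scale s minimizing ||gt_first - s * est_first||^2."""
    return float(np.sum(gt_first * est_first) / np.sum(est_first ** 2))

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) into 3D camera coordinates."""
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Toy example: the estimator reports depths at half the true metric scale.
est_first = np.full((4, 4), 1.0)        # estimated depth, first frame
gt_first = np.full((4, 4), 2.0)         # real RGB-D depth, first frame
s = metric_scale(est_first, gt_first)   # recovered scale: 2.0
frame_depth = np.full((4, 4), 1.5) * s  # grounded depth for a later frame
point = backproject(2, 2, frame_depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```

<p>Repeating the back-projection for the tracked object across every generated frame would yield the positional part of the trajectory; recovering the full 6DoF trajectory additionally requires the pose tracking the article mentions.</p>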