How One AI Model Creates a Physical Intuition of Its Environment

Here's a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. If the board keeps going past the glass, as if the glass weren't there, are they surprised? Many 6-month-olds are, and by a year, almost all children have an intuitive notion of an object's permanence, learned through observation. Now some artificial intelligence models do too.

Researchers have developed an AI system that learns about the world via videos and demonstrates a notion of "surprise" when presented with information that goes against the knowledge it has gleaned.

The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), makes no assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.

"Their claims are, a priori, very plausible, and the results are super interesting," says Micha Heilbron (https://www.uva.nl/en/profile/h/e/m.heilbron/m.heilbron.html), a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.

Higher Abstractions

As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees.
Most systems designed to "understand" videos, in order to either classify their content ("a person playing tennis," for example) or identify the contours of an object, say, a car up ahead, work in what's called "pixel space." The model essentially treats every pixel in a video as equal in importance.

But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light, or the positions of nearby cars. "When you go to images or video, you don't want to work in [pixel] space because there are too many details you don't want to model," said Randall Balestriero (https://vivo.brown.edu/display/rbalestr), a computer scientist at Brown University.

[Photo: Portrait of a man wearing glasses]

Yann LeCun, a computer scientist at New York University and the director of AI research at Meta, created JEPA, a predecessor to V-JEPA that works on still images, in 2022.

École Polytechnique Université Paris-Saclay

The V-JEPA architecture, released in 2024, is designed to avoid these problems. While the specifics of the various artificial neural networks that make up V-JEPA are complex, the basic concept is simple.

Ordinary pixel-space systems go through a training process that involves masking some pixels in the frames of a video and training neural networks to predict the values of those masked pixels.
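That pixel-space training scheme can be sketched in a few lines. This is a toy illustration, not Meta's code: the frame, the mask and the stand-in "model" are all invented for the example.

```python
import numpy as np

# Toy sketch of pixel-space masked prediction (not Meta's code).
# A video frame is a grid of pixel values; training hides some pixels
# and scores the model on reconstructing exactly those values.

rng = np.random.default_rng(0)
frame = rng.random((8, 8))            # one 8x8 grayscale frame
mask = rng.random((8, 8)) < 0.25      # hide roughly 25% of the pixels

masked_frame = frame.copy()
masked_frame[mask] = 0.0              # masked pixels are blanked out

# A stand-in "model" that simply predicts the mean of the visible pixels.
prediction = np.full_like(frame, masked_frame[~mask].mean())

# Pixel-space loss: mean squared error over the masked positions only.
# Every hidden pixel counts equally, so a swaying leaf contributes as
# much to the loss as a traffic light.
loss = np.mean((prediction[mask] - frame[mask]) ** 2)
```

The key point is the last line: the loss is computed pixel by pixel, which is why such models spend capacity on details that don't matter.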
V-JEPA also masks portions of video frames. But it doesn't predict what's behind the masked regions at the level of individual pixels. Rather, it uses higher levels of abstraction, or "latent" representations, to model the content.

Latent representations capture only the essential details of the data. For example, given line drawings of various cylinders, a neural network called an encoder can learn to convert each image into numbers representing fundamental aspects of each cylinder, such as its height, width, orientation and location. In doing so, the information contained in hundreds or thousands of pixels is condensed into a handful of numbers, the latent representation. A separate neural network called a decoder then learns to convert the cylinder's essential details back into an image of the cylinder.

V-JEPA focuses on creating and reproducing latent representations. At a high level, the architecture is split into three parts: encoder 1, encoder 2 and a predictor. First, the training algorithm takes a set of video frames, masks the same set of pixels in all frames, and feeds the frames into encoder 1. Sometimes the final few frames of the video are fully masked. Encoder 1 converts the masked frames into latent representations. The algorithm also feeds the unmasked frames, in their entirety, into encoder 2, which converts them into another set of latent representations.

Now the predictor gets into the act. It uses the latent representations produced by encoder 1 to predict the output of encoder 2. In essence, it takes latent representations generated from masked frames and predicts the latent representations generated from the unmasked frames.
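The three-part scheme above, two encoders plus a predictor, with the loss computed between latent vectors rather than pixels, can be sketched as follows. This is a heavily simplified toy: the "encoders" here are random linear maps, not the deep networks V-JEPA actually uses, and all shapes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(frames, weights):
    # Toy encoder: a linear map from flattened frames to a small latent
    # vector. The real architecture uses deep neural networks here.
    return frames.reshape(len(frames), -1) @ weights

n_frames, h, w, latent_dim = 4, 8, 8, 5
frames = rng.random((n_frames, h, w))

mask = rng.random((h, w)) < 0.25
masked = frames.copy()
masked[:, mask] = 0.0                 # same pixels hidden in every frame

w1 = rng.standard_normal((h * w, latent_dim))          # encoder 1 (masked input)
w2 = rng.standard_normal((h * w, latent_dim))          # encoder 2 (full input)
w_pred = rng.standard_normal((latent_dim, latent_dim)) # predictor

z_masked = encoder(masked, w1)        # latents from the masked frames
z_full = encoder(frames, w2)          # target latents from the full frames

predicted = z_masked @ w_pred         # predictor maps latent -> latent

# The loss lives in latent space: a handful of numbers per frame are
# compared, not thousands of pixels.
loss = np.mean((predicted - z_full) ** 2)
```

The design choice the sketch highlights: because the comparison happens after encoding, whatever the encoders discard (leaf motion, texture noise) can never contribute to the loss.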
By re-creating the relevant latent representations, rather than the missing pixels as earlier systems did, the model learns to see the cars on the road without fussing over the leaves on the trees.

"This enables the model to discard unnecessary … information and focus on more important aspects of the video," said Quentin Garrido (https://garridoq.com/), a research scientist at Meta. "Discarding unnecessary information is very important and something that V-JEPA aims at doing efficiently."

Once this pretraining stage is complete, the next step is to tailor V-JEPA to specific tasks such as classifying images or identifying the actions depicted in videos. This adaptation phase requires some human-labeled data: Videos have to be tagged with information about the actions they contain, for example. Adapting to these final tasks requires much less labeled data than training the whole system end to end for specific downstream tasks would. In addition, the same encoder and predictor networks can be adapted for different tasks.

Intuition Mimic

In February, the V-JEPA team reported (https://arxiv.org/html/2502.11831v1) how their systems did at understanding the intuitive physical properties of the real world, properties such as object permanence, the constancy of shape and color, and the effects of gravity and collisions. On a test called IntPhys (https://arxiv.org/abs/1803.07616), which requires AI models to identify whether the actions happening in a video are physically plausible or implausible, V-JEPA was nearly 98% accurate.
A well-known model that predicts in pixel space was only a little better than chance.

[Photo: Robot arm]

Autonomous robots need something like a physical intuition in order to plan their movements and interact with the physical environment.

Wladimir Bulgar/Science Photo Library

The V-JEPA team also explicitly quantified the "surprise" exhibited by their model when its predictions did not match its observations. They took a V-JEPA model pretrained on natural videos, fed it new videos, then mathematically calculated the difference between what V-JEPA expected to see in future frames of the video and what actually happened. The team found that the prediction error shot up when the future frames contained physically impossible events. For example, if a ball rolled behind some occluding object and temporarily disappeared from view, the model generated an error when the ball didn't reappear from behind the object in future frames. The reaction was akin to the intuitive response seen in infants. V-JEPA, one could say, was surprised.

Heilbron is impressed by V-JEPA's ability. "We know from the developmental literature that babies don't need a lot of exposure to learn these types of intuitive physics," he said.
"It's compelling that they show that it's learnable in the first place, and you don't have to come with all these innate priors."

Karl Friston (https://profiles.ucl.ac.uk/2747-karl-friston), a computational neuroscientist at University College London, thinks that V-JEPA is on the right track in terms of mimicking the "way our brains learn and model the world." However, it still lacks some fundamental elements. "What is missing from [the] current proposal is a proper encoding of uncertainty," he said. For example, if the information in the past frames isn't enough to accurately predict the future frames, the prediction is uncertain, and V-JEPA doesn't quantify this uncertainty.

In June, the V-JEPA team at Meta released their next-generation 1.2-billion-parameter model, V-JEPA 2 (https://arxiv.org/abs/2506.09985), which was pretrained on 22 million videos. They also applied the model to robotics: They showed how to fine-tune a new predictor network using only about 60 hours of robot data (including videos of the robot and information about its actions), then used the fine-tuned model to plan the robot's next action. "Such a model can be used to solve simple robotic manipulation tasks and paves the way to future work in this direction," Garrido said.

To push V-JEPA 2, the team designed a more difficult benchmark for intuitive-physics understanding, called IntPhys 2 (https://www.arxiv.org/abs/2506.09849). V-JEPA 2 and other models did only slightly better than chance on these tougher tests. One reason, Garrido said, is that V-JEPA 2 can handle only a few seconds of video as input and predict only a few seconds into the future. Anything longer is forgotten.
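The "surprise" readout described earlier, the per-frame gap between predicted and observed representations, can be sketched as follows. The numbers here are invented for illustration (a ball moving at constant velocity whose trace simply vanishes); the real model's latents and error metric are more involved.

```python
import numpy as np

# Toy sketch (not the paper's exact metric): compare the latent the
# model predicted for each future frame with the latent computed from
# the frame that actually arrived.

def surprise(predicted_latents, observed_latents):
    # One prediction error per frame: squared distance in latent space.
    return np.sum((predicted_latents - observed_latents) ** 2, axis=1)

# Invented 2-number "latents" for 5 frames: the prediction tracks a ball
# moving steadily; the observation matches until frame 3, where the ball
# impossibly fails to reappear from behind an occluder.
predicted = np.array([[i, 1.0] for i in range(5)])
observed = predicted.copy()
observed[3:] = [[0.0, 0.0], [0.0, 0.0]]   # the ball vanishes

errors = surprise(predicted, observed)
# The error is zero for the plausible frames and spikes from the
# impossible frame onward -- the model's "surprise."
```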
You could make the comparison again to infants, but Garrido had a different creature in mind. "In a sense, the model's memory is reminiscent of a goldfish," he said.