You’ve probably seen an artificial intelligence system go off track. You ask for a video of a dog, and as the dog runs behind the love seat, its collar disappears. Then, as the camera pans back, the love seat becomes a sofa.
Part of the problem lies in the predictive nature of many AI models. Like the models that power ChatGPT, which are trained to predict text, video generation models predict whatever is statistically most likely to look right next. In neither case does the AI hold a clearly defined model of the world that it continuously updates to make more informed decisions.
But that’s starting to change as researchers across many AI domains work on creating “world models,” with implications that extend beyond video generation and chatbot use to augmented reality, robotics, autonomous vehicles and even humanlike intelligence—or artificial general intelligence (AGI).
A simple way to understand world modeling is through four-dimensional, or 4D, models (three dimensions plus time). To do this, let’s think back to 2012, when Titanic, 15 years after its theatrical release, was painstakingly converted into stereoscopic 3D. If you were to freeze any frame, you would have an impression of distance between characters and objects on the ship. But if Leonardo DiCaprio had his back to the camera, you wouldn’t be able to walk around him to see his face. Cinema’s illusion of 3D is made using stereoscopy—two slightly different images often projected in rapid alternation, one for the left eye and one for the right. Everyone in the cinema sees the same pair of images and thus a similar perspective.
Multiple perspectives are, however, increasingly possible thanks to the past decade of research. Imagine realizing you should have shot a photo from a different angle and then having AI make that adjustment, rendering the same scene from a new perspective. Starting in 2020, NeRF (neural radiance field) algorithms offered a path to creating “photorealistic novel views,” but they required combining many photos so that an AI system could build a 3D representation of the scene. Other 3D approaches use AI to fill in missing information predictively, deviating more from reality.
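To give a flavor of the mechanics, here is a minimal sketch of the volume rendering at the heart of NeRF-style methods, written in Python. The radiance_field argument stands in for a trained neural network that maps 3D points and a viewing direction to colors and densities; the names and default values are illustrative rather than taken from any particular system.

```python
import numpy as np

def render_ray(radiance_field, origin, direction, near=0.1, far=4.0, n_samples=64):
    """Accumulate color along one camera ray, NeRF-style.

    radiance_field(points, direction) is assumed to return per-point RGB
    colors and volume densities; in a real system it is a trained neural
    network, here it is just a placeholder callable.
    """
    # Sample points at evenly spaced depths along the ray.
    t_vals = np.linspace(near, far, n_samples)
    points = origin + t_vals[:, None] * direction            # (n_samples, 3)

    colors, densities = radiance_field(points, direction)    # (n, 3), (n,)

    # Classic volume-rendering weights: how much each sample contributes.
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-densities * deltas)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = alphas * transmittance

    # The pixel color is the weighted sum of the sampled colors.
    return (weights[:, None] * colors).sum(axis=0)
```

Calling a function like this once per pixel, with rays traced from a camera position the original photos never occupied, is what produces a “novel view.”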
Now, imagine that every frame in Titanic were represented in 3D so that the movie existed in 4D. You could scroll through time to see different moments or scroll through space to watch it from different perspectives. You could also generate new versions of it. For instance, a recent preprint, “NeoVerse: Enhancing 4D World Model with in-the-Wild Monocular Videos,” describes one way of turning videos into 4D models to generate new videos from different perspectives.
But 4D techniques can also help generate entirely new video content. Another recent preprint, “TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model,” applies to the scenario with which we began: the dog running behind the love seat. The authors argue that the stability of AI video systems improves when a continuously updated 4D world model guides generation. The system’s 4D model would help to prevent the love seat from becoming a sofa and the dog from losing its collar.
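Neither paper’s code appears here, but the general pattern the authors describe can be sketched in a few lines of Python: keep a persistent scene state, condition each generated frame on it and update the state from the result. Every class and method name below is hypothetical; this is a conceptual outline, not the published system.

```python
# Conceptual sketch of world-model-guided video generation, not the
# TeleWorld implementation. All names and interfaces are hypothetical.

def generate_video(video_model, world_model, prompt, n_frames=48):
    scene = world_model.init_scene(prompt)   # persistent 4D state: objects, geometry, time
    frames = []
    for t in range(n_frames):
        # The generator is conditioned on the scene state, so off-screen
        # objects (the dog's collar, the love seat) keep their identity.
        frame = video_model.generate_frame(prompt, scene)
        # The world model is updated from the new frame, carrying object
        # permanence and geometry forward to the next step.
        scene = world_model.update(scene, frame)
        frames.append(frame)
    return frames
```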
These are early results, but they hint at a broader trend: models that update an internal scene map as they generate. Yet 4D modeling has applications far beyond video generation. For augmented reality (AR)—think Meta’s Orion prototype glasses—a 4D world model is an evolving map of the user’s world over time. It allows AR systems to keep virtual objects stable, to make lighting and perspective believable and to have a spatial memory of what recently happened. It also allows for occlusions—when digital objects disappear behind real ones. A 2023 paper puts the requirement bluntly: “To achieve occlusion, a 3D model of the physical environment is required.”
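The logic behind that requirement fits in a few lines, assuming the AR system can render a depth map of the virtual object and of the environment’s 3D model from the current camera pose. The array names below are illustrative.

```python
import numpy as np

def composite_with_occlusion(camera_frame, virtual_rgb, virtual_depth, real_depth):
    """Hide virtual pixels that fall behind real-world surfaces.

    real_depth is rendered from a 3D model of the environment at the current
    camera pose; virtual_depth is the digital object's rendered depth (set to
    infinity wherever the object is absent). Per pixel, the closer surface wins.
    """
    virtual_visible = virtual_depth < real_depth        # boolean mask, shape (H, W)
    out = camera_frame.copy()
    out[virtual_visible] = virtual_rgb[virtual_visible]
    return out
```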
Being able to rapidly convert videos into 4D also provides rich data for training robots and autonomous vehicles on how the real world works. And by generating 4D models of the space they’re in, robots could navigate it better and predict what might happen next. Today’s general-purpose vision-language AI models—which understand images and text but do not generate clearly defined world models—often make errors; a benchmark paper presented at a 2025 conference reports “striking limitations” in their basic world-modeling abilities, including “near-random accuracy when distinguishing motion trajectories.”
Here’s the catch: “world model” means much more to those pursuing AGI. For instance, today’s leading large language models (LLMs), such as those powering ChatGPT, have an implicit sense of the world from their training data. “In a way, I would say that the LLM already has a very good world model; it’s just we don’t really understand how it’s doing it,” says Angjoo Kanazawa, an assistant professor of electrical engineering and computer sciences at the University of California, Berkeley. These conceptual models, though, aren’t a real-time physical understanding of the world because LLMs can’t update their training data in real time. Even OpenAI’s technical report notes that, once deployed, its model GPT-4 “does not learn from experience.”
“How do you develop an intelligent LLM vision system that can actually have streaming input and update its understanding of the world and act accordingly?” Kanazawa says. “That’s a big open problem. I think AGI is not possible without actually solving this problem.”
Though researchers debate whether LLMs could ever attain AGI, many see LLMs as a component of future AI systems. The LLM would act as the layer for “language and common sense to communicate,” Kanazawa says; it would serve as an “interface,” whereas a more clearly defined underlying world model would provide the necessary “spatial temporal memory” that current LLMs lack.
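One speculative way to picture that division of labor, with every name below invented for illustration, is an agent in which streaming observations keep a world model’s spatial-temporal memory current while the LLM answers questions against a summary of that state rather than against frozen training data.

```python
# Speculative sketch of the architecture described above, not any existing
# system. All class and method names are hypothetical.

class EmbodiedAgent:
    def __init__(self, llm, world_model):
        self.llm = llm                  # language and common-sense layer
        self.world_model = world_model  # persistent 4D memory of the scene

    def observe(self, frame):
        # Streaming input continuously updates the internal world state,
        # something today's deployed LLMs do not do on their own.
        self.world_model.update(frame)

    def answer(self, question):
        # The LLM reasons over a summary of the world model's current state
        # rather than over raw pixels or frozen training data.
        scene_summary = self.world_model.describe_current_state()
        return self.llm.respond(question, context=scene_summary)
```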
In recent years a number of prominent AI researchers have turned toward world models. In 2024 Fei-Fei Li founded World Labs, which recently launched its Marble software to create 3D worlds from “text, images, video, or coarse 3D layouts,” according to the start-up’s promotional material. And last November AI researcher Yann LeCun announced on LinkedIn that he was leaving Meta to launch a start-up, now called Advanced Machine Intelligence (AMI Labs), to build “systems that understand the physical world, have persistent memory, can reason, and can plan complex action sequences.” He seeded these ideas in a 2022 position paper in which he asked why humans can act well in situations they’ve never encountered and argued the answer “may lie in the ability… to learn world models, internal models of how the world works.” Research increasingly shows the benefits of internal models. An April 2025 Nature paper reported results on DreamerV3, an AI agent that, by learning a world model, can improve its behavior by “imagining” future scenarios.
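DreamerV3’s actual training procedure is considerably more involved, but the core idea of improving behavior by “imagining” can be sketched generically: roll a learned dynamics model forward from the current state to score candidate actions without ever touching the real environment. The snippet below is a generic model-based planning sketch with hypothetical interfaces, not the published algorithm.

```python
# Generic model-based RL sketch, not DreamerV3's implementation: the agent
# "imagines" trajectories with learned dynamics and reward models to decide
# which first action looks most promising.

def imagine_and_pick_action(dynamics_model, reward_model, policy, state,
                            horizon=15, n_candidates=8):
    best_action, best_return = None, float("-inf")
    for _ in range(n_candidates):
        s = state
        a = policy.sample(s)                       # candidate first action
        first_action, total = a, 0.0
        for _ in range(horizon):
            total += reward_model.predict(s, a)    # imagined reward for taking a in s
            s = dynamics_model.predict(s, a)       # imagined next state
            a = policy.sample(s)
        if total > best_return:
            best_action, best_return = first_action, total
    return best_action
```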
So while “world model,” in the context of AGI, refers to an internal model of how reality works rather than just a 4D reconstruction, advances in 4D modeling could supply components that help with understanding viewpoints, memory and even short-term prediction. And meanwhile, on the path to AGI, 4D models can provide rich simulations of reality in which to test AIs to ensure that when we do let them operate in the real world, they know how to exist in it.