Three ways AI is learning to understand the physical world

World models
Large language models are running into limitations in domains that require an understanding of the physical world – from robotics to autonomous driving to manufacturing. This hurdle is pushing investors toward world models, with AMI Labs raising a $1.03 billion seed round soon after World Labs raised $1 billion.

Large language models (LLMs) excel at processing abstract knowledge through next-token prediction, but they fundamentally lack physical causality. They cannot reliably predict the physical consequences of real-world actions.

AI researchers and thought leaders are becoming increasingly vocal about these limitations as the industry tries to push AI out of web browsers and into physical spaces. In an interview with podcaster Dwarkesh Patel, Turing Award recipient Richard Sutton warned that LLMs copy what people say rather than model the world, which limits their ability to learn from experience and adjust to changes in the world.

This is why LLM-based models, including vision-language models (VLMs), can be brittle, breaking down under very small changes to their inputs.

Google DeepMind CEO Demis Hassabis echoed this sentiment in another interview, pointing out that today’s AI models suffer from “jagged intelligence.” They can solve Olympiad-level math problems but fail at basic physics because they lack a grounded understanding of real-world dynamics.

To solve this problem, researchers are focusing on creating world models that act as internal simulators, allowing AI systems to safely test hypotheses before taking physical action. However, “world model” is a broad term that includes many distinct architectural approaches.

Three distinct architectural approaches have emerged, each with its own tradeoffs.

JEPA: built for real time

The first approach focuses on learning latent representations rather than trying to predict the dynamics of the world at the pixel level. Championed by AMI Labs, this method is based largely on the Joint Embedding Predictive Architecture (JEPA).

JEPA models try to mimic how humans understand the world. When we observe a scene, we don’t remember every pixel or irrelevant detail. For example, if you see a car driving down the road, you track its trajectory and speed; you don’t calculate the exact reflection of light on each leaf of the trees in the background.

JEPA models reproduce this human cognitive shortcut. Instead of forcing the neural network to predict what the next frame of video will look like, the model learns a small set of abstract, or “latent” features. It discards irrelevant details and focuses solely on the core rules of how elements in the scene interact. This makes the model robust against background noise and small variations that break other models.
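The core idea – predicting in latent space rather than pixel space – can be sketched in a few lines. This is a toy, linear stand-in for illustration only, not AMI’s actual architecture: the names `encode`, `predict`, and the dimensions are all made up. The point is where the loss lives: the predictor is trained to match the *embedding* of the next frame, so pixel-level detail is never reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy JEPA-style sketch (illustrative only): an encoder maps frames to a
# small latent vector, and a predictor is trained so that
# predict(encode(frame_t)) matches encode(frame_t1).
LATENT = 8
W_enc = rng.normal(size=(LATENT, 32 * 32))   # linear "encoder"
W_pred = rng.normal(size=(LATENT, LATENT))   # latent-space predictor

def encode(frame):
    return W_enc @ frame.ravel()

def predict(z):
    return W_pred @ z

frame_t = rng.normal(size=(32, 32))                      # current frame
frame_t1 = frame_t + 0.01 * rng.normal(size=(32, 32))    # next frame

z_t, z_t1 = encode(frame_t), encode(frame_t1)

# The training signal is entirely in latent space: minimizing this loss
# never forces the model to reproduce background pixels.
latent_loss = np.mean((predict(z_t) - z_t1) ** 2)

print(f"pixel dim: {frame_t.size}, latent dim: {z_t.shape[0]}")
```

The compression is the point: the model only has to get an 8-dimensional prediction right instead of a 1,024-pixel reconstruction, which is where the robustness and efficiency gains described above come from.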

This architecture is highly compute and memory efficient. By ignoring irrelevant details, it requires far fewer training examples and runs with significantly lower latency. These features make it suitable for applications where efficiency and real-time predictability cannot be compromised, such as robotics, self-driving cars, and high-stakes enterprise workflows.

For example, AMI is partnering with healthcare company Nabla to use this architecture to simulate operational complexity and reduce cognitive load in fast-paced healthcare settings.

In an interview with Newsweek, Yann LeCun, pioneer of the JEPA architecture and co-founder of AMI, explained that JEPA-based world models are designed to be “controllable in the sense that you can give them goals, and by construction, they can only accomplish those goals.”

Gaussian Splats: Built for Space

The second approach uses generative models to create entire spatial environments from scratch. This method, adopted by companies like World Labs, takes an initial signal – an image or a textual description – and uses a generative model to produce a 3D Gaussian splat. Gaussian splatting is a technique for representing 3D scenes as millions of tiny mathematical particles that encode geometry, color, and lighting. Unlike flat video generation, these 3D representations can be imported directly into standard physics and 3D engines, such as Unreal Engine, where users and other AI agents can freely navigate and interact with them from any angle.
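To make the representation concrete, here is a minimal sketch of what a splat-based scene looks like as data. This is an illustrative simplification (the `GaussianScene` structure and field layout are assumptions, not World Labs’ format): each particle carries a position, a per-axis scale and rotation that together define its covariance, a color, and an opacity, and renderers project and alpha-blend these particles to produce an image.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianScene:
    """Hypothetical container for a Gaussian-splat scene (illustrative)."""
    positions: np.ndarray   # (N, 3) world-space centers
    scales: np.ndarray      # (N, 3) per-axis extent of each Gaussian
    rotations: np.ndarray   # (N, 4) unit quaternions orienting each Gaussian
    colors: np.ndarray      # (N, 3) RGB
    opacities: np.ndarray   # (N,) alpha in [0, 1] for blending

def random_scene(n, rng):
    q = rng.normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)   # normalize quaternions
    return GaussianScene(
        positions=rng.uniform(-1, 1, size=(n, 3)),
        scales=rng.uniform(0.01, 0.1, size=(n, 3)),
        rotations=q,
        colors=rng.uniform(0, 1, size=(n, 3)),
        opacities=rng.uniform(0, 1, size=n),
    )

# A "millions of particles" scene is just large flat arrays:
# 3 + 3 + 4 + 3 + 1 = 14 floats per splat.
scene = random_scene(100_000, np.random.default_rng(0))
print(scene.positions.shape)
```

Because the whole scene is explicit arrays of particles rather than frames of video, it can be handed to an external engine and viewed from any angle, which is exactly the property the article describes.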

The primary benefit here is a huge reduction in the time and one-time generation cost of creating complex interactive 3D environments. This addresses the exact problem outlined by World Labs founder Fei-Fei Li, who has said that LLMs are ultimately “wordsmiths in the dark”: eloquent with language but lacking spatial intelligence and physical experience. World Labs’ Marble model gives AI that missing spatial awareness.

Although this approach is not designed for split-second, real-time execution, it has enormous potential for creating static training environments for spatial computing, interactive entertainment, industrial design, and robotics. The enterprise value is evident in Autodesk’s heavy support of World Labs to integrate these models into their industrial design applications.

End-to-end generation: built for scale

The third approach uses an end-to-end generative model to process signals and user actions, generating consistent visuals, physical dynamics, and immediate responses. Instead of exporting a static 3D file to an external physics engine, the model itself acts as the engine. It ingests an initial signal with a continuous stream of user actions, and it generates subsequent frames of the environment in real time, seamlessly calculating physics, lighting, and object reactions.

DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category. These models provide open-ended interactive experiences and an extremely simple interface for generating massive amounts of synthetic data. DeepMind demonstrated this with Genie 3, showing how the model maintains object persistence and consistent physics at 24 frames per second without relying on a separate memory module.

This approach translates directly into heavy-duty synthetic data factories. Nvidia Cosmos uses this architecture to scale synthetic data generation for physical AI, allowing autonomous vehicle and robotics developers to synthesize rare, dangerous edge-case conditions without the cost or risk of physical testing. Waymo (a fellow Alphabet subsidiary) is adapting Genie 3 and building its own world model on top of it to train its self-driving cars.

The downside of this end-to-end generative method is the heavy compute cost of continuously simulating physics and rendering pixels at the same time. Nevertheless, the investment is necessary to achieve the vision set out by Hassabis, who argues that a deep, intrinsic understanding of physical causality is needed because current AI lacks critical capabilities to operate safely in the real world.

What comes next: hybrid architecture

LLMs will continue to serve as logic and communication interfaces, but world models are establishing themselves as the foundational infrastructure for physical and spatial data pipelines. As the underlying models mature, we are seeing the emergence of hybrid architectures that build on the strengths of each approach.

For example, cybersecurity startup Deeptempo recently developed LogLM, a model that combines elements of LLM and JEPA architectures to detect anomalies and cyber threats in security and network logs.
