Why Fei-Fei Li and Yann LeCun Are Both Betting on “World Models” — and How Their Bets Differ

AI has finally reached the “we need to model the entire world” stage.

That same season, Fei-Fei Li’s World Labs shipped Marble, a “Multimodal World Model” that turns text, images, and video into walkable 3D scenes in your browser, and reports emerged that Meta’s chief AI scientist Yann LeCun is leaving to build his own world-model startup. Meanwhile, DeepMind is calling its new interactive video engine, Genie 3, a world model too.

Same phrase. Three very different bets.

This week “world models” went mainstream

World Labs has spent a year crafting a neat story: Fei-Fei Li’s manifesto, From words to worlds: Spatial intelligence is the next frontier of AI, argues that language-only systems (LLMs) are a dead end and that the real frontier is “spatial intelligence” and “world models” that understand 3D space, physics, and action. On top of this comes the launch of Marble, which promises that anyone can now generate editable 3D worlds from text, images, video, or simple layouts.

Around the same time, outlets like Nasdaq reported that LeCun was preparing to leave Meta and raise money for a company “focused on world models”, in the completely different sense he has been sketching since his 2022 paper A path toward autonomous machine intelligence (Nasdaq, paper PDF).

On Hacker News, the Marble launch thread is full of arguments about Gaussian splats and game engines (HN). The LeCun thread is full of arguments over whether Meta chose “AI slopware” instead of proper research. Same words, different fights.

To understand why, we need to start with the only thing anyone can actually click on.

World Labs’ World Model: Gaussian Splats for Humans

Marble, as shipped today, is a full-stack 3D content pipeline:

  • It ingests text prompts, single images, short videos, or blocky 3D layouts.
  • It hallucinates a 3D representation of a scene.
  • It lets you walk around that scene in a web or VR viewer and modify it with an in-browser editor called Chisel.
  • It exports Gaussian splats, standard meshes (OBJ/FBX), or flat video for downstream tools (Marble docs, Radiance Fields explainer).

For those who ship VR apps or game levels, a pipeline that goes “prompt → 3D world → export to three.js/Unity” is extremely useful. World Labs also ships its own Three.js renderer, Spark, tuned specifically for splats (Spark release).
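To make the “prompt → 3D world → export to three.js” step concrete, here is a minimal sketch of loading a Marble-style splat export with Spark. It assumes Spark’s SplatMesh class accepts a URL option, as in its launch examples, and the file name is a placeholder; treat it as illustrative rather than canonical.

```ts
import * as THREE from "three";
import { SplatMesh } from "@sparkjsdev/spark";

// Standard three.js boilerplate: scene, camera, WebGL renderer.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, window.innerWidth / window.innerHeight, 0.1, 100);
const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// Hypothetical file exported from Marble; Spark renders it like any other mesh.
const splat = new SplatMesh({ url: "marble-scene.spz" });
splat.position.set(0, 0, -3);
scene.add(splat);

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```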

But it is, above all, a 3D-assets story. On Marble’s own blog, “world model” sits in the same sentence as “export Gaussian splats, meshes, and video”; no robot in sight.

Hacker News users spotted it immediately. An early top-level comment, comparing Marble to DeepMind’s video-based Genie, says:

“Genie delivers on-the-fly generated video that responds to user input in real time. Marble renders a static Gaussian splat asset (like a 3D game engine asset) that you render in the game engine.”

Another says, with the characteristic baffled humility of an ML engineer:

“Isn’t this a Gaussian splat model? I work in AI and to this day, I don’t know what they mean by ‘world’ in ‘world model’.”

Reddit is less shy. In an r/StableDiffusion thread about the first demo from the “$230 million startup led by Fei-Fei Li”, a commenter described it this way:

“Taking images and turning them into a 3D environment using Gaussian splats, depth, and inpainting. Nice, but it’s a 3D GS pipeline, not a robot brain.”

(reddit thread)

None of this diminishes Marble. It does make the use of “world model” a bit ambitious. To see why, you need a quick primer on what a Gaussian splat actually is.

If you’re not a 3D person, the splat talk of 2025 may sound like hand-waving. In practice, there are three characters:

  • Photogrammetry – The old guard. Take hundreds of overlapping photos of a real thing, reconstruct a polygon mesh (a shell made of small triangles), and bake the textures on top. Great if you want to measure, collide with, or 3D-print the result.

  • 3D Gaussian splatting – The new hotness. Represent the scene as millions of fuzzy colored blobs (“Gaussians”) floating in space, and “splat” them onto the screen so they blend into a single image. Excellent at foliage, hair, and soft light; runs in real time on gaming GPUs. The canonical paper is Kerbl et al., 3D Gaussian Splatting for Real-Time Radiance Field Rendering. (A toy sketch of the splatting idea follows this list.)

  • Renderer – Engines like Three.js, Unity, or Unreal that take a mesh or splat cloud and turn it into pixels.
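Here is the promised toy sketch of the core splatting idea: depth-sort the blobs, then alpha-blend them front to back per pixel. The real 3DGS pipeline projects anisotropic 3D Gaussians through a tiled GPU rasterizer; everything below (isotropic blobs, a per-pixel loop) is a simplification for illustration.

```ts
// Toy splatting: each Gaussian is a fuzzy colored disc with a depth.
// Real 3DGS uses anisotropic covariances and a tiled GPU rasterizer.
type Gaussian = {
  x: number; y: number;             // projected screen position
  depth: number;                    // distance from camera, for sorting
  radius: number;                   // isotropic spread (a simplification)
  color: [number, number, number];
  opacity: number;
};

function shadePixel(px: number, py: number, splats: Gaussian[]): [number, number, number] {
  // Front-to-back: nearer splats contribute first and occlude the rest.
  const sorted = [...splats].sort((a, b) => a.depth - b.depth);
  const out: [number, number, number] = [0, 0, 0];
  let transmittance = 1; // fraction of light still passing through
  for (const g of sorted) {
    const d2 = (px - g.x) ** 2 + (py - g.y) ** 2;
    const falloff = Math.exp(-d2 / (2 * g.radius ** 2)); // Gaussian weight
    const alpha = Math.min(0.99, g.opacity * falloff);
    for (let c = 0; c < 3; c++) out[c] += transmittance * alpha * g.color[c];
    transmittance *= 1 - alpha;
    if (transmittance < 1e-3) break; // early termination, as in the paper
  }
  return out;
}
```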

A photogrammetry practitioner on r/photogrammetry puts the trade-off this way:

“If you want to do something with the mesh itself then use photogrammetry, and if you want to skip all the steps and just show the scan as it is then use Gaussian splatting. It’s a shortcut to interactive photorealism.”

(reddit thread)

Marble lives fully in that world: it is a shortcut to interactive photorealism. It generates splats and meshes and hands them to a renderer. The “world” it models is the part we can see and move around in. It is built for humans (and game engines), not for machines to think with.

However, Fei-Fei Li’s essay speaks in a different register.

She writes about “embodied agents”, “commonsense physics”, and “robots that can understand the world and act” – all things you would want a robot’s internal model to support. Marble is presented as a “first step” on that road. The tension, and the comedic potential, comes from the fact that step one is currently a very sophisticated 3DGS viewer.

Ironically, Fei-Fei Li’s original manifesto, From words to worlds, never mentions 3D Gaussian splatting – the technology at the heart of Marble’s output pipeline.

If Marble were the only “world model” on offer, you might reasonably conclude that the term has been hijacked by marketing. Unfortunately for your hot take, Yann LeCun exists.

LeCun’s world model: the brain in the middle

LeCun’s use of “world models” comes from control theory and cognitive science rather than 3D graphics.

In A path toward autonomous machine intelligence (PDF), he describes a system in which:

  • A world model Receives streams of sensory data.
  • it learns latent state: Compressed internal variables that capture “what’s going on there”.
  • it learns predict how that latent state will develop When the agent (or environment) acts.
  • A separate module uses that machinery Plan Select More actions.

You never see the world model directly. It doesn’t need to output pretty pictures. Its function is to allow the agent to think a few steps ahead.
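In code terms, the contract LeCun describes looks roughly like the sketch below. The interfaces are invented here for illustration, not taken from any JEPA release; the point is structural. The model maps (latent state, action) to the next latent state, and a planner searches over imagined rollouts without ever rendering a pixel.

```ts
// Schematic LeCun-style agent: perception -> latent state,
// imagined transitions in latent space, planning by search.
type Latent = Float32Array;
type Action = Float32Array;

interface WorldModel {
  encode(observation: Float32Array): Latent;       // sensory stream -> latent state
  predict(state: Latent, action: Action): Latent;  // imagined next state
}

// Pick the candidate action sequence whose imagined end state scores best.
function plan(
  model: WorldModel,
  start: Latent,
  candidates: Action[][],
  cost: (s: Latent) => number
): Action[] {
  let best = candidates[0];
  let bestCost = Infinity;
  for (const seq of candidates) {
    let s = start;
    for (const a of seq) s = model.predict(s, a); // roll forward in latent space
    const c = cost(s);
    if (c < bestCost) { bestCost = c; best = seq; }
  }
  return best; // the agent "thinks a few steps ahead" without rendering anything
}
```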

JEPA-style models – “Joint Embedding Predictive Architecture” – are early examples of this approach: instead of predicting raw pixels, they predict masked or future embeddings, and are trained to produce useful representations rather than full renderings. LeCun has been talking about this since at least 2022 (YouTube).
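The training objective, in caricature: embed a visible context patch and a hidden target patch, predict the target’s embedding from the context, and score the error in embedding space rather than pixel space. The function shapes below are mine, for illustration; real recipes add EMA target encoders, masking strategies, and anti-collapse regularizers.

```ts
type Vec = number[];
type Net = (x: Vec) => Vec; // stand-in for a trainable network

// Caricature of a JEPA loss: no pixels are reconstructed; the prediction
// is compared to the target *embedding* (in practice the target branch
// uses an EMA encoder with gradients blocked).
function jepaLoss(ctxEncoder: Net, tgtEncoder: Net, predictor: Net,
                  contextPatch: Vec, targetPatch: Vec): number {
  const predicted = predictor(ctxEncoder(contextPatch));
  const target = tgtEncoder(targetPatch);
  // Mean squared error in embedding space.
  return predicted.reduce((sum, p, i) => sum + (p - target[i]) ** 2, 0) / predicted.length;
}
```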

When Nasdaq and others reported that he was planning to create a world-model startup (Nasdaq), the reaction on HN was not “Oh, another 3D viewer.” It was:

  • Does this mean that Meta has abandoned this line of research in favor of GPT-ish products?
  • Can an architecture like JEPA ever match LLMs in practical utility?
  • Is there even a market for a world model that lives mostly in diagrams and robot labs?

Whether you think LeCun is right or wrong, you can’t really accuse him of chasing the same thing as World Labs. One “world model” is essentially a front-end asset generator. The other is a back-end predictive brain.

And then there’s DeepMind, which sits happily in the middle.

DeepMind’s world model: The world as video

DeepMind introduced its Genie 3 model, with no great modesty, as “a new frontier for world models” (blog).

From a text prompt, it generates an interactive video environment at 720p/24 fps in which you (or an agent) can roam around for several minutes. Objects persist across frames, you can “prompt” world events (“it starts raining”), and the whole thing behaves like a mini video game rendered by a model rather than a traditional engine.

The Guardian describes it as a way for AI agents and robots to “train in virtual warehouses and ski slopes” before being let loose in the real world (Guardian). DeepMind is perfectly happy to tie this into the AGI narrative.

Where Marble produces assets and LeCun dreams in latents, Genie 3 produces a simulator: an online environment where you can attempt tasks, observe the results, and learn.
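If you squint, the Genie-style contract is just the familiar reinforcement-learning environment loop, except a generative video model plays the role of the engine. The interface names below are mine, not DeepMind’s API:

```ts
// What makes Genie 3 a "simulator" rather than an asset: it exposes a loop.
type Frame = Uint8ClampedArray; // e.g. one 720p RGB image

interface GeneratedWorld {
  reset(prompt: string): Frame; // "a rainy warehouse at night"
  step(action: string): Frame;  // movement, or an event like "it starts raining"
}

// An agent acts, the model renders the consequence, repeat.
function rollout(world: GeneratedWorld, prompt: string,
                 policy: (f: Frame) => string, steps: number): Frame[] {
  const frames = [world.reset(prompt)];
  for (let i = 0; i < steps; i++) {
    const action = policy(frames[frames.length - 1]);
    frames.push(world.step(action));
  }
  return frames;
}
```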

On HN, when someone asks “How does Marble compare?”, a common answer is:

“Genie is on-the-fly generated video that responds to user input in real time. Marble is a static Gaussian splat asset that you render in a game engine.”

Again, no insult – just classification.

One word, three bets

Put all this together and the “world model” now encompasses at least three different ideas:

  1. World model as interface
    Marble is a polished way to go from words and flat media to 3D environments that humans can edit and share. “The world” is whatever your Quest headset needs to render.

  2. World model as simulator
    Genie-style models provide a continuous, controllable video world where agents can try things, fail, and try again. “The world” is whatever keeps the game loop coherent.

  3. World model as cognition
    LeCun-style architectures are about internal predictive state. “The world” lives inside the agent as latent variables and transition functions.

Fei-Fei Li’s writing borrows heavily from bucket (3) – embodied agents, intuitive physics – while Marble, so far, mostly occupies bucket (1). LeCun’s plans live squarely in (3), with the hope that someone, someday, will build a good version of (2) on top. Genie sits between (2) and (3), with occasional marketing holidays in all three.

If you only look at Marble’s demo, it’s tempting to say that “world model” is just 3DGS with better PR. If you only read LeCun, it’s tempting to believe that language models are a dead end and JEPA will save us all. If you only read DeepMind, it’s simulated ski slopes all the way down.

The truth is that they are all building different parts of the same vague ambition: give machines some structured way to think about the world beyond next-token prediction. One group starts with rendering, one with simulated interaction, one with internal prediction.

Until the jargon settles, the safest move when you see “world model” in a headline is to ask three questions:

  1. Who is it for: something humans look at, a place to train agents, or a box inside a block diagram?
  2. What does it output: static assets, real-time frames, or mostly latent states?
  3. If you knock over a virtual vase, does the system remember anything for more than one frame?

If the answers are “for humans”, “static assets”, and “not really”, you’re basically looking at a pretty cool Gaussian splat viewer. If they’re “for agents”, “real time”, and “yes, in latent space”, you might just be looking at the world model LeCun is talking about – one that, very inconveniently for demo culture, doesn’t fit into a single tweetable GIF.


