
Comment by astrange

18 days ago

There's no such thing as a "world model". That is metaphor-driven development, a holdover from GOFAI, where researchers would just make up a concept and assume it existed because they had made it up. LLMs are capable of approximating such a thing because they are capable of approximating anything if you train them to do it.

> or, to the extent they do, this world model is solely based on token patterns

Obviously not true because of RL environments.

> There's no such thing as a "world model"

There obviously is in humans. When you visually simulate something, or imagine how food will taste as you add different seasonings, you are modeling (part of) the world. This is presumably done via associations in the brain between the different qualia sequences and other kinds of mental representations; we know, for instance, that we perform some visuospatial reasoning tasks using sequences of (imagined) images. Imagery is one aspect of our world model(s).

We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens. A VLM or other multimodal model might be able to do so, but an LLM can't, and so an LLM can't have a visual world model. An LLM might, in special cases, construct a linguistic model that lets it do some computer vision tasks, but that model will itself still only operate on tokenized words.
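To illustrate that last point, here is a toy sketch (my own hypothetical example, not from the thread): you can hand a text-only LLM a "vision" task by serializing an image into characters, but everything the model actually receives is tokenized text, not pixels.

    # Toy illustration (hypothetical): a binary image serialized into an
    # ASCII grid so a text-only LLM could be prompted about it. Whatever
    # the model does with this, it is operating on text tokens throughout.
    image = [
        [0, 0, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
    ]

    def to_text_grid(img):
        """Render a binary image as an ASCII grid for a text prompt."""
        return "\n".join("".join("#" if px else "." for px in row) for row in img)

    prompt = (
        "Here is a 4x5 binary image as an ASCII grid ('#' = filled):\n"
        + to_text_grid(image)
        + "\nHow many connected filled regions are there?"
    )
    print(prompt)  # the LLM would see only these characters, never pixels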

There are all sorts of other sensory modalities and faculties that humans use when thinking, including actual logic and reasoning, which goes beyond mere semantics and might involve various forms of consistency (e.g. consistency with a relevant mental image). The "world model" concept is supposed, in part, to point to these things that are more than just language and tokens.

> Obviously not true because of RL environments.

Right, AI systems generally can have much more complex world models than LLMs can. An LLM can't even handle, e.g., sensor data without significant architectural and training modifications (https://news.ycombinator.com/item?id=46948266), at which point it is no longer an LLM.
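For concreteness, here is a rough sketch of the kind of architectural modification involved (entirely hypothetical; the adapter design and dimensions are my assumptions, not anything from the linked thread): continuous sensor readings can't be fed to a text-only LLM directly, so one common pattern is to train a learned adapter that projects them into the model's embedding space as "soft tokens", at which point you are building a multimodal system rather than prompting an LLM.

    # Hypothetical sketch: project raw sensor readings into an LLM's
    # token-embedding space. None of this works without training, which
    # is the point: the result is no longer a plain text-only LLM.
    import torch
    import torch.nn as nn

    class SensorAdapter(nn.Module):
        """Pools a window of continuous sensor readings into a fixed number
        of "soft tokens" living in the LLM's embedding space (d_model)."""

        def __init__(self, n_channels: int, d_model: int, n_soft_tokens: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(n_channels, d_model),
                nn.GELU(),
                nn.Linear(d_model, d_model),
            )
            # Learned queries that cross-attend over the sensor sequence.
            self.queries = nn.Parameter(torch.randn(n_soft_tokens, d_model))
            self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

        def forward(self, readings: torch.Tensor) -> torch.Tensor:
            # readings: (batch, time, n_channels) of raw continuous values
            x = self.proj(readings)                    # (batch, time, d_model)
            q = self.queries.expand(x.size(0), -1, -1) # (batch, n_soft, d_model)
            pooled, _ = self.attn(q, x, x)             # attention pooling
            return pooled  # prepend to the text embeddings before the LLM

    # Example: 16 sensor channels, a 768-dim toy LLM, 8 soft tokens.
    adapter = SensorAdapter(n_channels=16, d_model=768, n_soft_tokens=8)
    soft_tokens = adapter(torch.randn(2, 100, 16))
    print(soft_tokens.shape)  # torch.Size([2, 8, 768])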

  • > When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world.

    Modeling something as an action is not "having a world model". A model is a persistently existing thing, and humans don't construct persistently existing models because it would be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.

    > We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.

    All frontier LLMs are multimodal to some degree; ChatGPT's thinking mode makes the heaviest use of it.

    • > Modeling something as an action is not "having a world model".

      It literally is; this is definitional. See, for example, how these terms are used in the V-JEPA-2 paper (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere close to being multimodal in the way that human thinking and reasoning is.