Comment by astrange

18 days ago

> When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world.

Modeling something as an action is not the same as "having a world model". A model is a persistent thing, but humans don't build persistent models because it would be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.

> We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.

All frontier LLMs are multimodal to some degree. ChatGPT's thinking mode makes the heaviest use of it.

> Modeling something as an action is not "having a world model".

It literally is; this is definitional. See, e.g., how these terms are used in the V-JEPA 2 paper (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere near multimodal in the way that human thinking and reasoning are.