Comment by thomasahle

19 days ago

> This matters because (1) the world cannot be modeled anywhere close to completely with language alone

LLMs being "Language Models" means they model language, it doesn't mean they "model the world with language".

On the contrary, modeling language requires you to also model the world, but that's in the hidden state, and not using language.

Let's be more precise: LLMs have to model the world from an intermediate tokenized representation of the text on the internet. Most of this text is natural language, but to allow for e.g. code and math, let's say "tokens" to keep it generic, even though in practice tokens mostly encode natural language.

LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).

LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. I.e. humans make tokenized models of their actual models (which are still just messy models of the world), and this is what LLMs are trained on.
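To make "tokenized form" concrete, here is a minimal sketch of what a tokenizer does. This is a toy whitespace vocabulary purely for illustration; real LLMs use subword schemes such as BPE, but the key point is the same: the model only ever sees integer IDs.

```python
# Toy tokenizer: text becomes integer IDs before the model sees it.
# Illustrative only -- real LLMs use subword schemes (e.g. BPE), not
# a whitespace vocabulary like this.

def build_vocab(corpus):
    """Assign an integer ID to each distinct whitespace-separated token."""
    tokens = sorted(set(corpus.split()))
    return {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    """Map text to the token-ID sequence the model actually consumes."""
    return [vocab[tok] for tok in text.split()]

corpus = "the cat sat on the mat"
vocab = build_vocab(corpus)   # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
ids = encode("the cat sat", vocab)
print(ids)  # [4, 0, 3] -- the model only sees these integers
```

Everything the model learns, it learns from sequences like `[4, 0, 3]`; whatever world model emerges lives in the hidden state, not in the tokens themselves.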

So, do "LLMs model the world with language"? Well, they are constrained in that they can only model the world that is already modeled by language (generally: tokenized). So the "with" here is vague. But the patterns encoded in the hidden state are still patterns of tokens.

Humans can have models that are much more complicated than patterns of tokens. Non-LLM models (e.g. models connected to sensors, such as those in self-driving vehicles, and VLMs) can use more than simple linguistic tokens to model the world, but LLMs are deeply constrained relative to humans, in this very specific sense.

  • I don't get the importance of the distinction really. Don't LLMs and Large non-language Models fundamentally work kind of similarly underneath? And use similar kinds of hardware?

    But I know very little about this.

    • You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models. It's the so-called latent space, and people who focus on next-token prediction completely miss the point that all the interesting thinking takes place in abstract world-model space.

  • They do not model the world.

    They present a statistical model of an existing corpus of text.

    If this existing corpus includes useful information it can regurgitate that.

    It cannot, however, synthesize new facts by combining information from this corpus.

    The strongest thing you could feasibly claim is that the corpus itself models the world, and that the LLM is a surrogate for that model. But this is not true either. The corpus of human-produced text is messy, containing mistakes, contradictions, and propaganda; it has to be interpreted by someone with an actual world model (a human) in order to be applied to any scenario; your typical corpus is also biased towards internet discussions, the English language, and Western prejudices.

    • If we focus on base models and ignore the tuning steps after that, then LLMs are "just" token predictors. But we know that pure statistical models aren't very good at this. After all, we tried for decades to get Markov chains to generate text, and it always became a mess after a couple of words. If you tried to come up with the best way to actually predict the next token, a world model seems like an incredibly strong component. If you know what the sentence so far means, and how it relates to the world, human perception of the world, and human knowledge, that makes guessing the next word/token much more reliable than just looking at statistical distributions.

      The bet OpenAI has made is that if this is the optimal final form, then given enough data and training, gradient descent will eventually build it. And I don't think that's entirely unreasonable, even if we haven't quite reached that point yet. The issues are more in how language is an imperfect description of the world. LLMs seem to be able to navigate the mistakes, contradictions, and propaganda with some success, but fail at things like spatial awareness. That's why OpenAI is pushing image models and 3D world models, despite making very little money from them: they are working towards LLMs with more complete world models, unchained by language.

      I'm not sure if they are on the right track, but from a theoretical point of view I don't see an inherent fault.
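      The Markov-chain comparison above is easy to demonstrate. Below is a minimal sketch of a bigram Markov chain (toy corpus and function names are my own, purely illustrative): the next word is chosen only from counts of word pairs, with no representation of meaning, which is why such models drift into incoherence after a few words.

```python
# Toy bigram Markov chain: predicts the next word purely from counts of
# word pairs in a tiny corpus. There is no world model here, only
# surface co-occurrence statistics -- which is why output drifts.
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Record, for each word, the list of words seen following it."""
    words = corpus.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, start, n=8, seed=0):
    """Walk the chain: repeatedly sample a follower of the last word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        followers = table.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat the dog sat on the cat"
table = train_bigrams(corpus)
sample = generate(table, "the")
print(sample)  # locally plausible word pairs, globally meaningless
```

      Each consecutive pair in the output is a valid bigram from the corpus, yet the whole sentence means nothing; scaling up the context and the parameter count is what separates an LLM from this.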

    • > It cannot, however, synthesize new facts by combining information from this corpus.

      That would be like saying studying mathematics can't lead to someone discovering new things in mathematics.

      Nothing would ever be "novel" if studying the existing knowledge could not lead to novel solutions.

      GPT 5.2 Thinking is solving Erdős Problems that had no prior solution - with a proof.

    • > It cannot, however, synthesize new facts by combining information from this corpus.
      Are we sure? Why can't the LLM use tools, run experiments, and create new facts like humans?
