Comment by D-Machine

19 days ago

Fun play on words. But yes, LLMs are Large Language Models, not Large World Models. This matters because (1) the world cannot be modeled anywhere close to completely with language alone, and (2) language only somewhat models the world (much in language is convention, wrong, or not concerned with modeling the world, but other concerns like persuasion, causing emotions, or fantasy / imagination).

It is somewhat complicated by the fact LLMs (and VLMs) are also trained in some cases on more than simple language found on the internet (e.g. code, math, images / videos), but the same insight remains true. The interesting question is to just see how far we can get with (2) anyway.

1. LLMs are transformers, and transformers are next state predictors. LLMs are not Language models (in the sense you are trying to imply) because even when training is restricted to only text, text is much more than language.

2. People need to let go of this strange and erroneous idea that humans somehow have this privileged access to the 'real world'. You don't. You run on a heavily filtered, tiny slice of reality. You think you understand electro-magnetism ? Tell that to the birds that innately navigate by sensing the earth's magnetic field. To them, your brain only somewhat models the real world, and evidently quite incompletely. You'll never truly understand electro-magnetism, they might say.

  • LLMs are language models, something being a transformer or next-state predictor does not make it a language model. You can also have e.g. convolutional language models or LSTM-based language models. This is a basic point that anyone with any proper understanding of these models would know.

    Even if you disagree with these semantics, the major LLMs today are primarily trained on natural language. But, yes, as I said in another comment on this thread, it isn't that simple, because LLMs today are trained on tokens from tokenizers, and these tokenizers are trained on text that includes e.g. natural language, mathematical symbolism, and code.

    Yes, humans have incredibly limited access to the real world. But they experience and model this world with far more tools and machinery than language. Sometimes, in certain cases, they attempt to messily translate this messy, multimodal understanding into tokens, and then make those tokens available on the internet.

    An LLM (in the sense everyone means it, which, again, is largely a natural language model, but certainly just a tokenized text model) has access only to these messy tokens, so, yes, far less capacity than humanity collectively. And though the LLM can integrate knowledge from a massive amount of tokens from a huge amount of humans, even a single human has more different kinds of sensory information and modality-specific knowledge than the LLM. So humans DO have more privileged access to the real world than LLMs (even though we can barely access a slice of reality at all).

    • >LLMs are language models, something being a transformer or next-state predictors does not make it a language model. You can also have e.g. convolutional language models or LSTM-based language models. This is a basic point that anyone with any proper understanding of these models would know.

      'Language Model' has no inherent meaning beyond 'predicts natural language sequences'. You are trying to make it mean more than that. You can certainly make something you'd call a language model with convolution or LSTMs, but that's just a semantics game. In practice, they would not work like transformers and would in fact perform much worse than them with the same compute budget.

      >Even if you disagree with these semantics, the major LLMs today are primarily trained on natural language.

      The major LLMs today are trained on trillions of tokens of text, much of which has nothing to do with language beyond the means of communication, millions of images and million(s) of hours of audio.

      The problem as I tried to explain is that you're packing more meaning into 'Language Model' than you should. Being trained on text does not mean all your responses are modelled via language as you seem to imply. Even for a model trained on text, only the first and last few layers of a LLM concerns language.

      2 replies →

  • > People need to let go of this strange and erroneous idea that humans somehow have this privileged access to the 'real world'.

    This is irrelevant, the point is that you do have access to a world which LLMs don't, at all. They only get the text we produce after we interact with the world. It is working with "compressed data" at all times, and have absolutely no idea what we subconsciously internalized that we decided not to write down or why.

    • All of the SOTA LLMs today are trained on more than text.

      It doesn't matter whether LLMs have "complete" (nothing does) or human-like world access, but whether the compression in text is lossy in ways that fundamentally prevent useful world modeling or reconstruction. And empirically... it doesn't seem to be. Text contains an enormous amount of implicit structure about how the world works, precisely because humans writing it did interact with the world and encoded those patterns.

      And your subconscious is far leakier than you imagine. Your internal state will bleed into your writing, one way or another whether you're aware of it or not. Models can learn to reconstruct arithmetic algorithms given just operation and answer with no instruction. What sort of things have LLMs reconstructed after being trained on trillions of tokens of data ?

  • > 2. People need to let go of this strange and erroneous idea that humans somehow have this privileged access to 'the real world'. You don't.

    You are denouncing a claim that the comment you're replying to did not make.

    • They made it implicitly, otherwise this:

      >(2) language only somewhat models the world

      is completely irrelevant.

      Everyone is only 'somewhat modeling' the world. Humans, Animals, and LLMs.

      15 replies →

  • A language model in computer science is a model that predicts the probability of a sentence or a word given a sentence. This definition predates LLMs.

    • A 'language model' only has meaning in so far as it tells you this thing 'predicts natural language sequences'. It does not tell you how these sequences are being predicted or any anything about what's going on inside, so all the extra meaning OP is trying to place by calling them Language Models is well...misplaced. That's the point I was trying to make.

> This matters because (1) the world cannot be modeled anywhere close to completely with language alone

LLMs being "Language Models" means they model language, it doesn't mean they "model the world with language".

On the contrary, modeling language requires you to also model the world, but that's in the hidden state, and not using language.

  • Let's be more precise: LLMs have to model the world from an intermediate tokenized representation of the text on the internet. Most of this text is natural language, but to allow for e.g. code and math, let's say "tokens" to keep it generic, even though in practice, tokens mostly tokenize natural language.

    LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).

    LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. I.e. humans make tokenized models of their actual models (which are still just messy models of the world), and this is what LLMs are trained on.

    So, do "LLMS model the world with language"? Well, they are constrained in that they can only model the world that is already modeled by language (generally: tokenized). So the "with" here is vague. But patterns encoded in the hidden state are still patterns of tokens.

    Humans can have models that are much more complicated than patterns of tokens. Non-LLM models (e.g. models connected to sensors, such as those in self-driving vehicles, and VLMs) can use more than simple linguistic tokens to model the world, but LLMs are deeply constrained relative to humans, in this very specific sense.

    • I don't get the importance of the distinction really. Don't LLMs and Large non-language Models fundamentally work kind of similarly underneath? And use similar kinds of hardware?

      But I know very little about this.

      2 replies →

    • They do not model the world.

      They present a statistical model of an existing corpus of text.

      If this existing corpus includes useful information it can regurgitate that.

      It cannot, however, synthesize new facts by combining information from this corpus.

      The strongest thing you could feasibly claim is that the corpus itself models the world, and that the LLM is a surrogate for that model. But this is not true either. The corpus of human produced text is messy, containing mistakes, contradictions, and propaganda; it has to be interpreted by someone with an actual world model (a human) in order for it to be applied to any scrnario; your typical corpus is also biased towards internet discussions, the english language, and western prejudices.

      32 replies →

Modern LLMs are large token models. I believe you can model the world at a sufficient granularity with token sequences. You can pack a lot of information into a sequence of 1 million tokens.

Large Language Models is a misnomer- these things were originally trained to reproduce language, but they went far beyond that. The fact that they're trained on language (if that's even still the case) is irrelevant- it's like claiming that student trained on quizzes and exercise books are only able to solve quizzes and exercises.

  • It isn't a misnomer at all, and comments like yours are why it is increasingly important to remind people about the linguistic foundations of these models.

    For example, no matter many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it. The reading can certainly help, at least in theory, but, in practice, is not necessary and may even hurt (if it makes certain processes that need to be unconscious held too strongly in consciousness, due to the linguistic model presented in the book).

    This is why LLMs being so strongly tied to natural language is still an important limitation (even it is clearly less limiting than most expected).

    • > no matter many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it

      This is like saying that no matter how much you know theoretically about a foreign language you still need to train your brain to talk it. It has little to do with the reality of that language or the correctness of your model of it, but rather with the need to train realtime circuits to do some work.

      Let me try some variations: "no matter how many books you read about ancient history, you need to have lived there before you can reasonably talk about it". "No matter how many books you have read about quantum mechanics, you need to be a particle..."

      18 replies →

    • You and I can't learn to ride a bike by reading thousands of books about cycling and Newtonian physics, but a robot driven by an LLM-like process certainly can.

      In practice it would make heavy use of RL, as humans do.

      19 replies →

    • you are living in the past these models have been trained on image data for ages, and one interesting find was that even before that they could model aspects of the visual world astonishingly well even though not perfect just through language.

      2 replies →