Comment by D-Machine
18 days ago
It isn't a misnomer at all, and comments like yours are why it is increasingly important to remind people about the linguistic foundations of these models.
For example, no matter how many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it. The reading can certainly help, at least in theory, but, in practice, is not necessary and may even hurt (if it causes processes that need to be unconscious to be held too strongly in consciousness, due to the linguistic model presented in the book).
This is why LLMs being so strongly tied to natural language is still an important limitation (even if it is clearly less limiting than most expected).
> no matter how many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it
This is like saying that no matter how much you know theoretically about a foreign language you still need to train your brain to speak it. It has little to do with the reality of that language or the correctness of your model of it, but rather with the need to train realtime circuits to do some work.
Let me try some variations: "no matter how many books you read about ancient history, you need to have lived there before you can reasonably talk about it". "No matter how many books you have read about quantum mechanics, you need to be a particle..."
> It has little to do with the reality of that language or the correctness of your model of it, but rather with the need to train realtime circuits to do some work.
To the contrary, this is purely speculative and almost certainly wrong: riding a bike is co-ordinating the realtime circuits in the right way, and language and a linguistic model fundamentally cannot get you there.
There are plenty of other domains like this, where semantic reasoning (e.g. unquantified syllogistic reasoning) just doesn't get you anywhere useful. I gave an example from cooking later in this thread.
You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition. Talk to e.g. actual mathematicians, and they will generally tell you they may broadly recruit visualization, imagined tactile and proprioceptive senses, and hard-to-vocalize "intuition". One has to claim this is all epiphenomenal, or that e.g. all unconscious thought is secretly using language, to think that all modeling is fundamentally linguistic (or more broadly, token manipulation). This is not a particularly credible or plausible claim given the ubiquity of cognition across animals or from direct human experiences, so the linguistic boundedness of LLMs is very important and relevant.
Funny, because riding a bicycle or speaking a language is exactly something people don't have a world model of. Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue. Is "making the right movement at the right time within a narrow boundary of conditions" a world model, or is it just predicting the next move?
> You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition.
I'm not saying that at all. I am saying that any (sufficiently long, varied) coherent speech needs a world model, so if something produces coherent speech, there must be a world model behind it. We can agree that the model is lacking to the extent that the language it produces is incoherent, which is very little these days.
> "no matter how many books you read about ancient history, you need to have lived there before you can reasonably talk about it"
Every single time I travel somewhere new, whatever research I did, whatever reviews or blogs I read or whatever videos I watched become totally meaningless the moment I get there. Because that sliver of knowledge is simply nothing compared to the reality of the place.
Everything you read is through the interpretation of another person. Certainly someone who read a lot of books about ancient history can talk about it - but let's not pretend they have any idea what it was actually like to live there.
So you're saying that every time we talk about anything we don't have direct experience of (the past, the future, places we haven't been to, abstract concepts, etc.) we are in exactly the same position as LLMs are now: lacking a real world model and therefore unintelligent?
You and I can't learn to ride a bike by reading thousands of books about cycling and Newtonian physics, but a robot driven by an LLM-like process certainly can.
In practice it would make heavy use of RL, as humans do.
> In practice it would make heavy use of RL, as humans do.
Oh, so you mean it would be in a harness of some sort that lets it connect to sensors that tell it things about its position, speed, balance, etc.? Well, yes, but then it isn't an LLM anymore, because it has more than language to model things!
Can’t we claim the sensor data (x=5,y=9…) is text too?
Not sure it’s great as just plain text, though; it would be better if the model could understand the position internally somehow.
I have no idea why you used the word “certainly” there.
What is in the nature of bike-riding that cannot be reduced to text?
You know transformers can do math, right?
You are living in the past. These models have been trained on image data for ages, and one interesting finding was that even before that they could model aspects of the visual world astonishingly well, though not perfectly, just through language.
Counterpoint: Try to use an LLM for even the most coarse of visual similarity tasks for something that’s extremely abundant in the corpus.
For instance, say you are a woman with a lookalike celebrity, someone who is a very close match in hair colour, facial structure, skin tone and body proportions. You would like to browse outfits worn by other celebrities (presumably put together by professional stylists) who look exactly like her. You ask an LLM to list celebrities that look like celebrity X, to then look up outfit inspiration.
No matter how long the list, no matter how detailed the prompt in the features that must be matched, no matter how many rounds you do, the results will be completely unusable, because broad language dominates more specific language in the corpus.
The LLM cannot adequately model these facets, because language is in practice too imprecise, as currently used by people.
To dissect just one such facet, the LLM response will list dozens of people who may share a broad category (red hair), with complete disregard for the exact shade of red, whether or not the hair is dyed and whether or not it is indeed natural hair or a wig.
The number of listicles clustering these actresses together as redheads will dominate anything with more specific qualifiers, like ‘strawberry blonde’ (which in general counts as red hair), ‘undyed hair’ (which in fact tends to increase the proportion of dyed hair results, because that’s how linguistic vector similarity works sometimes) and ‘natural’ (which again seems to translate into ‘the most natural looking unnatural’, because that’s how language tends to be used).
You've clearly never read an actual paper on the models and understand nothing about backbones, pre-training, or anything I've said in my posts in this thread. I've made claims far more specific about the directionality of information flow in Large Multimodal Models, and here you are just providing generic abstract claims far too vague to address any of that. Are you using AI for these posts?