
Comment by ivan_gammel

3 days ago

I think it’s a bit different. Evolution did not give us the dataset; it established the most efficient training path, and the data, in enormous volume, starts arriving immediately after birth. Humans learn continuously through our senses and use sleep to compress the context. The amount of data that LLMs receive only appears big: in our first 20 years of life we consume at least one order of magnitude more information than training datasets contain, and if we count raw data, maybe 4–5 orders of magnitude more. It’s also a different kind of information, passing through a probably much more complex processing pipeline (our brain consciously processes only a tiny fraction of the input bandwidth, with compression happening along the delivery channels), which may be the key to understanding why LLMs do not perform better.

Sorry, but this is patently rubbish: we do not consume orders of magnitude more data than the training datasets, nor do we "process" it in anything like the same way.

Firstly, most of what we see, hear, and experience is extremely repetitive: for the first several years of our lives we see the same people, see the same house, repeatedly read the same few very basic books, and so on. Sure, you can make this argument purely in terms of "bytes" of data, i.e. humans are getting this super-HD video feed, which means more data than an LLM. Well, we are getting a "video feed", but mostly of the same walls in the same room, which doesn't really mean much of anything at all.

Meanwhile, LLMs are getting LITERALLY all of humanity's recorded textual knowledge, more recorded audio than 10,000 humans could listen to in their lifetimes, more (and more varied) images than a single person could view in an entire life, reinforcement learning on the hardest maths, science, and programming questions, etc.

The idea that humans absorbing "video" somehow amounts to more "data" than frontier LLMs are trained on is, honestly, laughable.
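The raw-byte disagreement above is easy to make concrete. Here is a minimal back-of-envelope sketch; every figure in it is an illustrative assumption (sensory bandwidth, corpus size, and bytes per token are all contested), not a measured value:

```python
# Back-of-envelope: raw human sensory input over 20 years vs. the size of
# a large LLM pretraining corpus. All constants are illustrative assumptions.

SENSORY_BYTES_PER_SEC = 1_000_000   # assume ~1 MB/s of raw visual/audio input
WAKING_HOURS_PER_DAY = 16
YEARS = 20

human_bytes = SENSORY_BYTES_PER_SEC * WAKING_HOURS_PER_DAY * 3600 * 365 * YEARS

TRAIN_TOKENS = 15e12                # assume a ~15-trillion-token corpus
BYTES_PER_TOKEN = 4                 # rough average for subword tokens

llm_bytes = TRAIN_TOKENS * BYTES_PER_TOKEN

print(f"human raw sensory bytes: {human_bytes:.2e}")
print(f"LLM corpus bytes:        {llm_bytes:.2e}")
print(f"ratio (human / LLM):     {human_bytes / llm_bytes:.1f}x")
```

Under these particular assumptions the two come out within an order of magnitude of each other, which is exactly why the conclusion swings so hard on which bandwidth figure you pick.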

  • I like your confidence, but I think you missed a few things here and there.

    Training datasets are repetitive too. Say you feed some pretty large code bases to an LLM: how many times will a for loop appear? How many times are Newton's laws (or any other important ideas) mentioned? Not once, not twice, but many more. How many times will you encounter a description of Paris, London or St. Petersburg? If you eliminated the repetition, how much data would actually be left? And what's the point anyway: this repetition is a required part of the training, because it places the data in context, linking it to everything else.

    Is the repetition in our sensory inputs really any different? If you have had children, or the opportunity to observe how they learn, you know they are never confined to a static repetition cycle. They experience things again and again in a dynamic environment that evolves over time. When they draw a line, they get instant feedback and learn from it, so the next line is different. When they watch something on TV for the fifth time, they do not sit still, they interact — and learn, through dancing, repeating phrases and singing songs. In a familiar environment that they have seen so many times, they notice subtle changes and ask about them. What was that sound? What was that blinking light outside? Who just came in, and what's in that box? Our ability to analyze and generalize probably comes from those small observations that happen again and again.

    Even more importantly, when nothing is changing, they learn through getting bored. Show me an LLM that can get bored while digging through another pointless conversation on Reddit. When sensory inputs do not bring anything valuable, children learn to compensate through imagination and games, finding ways to use those inputs better.

    You measure the quality of data using the wrong metrics. Intelligence is not defined by the number of known facts, but by the ability to adapt and deal with the unknown. The inputs humans receive prepare us for that better than all the written knowledge of the world available to an LLM.