Comment by orbital-decay
4 days ago
>You could actually wonder that one possible explanation for the human sample efficiency that needs to be considered is evolution. Evolution has given us a small amount of the most useful information possible.
It's definitely not small. Evolution performed a humongous amount of learning, with modern Homo sapiens, an insanely complex molecular machine, as a result. We are able to learn quickly by leveraging this "pretrained" evolutionary knowledge/architecture. It's the same reason ICL has great sample efficiency.
Moreover, the community of humans created a mountain of knowledge as well, communicating, passing it down the generations, and iteratively compressing it. Everything that you can do beyond your very basic functions, from counting to quantum physics, is learned from 100% synthetic data optimized for faster learning by that collective, massively parallel process.
It's pretty obvious that artificially created models don't have synthetic datasets of the quality even remotely comparable to what we're able to use.
I think it’s a bit different. Evolution did not give us the dataset. It helped us to establish the most efficient training path, and the data, an enormous volume of it, starts coming immediately after birth. Humans learn continuously through our senses and use sleep to compress the context. The amount of data that LLMs receive only appears big. In our first 20 years of life we consume at least one order of magnitude more information than training datasets contain. If we count raw data, maybe 4-5 orders of magnitude more. It’s also a different kind of information and probably a much more complex processing pipeline (since our brain consciously processes only a tiny fraction of the input bandwidth, with compression happening along the delivery channels), which is probably the key to understanding why LLMs do not perform better.
Sorry, but this is patently rubbish: we do not consume orders of magnitude more data than the training datasets, nor do we "process" it in anything like the same way.
Firstly, most of what we see, hear, experience, etc. is extremely repetitive. I.e. for the first several years of our lives we see the same people, see the same house, repeatedly read the same few very basic books, etc. So, you can make this argument purely based on "bytes" of data, i.e. humans are getting this super HD video feed, which means more data than an LLM. Well, we are getting a "video feed", but mostly of the same walls in the same room, which doesn't really mean much of anything at all.
Meanwhile, LLMs are getting LITERALLY all of humanity's recorded textual knowledge, more recorded audio than 10000 humans could listen to in their lifetimes, more images and more varied images than a single person could view in their entire life, reinforcement learning on the hardest maths, science, and programming questions, etc.
The idea that humans absorbing "video" somehow means more "data" than frontier LLMs are trained with is laughable, honestly.
I like your confidence, but I think you missed a few things here and there.
Training datasets are repetitive too. Let’s say you feed some pretty large code bases to an LLM: how many times will there be a for loop? Or how many times are Newton's laws (or any other important ideas) mentioned there? Not once, not twice, but many more times. How many times will you encounter a description of Paris, London or St. Petersburg? If you eliminate repetition, how much data will actually be left there? And what’s the point anyway: this repetition is a required part of the training, because it places that data in context, linking it to everything else.
Is the repetition that we have in our sensory inputs really different? If you have children or have had the opportunity to observe how they learn, you know they are never confined to the same static repetition cycle. They experience things again and again in a dynamic environment that evolves over time. When they draw a line, they get instant feedback and learn from it, so that the next line is different. When they watch something on TV for the fifth time, they do not sit still; they interact and learn, through dancing, repeating phrases and singing songs. In a familiar environment that they have seen so many times, they notice subtle changes and ask about them. What was that sound? What was that blinking light outside? Who just came in and what’s in that box? Our ability to analyze and generalize probably comes from those small observations that happen again and again.
Even more importantly, when nothing is changing, they learn through getting bored. Show me an LLM that can get bored when digging through another pointless conversation on Reddit. When sensory inputs do not bring anything valuable, children learn to compensate through imagination and games, finding ways to utilize those inputs better.
You measure the quality of data using the wrong metrics. Intelligence is not defined by the number of known facts, but by the ability to adapt and deal with the unknown. The inputs that humans use prepare us for that better than all the written knowledge of the world available to an LLM.
I think the important part in that statement is the "most useful information"; the size itself is pretty subjective because it's such an abstract notion.
Evolution gave us very good spatial understanding/prediction capabilities, good value functions, dexterity (both mental and physical), memory, communication, etc.
> It's pretty obvious that artificially created models don't have synthetic datasets of the quality even remotely comparable to what we're able to use.
This might be controversial, but I don't think the quality or amount of data would matter as much as people think if we had systems capable of learning in a way similar enough to how humans and other animals do. Much of our human knowledge has accumulated in a short time span, and independent discovery of knowledge is quite common. It's obvious that the corpus of human knowledge is not a prerequisite of general intelligence, yet this corpus is what's chosen to train on.
Please stop comparing these things to biological systems. They have very little in common.
I'm talking about any processes that can be vaguely described as learning/function fitting and that share the same general properties with any other learning. Not just biological processes; e.g. the human distributed knowledge distillation process is purely social.
Structurally? Yes.
On the other hand, outputs of these systems are remarkably close to outputs of certain biological systems in at least some cases, so comparisons in some projections are still valid.
That's like saying that a modern calculator and a mechanical arithmometer have very little in common.
Sure, the parts are all different, and the construction isn't even remotely similar. They just happen to be doing the same thing.
But they just don't happen to be doing the same thing. People claiming otherwise have to first prove that we are comparing the same thing.
This whole strand of “intelligence is just compression” may be possible, but it's just as likely (if not massively more likely) that compression is just a small piece of how biological intelligence works, or not part of it at all.
In your analogy it's more like comparing a modern calculator to a book. They might have the same answers, but the calculator gets to them through a completely different process. The process is the key part. I think more people would be excited by a calculator that only counts to 99 than by a super massive book that has all the math results ever produced by humankind.
They are doing "the same thing" only from the point of view of function, which only makes sense from the point of view of the thing utilizing this function (e.g. a clerical worker that needs to add numbers quickly).
Otherwise, if "the parts are all different, and the construction isn't even remotely similar", how can the thing they're doing be "the same"? More importantly, how is it possible to make useful inferences about one based on the other if that's the case?
If we think of every generation as a compression step of some form of information into our DNA, and early humans have existed for ~1,000,000 years, and a generation happens every ~20 years on average, then we have only ~50,000 compression steps to today. Of course, we have genes from both parents, so there is some overlap from others, but especially in the early days the pool of other humans was small. So that still does not look like it is anywhere close to the order of magnitude of modern machine learning. Sure, early humans already had a lot of information in their DNA, but still
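As a rough sanity check of the arithmetic above, here is a minimal sketch. Both inputs are just the ballpark figures from this comment (about a million years, about 20 years per generation), not measured values, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the generation count in the comment above.
# Both constants are the comment's own rough assumptions, not measured data.
YEARS_OF_EARLY_HUMANS = 1_000_000  # assumed span of human existence, in years
YEARS_PER_GENERATION = 20          # assumed average generation length, in years


def generations(total_years: int, years_per_generation: int) -> int:
    """Number of whole generations that fit into a span of years."""
    return total_years // years_per_generation


human_generations = generations(YEARS_OF_EARLY_HUMANS, YEARS_PER_GENERATION)
print(human_generations)  # 50000
```

Under those assumptions the count comes out at 50,000 sequential "compression steps".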
It only ends up in the DNA if it helps reproductive success in aggregate (at the population level) and is something that can be encoded in DNA.
Your comparison is nonsensical and simultaneously manages to ignore the billion or so years of evolution starting from the first proto-cell with the first proto-DNA or RNA.
Aren't you agreeing with his point?
The process of evolution distilled down all that "humongous" amount to what is most useful. He's basically saying our current ML methods to compress data into intelligence can't compare to billions of years of evolution. Nature is better at compression than ML researchers, by a long shot.
>Aren't you agreeing with his point? ... Nature is better at compression than ML researchers, by a long shot.
What I mean is basically the opposite. Nature is not better as in more efficient. It just had a lot more time and scale to do it in an inefficient way. The reason we're learning quickly is that we can leverage that accumulated knowledge, in a manner similar to in-context learning or other multi-step learning (the bulk of the training forms abstractions which are then used by the next stage). It's really unlikely we have some magical architecture that is fundamentally better than e.g. transformers or any other architecture at sample efficiency while having bad underlying data. My intuition is there might even be a hard limit to that. Multi-stage bootstrap might be the key, not the architecture.
Same for the social process of knowledge transfer/compression.
Sample efficiency isn't the ability to distill a lot of data into good insights. It's the ability to get good insights from less data. Evolution didn't do that; it had a lot of samples to get to where it did.
> Sample efficiency isn't the ability to distill a lot of data into good insights
Are you claiming that I said this? Because I didn't....
There are two things going on.
One is compressing lots of data into generalizable intelligence. The other is using generalized intelligence to learn from a small amount of data.
Billions of years and all the data that goes along with it -> compressed into efficient generalized intelligence -> able to learn quickly with little data