Comment by adamtaylor_13
2 days ago
The argument that LLMs are "feeding me back free and open internet" seems to skip the most useful aspect of the tool.
I could never, as an individual read, let alone synthesize and make decisions with, the amount of information on the internet. The LLM takes that free and open information and feeds me back novel information based on that free information. It gives me ideas, opinions, and hard data based on that information.
It's the most powerful information synthesizing tool in existence. I don't find the argument that "it's built on free information and sold to you" fair or plausible at all.
It's like saying you're free to make your own bottled water. Technically true, but in reality not.
I think you undervalue the contribution of internet-scale data to foundation modeling, and because LLMs can obsolete the content they required, I think its fair to characterize it as theft. Obviously RL contributes a lot to capabilities, but the judgement that an LLM uses to 'synthesize information' is born from the training data. The scale of the data really is beyond intuition. books3, for example, would 230 yrs of continuous reading
I actually think the "proprietary non-determenistic database of the free internet" does a lot to characterize the capabilities and effects to a lot of people. Obviously coders are more in tune with how well agents can work, but that's also due more to the RL breakthroughs than foundation modeling.
As I understand RL makes foundation models stupider (less capable, not more) but better at following instructions.
Can you steal something that is free and openly available?
I just don't understand this argument. "Theft" feels like a nice, heavy, moral accusation to toss at those you're debating with, but the actual prerequisites for theft don't even exist in this situation.
It is a lot more complicated than that. Your content is not simply used, copied, or even just simply distributed. The very terrain that you produce, distribute, represent your content has shifted due to the mechanics of it. Anything you produce is grabbed into AI summaries. They're grabbed into the training data. Humans produce free/open materials for many reasons. A lot of them don't have room to breathe and gain structure due to AI siphoning the entire atmosphere of web; eg communities
It’s the solution to the second information problem. Hypertext arose from Bush’s Memex, and the information problem it offered to solve. Now, there is simply so much information available on the modern Memex that it is impossible to make any sense of it all. So, we now have LLMs. There are still some issues with them, but they’re good at what they do.
I have mixed emotions about LLMs and AI more generally. I fear the dystopia, hope for some marginal improvement in human life, and I genuinely enjoy playing around with local models. But, I think there may be near term harms that outweigh the gains. We shall see.
Nothing about the information it feeds you is novel. It's all stolen repetition of someone else's work.
Bizarre to say that. When I have it perform work on a bespoke code base on a niche videogame, in a less commonly used language, is that still "regurgitating stuff"?
No, it is impossible for it to have seen this combination of things.
It routinely produces, suggests, and correctly implements novel things that had not existed.
You can see this yourself by learning how LLMs work, or anecdotally using these tools.
LLMs are terrible at generating code for “less commonly used languages”. They require LOTS of data for high accuracy.
I describe it this way: they are good at interpolating from what data they were trained on, but terrible at extrapolating. I agree with the parent that the LLM-generated content isn’t novel, it’s just a rehash of two things it was trained on.
1 reply →
That is simply not true. The naive “glorified auto-complete / stochastic parrot” argument may have some merit when applied to generic pre-trained models, which only learn from unsupervised next-token prediction. But the post training through reinforcement learning the frontier models undergo is very sophisticated and they genuinely learn to do novel things that are purely the work of the model being trained (and the work of the GPUs they burn along the way of course).
Thank god I bought the alphabet before learning it unlike one of those stealing heathens.
In your hate of AI please don't build the world in The Right to Read.
I'm certain I've read this comment before.