Comment by tpdly
1 day ago
I think you undervalue the contribution of internet-scale data to foundation modeling, and because LLMs can obsolete the content they required, I think it's fair to characterize it as theft. Obviously RL contributes a lot to capabilities, but the judgement that an LLM uses to 'synthesize information' is born from the training data. The scale of the data really is beyond intuition. Books3, for example, would take 230 yrs of continuous reading.
I actually think the "proprietary non-deterministic database of the free internet" does a lot to characterize the capabilities and effects for a lot of people. Obviously coders are more in tune with how well agents can work, but that's also due more to the RL breakthroughs than foundation modeling.
As I understand it, RL makes foundation models stupider (less capable, not more) but better at following instructions.
Can you steal something that is free and openly available?
I just don't understand this argument. "Theft" feels like a nice, heavy, moral accusation to toss at those you're debating with, but the actual prerequisites for theft don't even exist in this situation.
It is a lot more complicated than that. Your content is not simply used, copied, or even just distributed. The very terrain on which you produce, distribute, and represent your content has shifted due to the mechanics of it. Anything you produce gets grabbed into AI summaries and into the training data. Humans produce free/open materials for many reasons. A lot of those efforts, e.g. communities, don't have room to breathe and gain structure because AI is siphoning off the entire atmosphere of the web.