Comment by coderenegade

8 hours ago

The distillation risk has been brewing for a while now. In a very real sense, the model is the data, so if the data is locked down because of how valuable it is, it was only a matter of time before fully open access to the models would be revoked.

There's also an economic concern that rarely gets mentioned: because no one has cracked continual learning, keeping models up to date and filling gaps in performance requires retraining on an ever-growing dataset. Granted, you aren't starting from scratch each time, but the scaling required just to stay relevant looks daunting.

I don't know where any of this goes on a societal level, but I've believed since the release of DeepSeek R1 that access to frontier models would eventually be locked up behind contracts, since the only moats protecting the models themselves are purely artificial. It remains to be seen how effective China is at pushing the envelope, and whether they are interested in providing unfettered access. And on top of that, it remains to be seen how well these models actually scale in the long run.

They are also not getting the same quantity or quality of data as was possible in the first years of "ingest". Compared to the beginning, what comes from here on is more like a drip feed of new training data: still immense volumes, but we are talking one year of society's data production versus centuries of text and data ingested in a short time frame.

  • For pre-training, yes. But for post-training you need high-quality labelled datasets for reinforcement learning. So far AI has been most successful in coding because usage can be translated into such datasets, producing a virtuous cycle: more usage produces more data, which produces better models, which drives more usage.

    The question is whether this same model can successfully be applied in disciplines like medicine, law, engineering, etc.