Comment by islandfox100

5 hours ago

Then it should be simple for one of the frontier labs to produce a model trained only on private data. We haven't seen that.

Didn't the famous "Textbooks are all you need" paper already prove that point three years ago?

Sure, we ask a lot more of modern models, but private training data has also gotten a lot better. You would lose out on a lot of long-tail knowledge, but that can be fixed with web search tools. You'd limit the styles, dialects and colloquial phrases the model understands and can use, but for many use cases that would be fine.

But why would any frontier lab do that? Throwing in more training data still leads to better results in pretraining. And showing that they don't need to hoover up the internet and Anna's Archive would only empower regulators to prevent them from doing that.

  • Maybe I am missing your point, but "Textbooks are all you need" was distilled from GPT-3.5.