Comment by Salgat

6 hours ago

This is a misleading statement. The "private data" is still largely publicly produced data that has been obtained through private agreements instead of scraping, such as Reddit posts/comments (these are the "third-party data agreements" that companies like OpenAI mention). And yes, there is still a lot of processing done on this data, which is the norm for preparing training data.

This is doubly misleading. A lot of private data is sourced through providers such as Mercor, who pay experts to answer questions and write out their reasoning (e.g. paying a software engineer to write a project from scratch and recording every keystroke, or paying a Chem PhD to answer hard Chem questions). A second source of private data is custom RL environments with fine-grained intermediate rewards for, e.g., software engineering and financial modeling. Also, imagine the amount of usage data recorded by Claude Code and the like. Pretraining is mostly curated public data; post-training is increasingly private expert data and tests.

Source: Work at a lab, common knowledge.

  • Well, since you work at a lab, you should know that most capabilities arise in pretraining, not post-training or mid-training, and that the latter two mostly serve to bring out the hidden intelligence in these models.

    Source: also work at a lab.

No, it isn't. The private data is largely private data, created by highly specialized, highly paid contracted teams of experts for domains such as finance, SWE, consulting, etc.

Reddit data is just not that interesting; that deal is worth like $60M/year. Labs spend 10x as much on computer-use RL environments.

  • Sorry, but your argument doesn't seem coherent. How is the cost of RL relevant here?

    It would also help if you could substantiate your initial claim (i.e. "internet training data is not where frontier capabilities come from").

    • The RL environment (instruction, stateful container, reward function) is the training data product being bought.
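
To make the three parts named above concrete, here is a minimal Python sketch of an environment as an (instruction, stateful container, reward function) bundle. Everything in it is hypothetical and for illustration only: the `RLEnvironment` class, the `step` method, and the toy `reward` function are assumptions, and a plain dict stands in for a real stateful container. It is not how any particular lab or vendor packages these.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical stand-in for a stateful container: file path -> file contents.
State = Dict[str, str]


@dataclass
class RLEnvironment:
    """Illustrative bundle of instruction, mutable state, and reward function."""
    instruction: str
    reward_fn: Callable[[State], float]
    state: State = field(default_factory=dict)

    def step(self, path: str, contents: str) -> float:
        """Apply one action (write a file) and score the resulting state."""
        self.state[path] = contents
        return self.reward_fn(self.state)


def reward(state: State) -> float:
    """Toy reward with intermediate credit: 0.5 if the file exists, 1.0 if its content is right."""
    text = state.get("hello.txt")
    if text is None:
        return 0.0
    return 1.0 if text.strip() == "hello" else 0.5


env = RLEnvironment(
    instruction="Create hello.txt containing the word hello.",
    reward_fn=reward,
)
partial = env.step("hello.txt", "helo")    # file exists, wrong content
full = env.step("hello.txt", "hello\n")    # file exists, correct content
```

In this sketch the state persists between steps, so the reward can give partial credit along the way rather than only scoring a final answer.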