Comment by simonw

9 months ago

"Any information found in a web search about Newman will be available in the training set"

I don't think that is a safe assumption these days. Training a modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.

The AI labs still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.