Comment by simonw
9 months ago
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training a modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.