Comment by miohtama
11 days ago
Google also wins with Google Books, as other Western companies cannot get training material in the same scale. Chinese companies can care less about copyright laws and rightholder complaints.
11 days ago
Google also wins with Google Books, as other Western companies cannot get training material in the same scale. Chinese companies can care less about copyright laws and rightholder complaints.
Google's advantage is mostly in historical books. Google Books has a great collection going back to the 1500s.
For modern works anyone can just add Z-Library and Anna's Archive. Meta got caught, but I doubt they were the only ones (in fact ElutherAI famously included the pirated Books3 dataset in their openly published dataset for GPT-Net and GPT-J and nothing really bad happened)
Anthropic has apparently gone and redone the Google books thing, buying a copy of every book and scanning it (per a ruling in a recent lawsuit against them).