Comment by simonw

1 day ago

It uses FineWeb, which is derived from Common Crawl, which is an unlicensed scrape of web pages.

5 comments

simonw

You don't need a license to scrape the public web and analyze it, turn it into tokens, or apply other transformations. Let's not expand copyright beyond the horrible monster it already is.

simonw 7 hours ago
I think it's likely that US law will continue to find training on scraped, unlicensed data to be legal.
That doesn't mean much to the many people I know of who refuse to use a technology that they see as being unethically created using the work of others without compensating them.
I continue to hope that someone will train a "vegan" model on licensed or out-of-copyright data so those people can experience the benefits of this class of technology.
(I compare them to vegans because, like vegans, I think their ethical position is credible and has merit even though I do not choose the same ethical framework for myself.)
- reedciccio 12 minutes ago
  
  I don't know if it's ethically better to use LLMs trained on data licensed from X, Reddit, Stack Overflow, Sony, CNN, and all the big content aggregators that have, or will have, agreements with big tech. I'd prefer to focus on mechanisms that force reciprocity for the donation: scrape and train at will, but at least publish the models as open weights. Anyway, the vegan LLMs exist; see the work of Pleias.ai.
- EnergyAmy 1 hour ago
  
  This is as ethical as it gets. They're getting compensated by being able to use the result of their work freely. This is the rising tide that lifts all boats.
  