Comment by simonw

20 hours ago

It uses FineWeb, which is derived from Common Crawl, an unlicensed scrape of web pages.

2 comments

simonw

You don't need a license to scrape the public web and analyze it, turn it into tokens, or apply other transformations. Let's not expand copyright beyond the horrible monster it already is.

simonw 5 hours ago

I think it's likely that US law will continue to find training on scraped, unlicensed data to be legal.
That doesn't mean much to the many people I know of who refuse to use a technology that they see as being unethically created using the work of others without compensating them.
I continue to hope that someone will train a "vegan" model on licensed or out-of-copyright data so those people can experience the benefits of this class of technology.
(I compare them to vegans because, like vegans, I think their ethical position is credible and has merit even though I do not choose the same ethical framework for myself.)