Comment by simonw

1 day ago

It uses FineWeb, which is derived from Common Crawl, which is an unlicensed scrape of web pages.

You don't need a license to scrape the public web and analyze it, turn it into tokens and other transformations. Let's not expand copyright beyond the horrible monster it already is.

  • I think it's likely that US law will continue to find training on scraped, unlicensed data to be legal.

    That doesn't mean much to the many people I know of who refuse to use a technology that they see as being unethically created using the work of others without compensating them.

    I continue to hope that someone will train a "vegan" model on licensed or out-of-copyright data so those people can experience the benefits of this class of technology.

    (I compare them to vegans because, like vegans, I think their ethical position is credible and has merit even though I do not choose the same ethical framework for myself.)

    • I don't know if it's ethically better to use LLMs trained on data licensed from X, Reddit, Stack Overflow, Sony, CNN and all the big content aggregators that have, or will have, agreements with big tech. I'd prefer to focus on mechanisms that force reciprocation of the donation: scrape and train at will, but at least publish the models as open weights. Anyway, vegan LLMs already exist; see the work of Pleias.ai.

    • This is as ethical as it gets. They're getting compensated by being able to use the result of their work freely. This is the rising tide that lifts all boats.
