Comment by reedciccio

9 hours ago

You don't need a license to scrape the public web, analyze it, turn it into tokens, or apply other transformations to it. Let's not expand copyright beyond the horrible monster it already is.

I think it's likely that US law will continue to find training on scraped, unlicensed data to be legal.

That doesn't mean much to the many people I know of who refuse to use a technology that they see as being unethically created using the work of others without compensating them.

I continue to hope that someone will train a "vegan" model on licensed or out-of-copyright data so those people can experience the benefits of this class of technology.

(I compare them to vegans because, like vegans, I think their ethical position is credible and has merit even though I do not choose the same ethical framework for myself.)

  • I don't know if it's ethically better to use LLMs trained on data licensed from X, Reddit, Stack Overflow, Sony, CNN and all the big content aggregators that have, or will have, agreements with big tech. I'd prefer to focus on mechanisms that force the donation to be reciprocated: scrape and train at will, but at least publish the models as open weights. Anyway, the vegan LLMs exist; see the work of Pleias.ai.

  • This is as ethical as it gets. They're compensated by being able to freely use the result of their work. This is the rising tide that lifts all boats.