Comment by pjc50
7 months ago
Someone did a crude estimate, dividing OpenAI's valuation by the number of books plagiarized into it, and came up with a figure on the order of $500k per book.
Of course, none of that vast concentration of investor money will go to the authors.
If the government were doing this, people would be screaming about the biggest nationalisation of intellectual property since the rise of Mao.
> Of course, none of that vast concentration of investor money will go to the authors.
There's no reason it should. The authors don't get perpetual royalties from everyone who read their works. Or do you believe I should divide my salary between Petzold, Stroustrup, Ousterhout, Abelson, Sussman, Norvig, Cormen, and a dozen other technical authors, and also between all HN users proportionally to their comment count or karma?
Should my employer pay them as well, and should their customers too, because you can trace a causal chain from some products to the people mentioned, through me?
IP, despite its issues, does not work like that.
> If the government was doing this, people would be screaming about the biggest nationalisation of intellectual property since the rise of Mao.
Or call it the public education system and public library network.
> public education system and public library network
Public libraries do pay reader royalties.
I don't know; I've been on the side of weaker copyright: Aaron Swartz was driven to suicide, and sci-hub is one of the most blocked sites on the Internet. But now it turns out that IP is simply a matter of power. There isn't really any difference between sci-hub / libgen and the scraped training databases other than having money, which suddenly means the rules don't apply.
If you go that route and throw all conventions overboard, there is no reason why Microsoft and OpenAI shouldn't be nationalized. Without compensation.
You know, for the "benefit of society", as these companies never tire of saying.
What conventions?
It's pretty clear to me. The authors of books "plagiarized" into the training corpus are at best entitled to a one-time payment equivalent to the company buying those books. They're not entitled to a percentage of the profits generated by the model. I can't think of any convention that would even remotely imply that.
(I suppose it depends on whether you see the training process more like model learning, vs. more like model being a derived work. The latter feels absurd to me.)
As for OpenAI, et al. - they're selling a service that provides value to people. That's pretty much the most basic business scenario, far more honest than most of the tech industry. And they did create the thing providing value. The training data may be a critical ingredient, but only when collected and aggregated at scale, thoroughly blended, distilled down to explicit and implicit semantics, and solidified into a model that then gets served via a complex piece of computational infrastructure - all of that is what the companies are doing, and all of that is critical to providing this fundamentally new kind of value. It's only fair that they should be compensated for it.
And to be clear - despite their occasional protestations to the contrary, I don't believe OpenAI, Microsoft, Google and other LLM vendors to be working for the "benefit of society" or "good of humanity". I claim that LLMs, as models and as a technology, are of huge value to humanity. Companies come and go, business models change, but inventions remain. Even today, between DeepSeek-R1, the newest Llama models and countless derivatives of them, society can enjoy the benefits of near-SOTA LLMs without being beholden to a few large tech companies. The models and the means to run them are out there, and they are not going away.
Do you happen to remember if that crude estimate assumed that only book authors should get paid, or if this was "total of x tokens, of which y are books, the books are of average length z"?