Comment by namlem
3 days ago
It would be incredible for LLMs. Searching it, using it as training data, etc. Would probably have to be done in Russia or some other country that doesn't respect international copyright though.
Do you have a reason to believe this ain't already being done? I would assume that the big guys like OpenAI are already training on basically all text in existence.
In fact, Facebook torrented Anna's Archive and got busted for it, because of course they did:
https://torrentfreak.com/meta-torrented-over-81-tb-of-data-t...
Every LLM maker probably did the same. Facebook just has disgruntled employees who leaked it.
Wasn't it confirmed that Meta does this?
https://www.forbes.com/sites/danpontefract/2025/03/25/author...
> Would probably have to be done in Russia or some other country that doesn't respect international copyright though.
Incredible, several years of major American AI companies showing that flouting copyright only matters if it's college kids torrenting shows or enthusiasts archiving bootlegs on What.CD, but if it's big corpos doing it, it's necessary for innovation.
Yet some people still believe "it would have to be done in evil Russia".
OP does have an exaggerated statement - it's not like there aren't laws in Russia or something, and I largely agree with your sentiment. I think there are levels to this though, and it's pretty clear that Russia is much riskier than the USA when it comes to IP - just look up anything to do with insuring IP risk in Russia (here's one such example: https://baa.no/en/articles/i-have-ip-in-russia-is-my-ip-at-r...)
Also, according to the Office of the US Trade Representative, Russia is on the Priority Watch List of countries that do not respect IP [1], and post-2022, largely due to the war, Russia implemented measures negatively affecting IP rights. [2,3]
If you think it isn't the case and Russia is just as risky as the US when it comes to copyright and IP, I would be interested to know why.
1. https://ustr.gov/about/policy-offices/press-office/press-rel...
2. https://www.papula-nevinpat.com/executive-summary-the-ip-sit...
3. https://www.taftlaw.com/news-events/law-bulletins/russia-iss...
> evil
In this case and context, a label like "evil" is a twisted interpretation.
> or some other country that doesn't respect international copyright though.
Like the US? OpenAI et al. don't give a shit.
There's a difference between feeding massive amounts of copyrighted material to a training process that blends them thoroughly and irreversibly, and doing all that in-house, vs. offering people a service that indexes (and possibly partially rehosts) that material, enabling and encouraging users to engage directly in pirating concrete copyrighted works.
Ironically, the low-tech infringing proposal would lead to more reliable results grounded in the raw contents of the data, using less computing power and without the confidently incorrect sycophancy we see from the LLMs.
There's this famous phrase in Russian that was born out of a short interview with a woman, a strong Putin supporter, that's often been used as a sarcastic remark for pointing out someone's double standards and/or hypocrisy.
It can be roughly translated to "you don't understand, it's a completely different situation". That's what's constantly on my mind when I'm reading discussions like this one.
Everybody and their dog torrenting petabytes of data and getting away with it (Meta is the only one that got caught and they've still gotten away with doing it)?
The very same data poor American students were forced to commit suicide over? The same data that average American housewives were sued over for millions of dollars of "damages"? The same data that often gets random German plumbers or steelworkers to pay thousands of euros of "fines" to the copyright mafia so they won't get sued and have their lives ruined?
Yet when giant corporations are doing the exact same thing on a massive scale, it's fine? It's not even the same thing: an American student torrenting books isn't making any money off it, while Meta very much is.
Of course it's not the same, a simple-minded and poorly educated person like me isn't capable of understanding the difference. You keep believing in your moral superiority, the rest of the world has finally woken up.
That's Uber's Gambit. Nothing is illegal for large enough corporations with strong network effects and deep pockets.
> that blends them thoroughly and irreversibly
It's okay, you can say 'laundering'
> > or some other country that doesn't respect international copyright though.
> Like the US? OpenAI et al. don't give a shit.
OpenAI is not a country and therefore cannot make laws that don't respect international (or domestic) copyright. Also the US is a lot bigger than OpenAI and the big tech corps, and the law is very much on the side of copyright holders in the US.
> the law is very much on the side of copyright holders in the US.
Remind me again what the status of the case is with Meta/Facebook using pirated material to train their proprietary LLMs, and even seeding the data back to the community while downloading it?
The money is definitely on the side of big tech vs. book publishers. There may be a nominal settlement to end the matter, perhaps after a decade of litigation.
LLMs already use it, dude )
I think one use would be to search for information directly from a book, rather than get a garbled/half-hallucinated version of it.
You don't need AI for that. I get the optimistic spirit of what you mean though.
Garbled/half-hallucinated is probably what you would've gotten 8-12 months ago, but nowadays I'm sure that with good prompting you can pull value from any book.