Comment by Joel_Mckay
1 day ago
> 1. I scrape the whole internet onto my disk.
This is illegal under theft-of-service laws, and a violation of most sites' terms of service. If these spider scrapers respected the robots exclusion standard under its intended use case for search engines, then getting successfully sued for overt copyright piracy and quietly settling for billions would seem unfair.
Note too, currently >52% of the web is LLM-generated slop, so any model trained on that output will inherit similar problems.
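For anyone unsure what respecting the robots exclusion standard looks like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the example.com URLs, and the `is_allowed` helper are all made up for illustration; a real crawler would fetch the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# "ExampleBot" and the URLs below are illustrative assumptions.
allowed_public = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/articles/1")
allowed_private = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

A polite spider would run a check like this before every request and skip anything disallowed.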
> 2. I go through the text, and gather every word bigram, and build a frequency table.
And when (not if) a copyrighted work is plagiarized without citation, it is academic misconduct, IP theft, and an artistic counterfeit. Copyright law is odd, and often doesn't make a distinction about the origin of similar works. Note this part of the law was extended to private individuals this year:
"OpenAI Stole Scarlet Johansson's Voice"
https://www.youtube.com/watch?v=YhgYMH6n004
> 3. I delete everything I scraped.
This doesn't matter if the output violates copyright. Transforming a work into another representation does not remove its copyright: images in JPEG format are compressed in the frequency domain, have been around for ages, and still get people sued or jailed regularly.
Academic evaluation usually does fall under a fair-use exception, but the instant someone sells or uses IP in some form of trade/promotion, it becomes a copyright violation.
> 4. I use that frequency table
See above: the "how it is made" argument is 100% BS. The statistical salience of an LLM simply can't prevent plagiarism and copyright violations. This was cited in the original topic links.
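As a toy illustration of steps 2 to 4, here is a minimal sketch of a word-bigram frequency table and a greedy generator in Python. The function names and the sample sentence are assumptions made up for this example, and a bigram table is of course far simpler than an LLM. The point it shows is narrow: the source text can be deleted after the table is built, yet the table alone can still reproduce that text word for word.

```python
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count how often each word is immediately followed by each other word."""
    words = text.split()
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table

def generate(table, start, max_words=50):
    """Greedy generation: always append the most frequent successor."""
    out = [start]
    while len(out) < max_words and out[-1] in table:
        out.append(table[out[-1]].most_common(1)[0][0])
    return " ".join(out)

# Made-up stand-in for a scraped work; every word occurs exactly once.
source = "the quick brown fox jumps over a lazy dog"
table = build_bigram_table(source)

# Only the table is used from here on, as in step 3 ("delete everything").
regenerated = generate(table, "the")
print(regenerated == source)
```

With a larger and more varied corpus the output would usually diverge from any single source, but nothing in the method itself rules out verbatim reproduction.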
> 5. I profit from this text generator.
Since this content may inject liabilities into commercial settings, only naive fools will use this in a commercial context. Most "AI" companies lose around $4.50 per new customer, and are an economic fiction driven by some very silly people.
LLM businesses are simply an unsustainable exploit. Unfortunately they also proved wealthy entities can evade laws through regulatory capture, and by settling the legal problems they couldn't avoid.
I didn't make the rules, but I do disagree that cleverness supersedes a just rule of law. Have a wonderful day =3