Comment by freetime2

11 hours ago

So it sounds like they definitely scraped the content and used it for training, which is legal:

> Japan’s copyright law allows AI developers to train models on copyrighted material without permission. This leeway is a direct result of a 2018 amendment to Japan’s Copyright Act, meant to encourage AI development in the country’s tech sector. The law does not, however, allow for wholesale reproduction of those works, or for AI developers to distribute copies in a way that will “unreasonably prejudice the interests of the copyright owner.”

The article is almost completely lacking in details, though, about how the information was reproduced/distributed to the public. It could be a very cut-and-dry case where the model would serve up the entire article verbatim. Or it could be a much more nuanced case where the model will summarize portions of an article in its own words. I would need to read up on Japanese copyright law, as well as see specific examples of infringement, to be able to draw any sort of conclusion.

It seems like a lot of people are very quick to jump to conclusions in the absence of any details, though, which I find frustrating.

> So it sounds like they definitely scraped the content and used it for training, which is legal

It certainly seems legal to train. But the case is about scraping without permission. Does downloading an article from a website, probably violating some small print user agreement in the process, count as distribution or reproduction? I guess the court will decide.

  • According to the article, they are complaining that the downloaded content had "been used by Perplexity to reproduce the newspaper’s copyrighted articles in responses to user queries." Derived works.

  • Generally, court practice so far has been that if you don't register or log in, you never accept the user agreement. If the website is still willing to serve content to non-registered users, you're free to archive it. How you can use it afterwards is a separate question.

LLMs are able to reproduce the entire IP. Sometimes it requires more than one prompt. I've seen examples in the wild where a single prompt was sufficient:

https://jskfellows.stanford.edu/theft-is-not-fair-use-474e11...

Therefore, their output is a derivative work and violates copyright. The 2018 amendment is driven by big capital and should be reverted. Machines can plagiarize at huge scale and should have no human rights.

  • I'm aware of the fact that LLMs can reproduce IP used in training data, and consider the example NYT article in your link to be "a very cut-and-dry case" of copyright infringement. And commercial AI companies especially should be held liable for damages if they can't or won't implement effective guardrails to prevent this from happening.

    I'm somewhat optimistic this problem can be solved, though, with filters and usage policies. YouTube, another platform with basically unlimited potential for copyright infringement, has managed to implement a system that is good enough at preventing infringement to keep lawsuits at bay.

    It's also not clear if that's what Yomiuri Shimbun is alleging here. In their 2023 "Opinion on the Use of News Content by Generative AI" [1] they give this example:

    > Newspaper companies have long provided databases containing past newspaper pages and articles for a fee, and in recent years, they have also sold article data for AI development. If AI imports large quantities of articles, photos, images, and other data from news organizations’ digital news sites without permission, commercial AI services for third parties developing it could conflict with the existing database sales market and “unreasonably prejudice the interests of the copyright owner” (Article 30-4 of the Act). Also, even if all or part of a particular article communicates nothing further than facts and hardly constitutes a copyright, many contents deserve legal protection because of the effort and cost invested by the newspaper companies. Even if an AI collects and uses only the factual part, it does not mean it will always be legal.

    So basically they are arguing that the 2018 amendment, which allows the use of copyrighted works to train AI models without permission from the copyright holder, is not applicable because the use "would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation". [2]

    ... which I think is a much more nuanced argument. I don't think we can just lump all of these cases together and say "it's infringement" or "it's fair use" without actually considering the details in each case. Or the specific laws in each country.

    [1] https://www.pressnet.or.jp/statement/20230517_en.pdf

    [2] https://www.cric.or.jp/english/clj/cl2.html