Japan's largest paper, Yomiuri Shimbun, sues Perplexity for copyright violations

11 hours ago (niemanlab.org)

So it sounds like they definitely scraped the content and used it for training, which is legal:

> Japan’s copyright law allows AI developers to train models on copyrighted material without permission. This leeway is a direct result of a 2018 amendment to Japan’s Copyright Act, meant to encourage AI development in the country’s tech sector. The law does not, however, allow for wholesale reproduction of those works, or for AI developers to distribute copies in a way that will “unreasonably prejudice the interests of the copyright owner.”

The article is almost completely lacking in details though about how the information was reproduced/distributed to the public. It could be a very cut-and-dry case where the model would serve up the entire article verbatim. Or it could be a much more nuanced case where the model will summarize portions of an article in its own words. I would need to read up on Japanese copyright law, as well as see specific examples of infringement, to be able to make any sort of conclusion.

It seems like a lot of people are very quick to jump to conclusions in the absence of any details, though, which I find frustrating.

  • > So it sounds like they definitely scraped the content and used it for training, which is legal

    It certainly seems legal to train. But the case is about scraping without permission. Does downloading an article from a website, probably violating some small print user agreement in the process, count as distribution or reproduction? I guess the court will decide.

    • According to the article, they are complaining that the downloaded content had "been used by Perplexity to reproduce the newspaper’s copyrighted articles in responses to user queries." Derived works.

    • Generally, court practice so far has been that if you don't register or log in, you never accept the user agreement. If the website is still willing to serve content to non-registered users, you're free to archive it. How you can use it afterwards is a separate question.

I don’t understand why corporations can violate copyright laws at hyper scale but individuals are banned from small scale piracy through authoritarian internet governance.

  • Two different issues IMO. Piracy is depriving someone of payment for an item for which payment was expected. Neither you nor Perplexity may pirate a DVD that you didn't buy.

    Copyright usually doesn't prevent copying per se; it's the redistribution that infringes. You, as well as Perplexity, are free to scrape public sites. You'll both be sued if you distribute what you scraped.

  • Disclaimer: I am not a lawyer, this is just my interpretation of the situation from the comments above.

    I don't have an answer to your question, which seems more general and doesn't correspond to the situation described by the article anyway: here the corporations have the right to use copyrighted materials to train their model, in the same way that you are allowed to learn from the same materials. You might even learn it by heart if you want to, but copyright laws forbid you from reproducing it, and in this instance the Japanese law tries to follow the same principle for AI models.

    How corporations should implement their training to prevent their models from reproducing the material verbatim is their problem, not the copyright holder's, in exactly the same way that if you learn an article by heart, it's on you to make sure you won't recite it to the public.

  • Maxwell Tabarrok has a take on this, basically in his words:

    > The confusion of intellectual property and property rights is fair enough given the name, but intellectual property is not a property right at all. Property rights are required because property is rivalrous and exclusive: When one person is using a pair of shoes or an acre of land, other people’s access is restricted. This central feature is not present for IP: an idea can spread to an infinite number of people and the original author’s access to it remains untouched.

    > There is no inherent right to stop an idea from spreading in the same way that there is an inherent right to stop someone from stealing your wallet. But there are good reasons why we want original creators to be rewarded when others use their work: Ideas are positive externalities.

    > When someone comes up with a valuable idea or piece of content, the welfare maximizing thing to do is to spread it as fast as possible, since ideas are essentially costless to copy and the benefits are large.

    > But coming up with valuable ideas often takes valuable inputs: research time, equipment, production fixed costs etc. So if every new idea is immediately spread without much reward to the creator, people won’t invest these resources upfront, and we’ll get fewer new ideas than we want. A classic positive externalities problem.

    > Thus, we have an interest in subsidizing the creation of new ideas and content.

    And so you can reframe whether or not IP rights should be assigned in this case, based on whether you believe that the welfare generated by making AI better by providing it with content is more valuable for society than the welfare generated by subsidizing copyright holders.

    [1] https://open.substack.com/pub/maximumprogress/p/ai-copyright...

    • There's no inherent right to anything, really. The statements in whatever declaration or philosophy are just arbitrary lines. Physical property rights are just as arbitrary as the divine right of kings (and incredibly closely related when that property is inherited!)

      The argument really isn't based on rights; it's based on the fact that the rules of the game have been that people who make things get to decide what folks do with those things via licensing agreements, except for a very small set of carve-outs that everyone knew about when they made the thing. The argument is consent. The counter-arguments are some combination of: AI training falls under one of those carve-outs; and/or it's undefined, so it should default to whatever anyone wants; and/or we should pass laws that change the rules. Most of these are about as logical as saying that if someone invented resurrection tomorrow, murder would no longer be a crime.

    • That's the standard rubric, but it doesn't actually answer the question of differential enforcement, which comes down to the usual questions: money and power.

    • I actually think physical property rights are much more problematic than copyright.

      The space of possible works is so vast, and there is such an explosion in how many texts could exist, that when someone gets exclusive use of one of these huge, almost unrepresentable numbers, you lose almost nothing.

      If someone didn't announce that they had written, let's say, Harry Potter and there was a secret law forbidding you from distributing it, that would be really bad, but it would never matter.

      Copyright infringement is a pure theft of service. You took it because it was there, because someone had already spent the effort to make it, and that was the only reason you took it.

      Land, physical property, etc. meanwhile, is something that isn't created only by human effort.

      For this reason copyright, rather than some fake pseudo-property of lower status than physical property, is actually much more legitimate than physical property.


    • >the welfare generated by making AI better by providing it with content is more valuable for society than the welfare generated by subsidizing copyright holders.

      Isn't the AI in this case also copyrighted intellectual property that benefits its owners and not the society? As far as I know, Perplexity is a private, for-profit corporation.

      I don't see how improving Perplexity's proprietary models is any more beneficial to society than YouTube blocking ad blockers.

    • You should also look at the welfare generated by showing that all are equal under the law, versus showing that companies can get away with blatant lawbreaking if they can convince people that it’s for the greater good.

      The proper way to decide this would be to pass a law in the legislature. But of course our system in general and tech companies in particular don’t work that way.

  • The law only exists for those without enough money and influence to control the enforcers.

    • They don’t even need control. It’s a version of the old saying that if you owe the bank a million dollars then you have a problem, but if you owe a billion dollars then the bank has a problem. If your company is important enough then it’s not possible (at least not politically) to punish it significantly. See also: 2008 and “too big to fail.”


  • > I don’t understand why corporations can violate copyright laws at hyper scale

    Can they, though? Isn't that why Perplexity is being sued?

  • Learning isn’t copying and copyright only restricts copying. Are you comparing cases where individuals distribute copies to cases where corporations are not distributing copies? The difference seems clear.

  • Both corporations and individuals are banned from piracy, but both corporations and individuals can violate copyright laws at hyper scale until somebody stops them. Corporations are probably more likely to get sued, but also more likely to get a lawyer instead of completely losing their head over a legal threat.

  • Transformative usage of copyrighted material is very different from people consuming content the way it was meant to be consumed, for free.

    • Is it? Bulk downloading every article of a journal is OK if I train a neural network on it later, but accessing a single one without paying is not?


  • It's the same reason Uber could run a ride service without taxi medallions and Airbnb can open home stays in your neighborhood. If there is enough money involved, the VCs in Silicon Valley know who to pay to get what they want.

  • It is because corporations can pay lawmakers for this, just as they did in the case of copyright law. Welcome to "democracy".

  • Anthropic has lawyers and buys senators; Aaron Swartz was one dude corporations could make an example of via the courts.

The fundamental problem is that everyone is expected to pitch in to help train these AIs, but only a handful of people benefit from it.

Japan's copyright laws are extremely favorable to rights holders. My understanding is that there is no fair use: without explicit permission, any reproduction or modified work is tolerated only until the holder requests a takedown.

  • The belief that it's acceptable to copy or alter copyrighted material unless the rights holder objects is merely an assertion by those who violate copyright law. Barring a few exceptions such as citation or non-commercial use without internet distribution, you are generally prohibited from using someone else's creative work without their consent.

  • From tfa:

    > Japan’s copyright law allows AI developers to train models on copyrighted material without permission. This leeway is a direct result of a 2018 amendment to Japan’s Copyright Act, meant to encourage AI development in the country’s tech sector. The law does not, however, allow for wholesale reproduction of those works, or for AI developers to distribute copies in a way that will “unreasonably prejudice the interests of the copyright owner.”

    • I wonder if you can download the copyrighted material without permission though? The article specifically states 'the scraping has been used by Perplexity to reproduce the newspaper’s copyrighted articles in responses to user queries without authorization'. They don't seem to be complaining about the training (legal), but the scraping.

    • Training a model isn't redistribution; only when you give someone a copy of the model can we think about there being a problem. At that point, you are not training, but redistributing a derived work.

    • tl;dr: If you are not directly affecting the "sales" of the product, you are good to go. But it seems Perplexity did, and is (as they might put it) directly trying to compete as a news source.

      Personally, I find their news summarization kind of misleading, with AI hallucinations in places.

If they are copying and pasting news articles on their site, that's a pretty straightforward copyright case I would think.

In the US at least this should be pretty well covered by the case law on news aggregators.

  • It was the Yomiuri Shimbun, which boasts the world's largest circulation, that established that mass reproduction of not just article bodies but even headlines is a copyright violation.

Before someone mentions Japan effectively making all data fair use for AI training, Japan specifically forbids direct recreation which is what this lawsuit is about.

> If quality news content, which underpins democracy, decreases, the public’s right to know may be hampered.

quality news content has not been a thing for a long time now, so the public will not notice any change

Original title edited for length:

> Japan’s largest newspaper, Yomiuri Shimbun, sues AI startup Perplexity for copyright violations

It's best not to crawl Japanese newspapers. Japan does not have the same kind of fair use. Even reproducing facts from a newspaper can be infringing.

  • Most of the world doesn't have fair use.

    • I suspect we'll see AI's claim to fair use be challenged even in the US. The claim to be transformative is mostly based on the "shape" of the information being delivered (i.e. the AI rephrases the information).

      However, the transformative nature of a derivative work is not only about its appearance. It also factors in whether the transformation changes the nature of the message, and whether the derivative work is in direct competition with the original work [1]. I suspect for e.g. news articles, there's a good case that people get information that way instead of going to the newspaper, which means the derivative work competes with the original. Also, when it comes to reporting news, there aren't many ways to make the message different that don't make the AI service bad.

      [1]: https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the...

I wish there were an open fund anyone could donate to, with the exclusive aim of suing Perplexity, OpenAI, and others for copyright violations. A team of lawyers would back the cases most likely to win, trying to highlight that the way such systems "learn" bears little resemblance to the intent of the law, which was written to give leeway for other artists/authors to create similar works.

  • I wish there were an open fund that let me do the opposite, where the fund would countersue copyright holders for holding development back and demanding excessive mafia payments.

  • Amazing how many copyright maximalists there are on a site called "Hacker News."

    Seems to be a fairly recent trend. Wonder what changed.

    • What changed is that copyright violation used to be something individuals did quietly, and got punished for. Now it’s something big companies are doing openly and they’re getting tons of money for it and zero consequences.

IMO the legal system is in disarray due to extreme asymmetries in how the law is selectively applied.

First of all, the way certain platforms get sued for certain activities while others are left alone is unfair and creates significant market distortions.

Then there is the fact that wealthy individuals have much better legal representation than non-wealthy individuals.

Then there are tax loopholes which create market asymmetries above that.

The word 'fair' doesn't even make sense anymore. We've got to start asking: fair for whom?

I don't know why Perplexity in particular gets everyone in a snit. It's not even particularly special: a user inputs a query, an AI model does a web search and fetches some pages on the user's behalf, and then it serves the result to the user.

Putting aside that other products, such as OpenAI's ChatGPT and modern Google Search have the same "AI-powered web search" functionality, I can't see how this is meaningfully different from a user doing a web search and pasting a bunch of webpages into an LLM chat box.

> But what about ad revenue?

The user could be using an ad blocker. If they're using Perplexity at all, they probably already are. There's no requirement for a user agent to render ads.

> But robots.txt!!!11

`robots.txt` is for recursive, fully automated requests. If a request is made on behalf of a user, through direct user interaction, then it need not follow robots.txt, and IMO shouldn't. If you really want to block a user agent, it's up to you to figure out how to serve a 403.
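To make the convention concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The bot names and paths are made up for illustration; this just shows the mechanics an automated crawler would follow (and that a user-initiated fetch would, by the convention above, skip):

```python
# Sketch of a polite crawler consulting robots.txt before a recursive
# fetch. Bot names and paths below are illustrative, not real policies.
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if this robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

robots = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(allowed_to_crawl(robots, "ExampleAIBot", "/article/1"))  # False
print(allowed_to_crawl(robots, "SomeOtherBot", "/article/1"))  # True
```

Note this is purely advisory: nothing forces a client to run this check, which is exactly why the comment above says blocking ultimately has to happen server-side with a 403.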

> It's breaking copyright by reproducing my content!

Yes, so does the user's browser. The purpose of a user agent is to fetch and display content how the user wants. The manner in which that is done is irrelevant.

  • Well, some bots even spoof their User-Agent, hammering sites with requests without any proper rate limiting (looking at you, ByteSpider).

    Nobody played fair, even before the LLMs, so now we get proof-of-work challenges everywhere.

    And what is the conclusion? Since ad blockers are used everywhere, it's OK for corporations to skip licensing, just yank the content, and put it into a curation service, especially one without ads? That's a licensing issue. The author allowed you to view the article in exchange for monetary support (i.e. ads); they didn't allow you to reproduce and republish the work by default.

    Also, calling the browser itself "reproducing"? Yes, the data gets copied in memory, but I wouldn't call that reproducing the material, more a transfer from the server; redistribution is the main point here.

    It's like saying "part of a variable is replicated from the L2 cache into a register, so the whole file in DRAM is authorized to be reproduced." If mere copying counts as reproduction that "should not happen in the first place," it can't be prevented unless you invent a non-Turing computer that doesn't use working memory.

    • The only reason you can say "looking at you ByteSpider" is that it identifies itself. In 2025, that qualifies it as a nice bot.

      The nasty bots make a single access from an IP, and don't use it again (for your server), and are disguised to look like a browser hit out of the blue with few identifying marks.

  • There's a difference between what is technically feasible and what is allowed, legally or even morally.

    Just because it is possible -- or even easy -- to essentially steal from newspapers/other media outlets, doesn't make it right, or legal. The people behind it put in labor, financial resources, and time to create a product that, like almost every other service, has terms attached -- and those usually come with some form of monetization. Maybe it is a paywall, maybe it is advertisements -- but it is there.

    Using an adblocker, or finding some loophole around a paywall, etc, are all very easy to do technically, as any reader of this site knows. That said, the media outlet doesn't have to allow it. And when it is violated on an industrial scale, like Perplexity, then they can be understandably upset and take legal action. And that includes any AI (or other technology, for that matter) that is a wrapper around plagiarism.

    Sites opted in to Google originally because it fed them traffic. They most likely did not opt in to an AI rewriter that takes their work and republishes it without any compensation.