Comment by ronsor

14 hours ago

I don't know why Perplexity in particular gets everyone in a nit. It's not even particularly special: a user inputs a query, an AI model does a web search and fetches some pages on the user's behalf, and then it serves the result to the user.

Putting aside that other products, such as OpenAI's ChatGPT and modern Google Search, have the same "AI-powered web search" functionality, I can't see how this is meaningfully different from a user doing a web search and pasting a bunch of webpages into an LLM chat box.

> But what about ad revenue?

The user could be using an ad blocker. If they're using Perplexity at all, they probably already are. There's no requirement for a user agent to render ads.

> But robots.txt!!!11

`robots.txt` is for recursive, fully automated crawling. If a request is made on behalf of a user, through direct user interaction, then it need not be followed, and IMO shouldn't be. If you really want to block a user agent, it's up to you to figure out how to serve a 403.
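If you do want to serve that 403 yourself, matching on the `User-Agent` header is the obvious starting point. A minimal sketch using Python's stdlib `http.server` (the agent names in `BLOCKED_AGENTS` are hypothetical examples, and this only catches bots that identify themselves honestly):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical User-Agent substrings to refuse; this only works
# against bots that identify themselves honestly in the header.
BLOCKED_AGENTS = ("PerplexityBot", "ByteSpider")

def should_block(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked substring."""
    return any(marker in user_agent for marker in BLOCKED_AGENTS)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if should_block(self.headers.get("User-Agent", "")):
            self.send_error(403, "Forbidden")  # refuse the blocked agent
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human (or honest agent).\n")
```

To run it: `HTTPServer(("", 8080), Handler).serve_forever()`. Of course, as noted below, the nastier bots just spoof a browser User-Agent, so this is a courtesy fence, not a wall.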

> It's breaking copyright by reproducing my content!

Yes, so does the user's browser. The purpose of a user agent is to fetch and display content how the user wants. The manner in which that is done is irrelevant.

> I don't know why Perplexity in particular gets everyone in a nit

I suspect they seem easier to sue than OpenAI, Anthropic, Meta, Google, and literally anything coming out of China.

Well, some bots even spoof User-Agents, firing off tons of requests without any proper rate limiting (looking at you, ByteSpider).
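Per-client rate limiting of the kind ByteSpider ignores server-side hints about is commonly implemented as a token bucket keyed on the client address, so spoofing the User-Agent doesn't help. A minimal, illustrative sketch (class and parameter names are my own, not any particular server's API):

```python
import time

class TokenBucket:
    """Per-client token bucket: each client accrues `rate` tokens per
    second up to `capacity`; each request spends one token or is refused."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = {}  # client -> remaining tokens
        self.last = {}    # client -> timestamp of last request

    def allow(self, client: str, now=None) -> bool:
        if now is None:
            now = time.monotonic()
        elapsed = now - self.last.get(client, now)
        self.last[client] = now
        # Refill proportionally to elapsed time, capped at capacity,
        # then try to spend one token for this request.
        self.tokens[client] = min(
            self.capacity,
            self.tokens.get(client, self.capacity) + elapsed * self.rate,
        )
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False
```

A front-end proxy would call `allow(client_ip)` on each request and answer 429 when it returns False; the `now` parameter is just for deterministic testing.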

Nobody was playing fair, even before the LLMs, which is why we now see PoW challenges everywhere.

And what is the conclusion there? That since ad blockers are used everywhere, it's OK for corporations to skip licensing content directly and just yank it into a curation service, especially one without ads? That's a licensing issue: the author allowed you to view the article in exchange for monetary support (i.e., ads); they didn't allow you to reproduce and republish the work by default.

Also, calling the browser itself "reproducing"? Yes, the data may be copied in memory (though I wouldn't call that reproducing the material; it's more like a transfer from the server to another machine), but redistribution is the main point here.

It's like saying, "well, part of the variable is replicated from the L2 cache to a register, so reproducing the whole file in DRAM is authorized." If your standard is "it's reproducing and should not be reproduced in the first place," that can't be enforced unless you bring in non-Turing computers that don't use active memory.

  • The only reason you can say "looking at you ByteSpider" is that it identifies itself. In 2025, that qualifies it as a nice bot.

    The nasty bots make a single access from an IP, and don't use it again (for your server), and are disguised to look like a browser hit out of the blue with few identifying marks.

There's a difference between what is technically feasible and what is allowed, legally or even morally.

Just because it is possible -- or even easy -- to essentially steal from newspapers/other media outlets, doesn't make it right, or legal. The people behind it put in labor, financial resources, and time to create a product that, like almost every other service, has terms attached -- and those usually come with some form of monetization. Maybe it is a paywall, maybe it is advertisements -- but it is there.

Using an ad blocker, or finding some loophole around a paywall, etc., is all very easy to do technically, as any reader of this site knows. That said, the media outlet doesn't have to allow it. And when those terms are violated on an industrial scale, as with Perplexity, the outlets can be understandably upset and take legal action. That includes against any AI (or other technology, for that matter) that is a wrapper around plagiarism.

Sites opted in to Google originally because it fed them traffic. They most likely did not opt in to an AI rewriter that takes their work and republishes it without any compensation.