Comment by dawnerd

2 days ago

Perplexity is exceptionally bad because they say they respect robots.txt but clearly don't. When pressed on it, they basically shrug and say too bad, don't put stuff in public if you don't want it crawled. They got a UA block in Cloudflare, and it seems like that did the trick.

User Agent block just means they'd spoof their user agent.

  • That generally gives you even more trouble with Cloudflare. Behaving in any way inconsistent with your UA string is one of the easiest ways to get identified as a bot.

    Yeah you can use headless browsers, but then you're also using orders of magnitude more compute, and that's not really something that scales.

    The best way to avoid ending up in captcha-land is to say who you are, and respect robots.txt.
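
    For anyone wondering what "say who you are, and respect robots.txt" looks like in practice, here is a rough sketch using Python's standard-library `urllib.robotparser`. The bot name, the info URL, and the sample robots.txt are all made up for illustration; a real crawler would download the site's actual robots.txt and send the same UA string on every request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity: a name plus a URL where site owners can
# read about the bot. Both are placeholders, not a real crawler.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

# Made-up robots.txt, inlined so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str,
              user_agent: str = USER_AGENT) -> bool:
    """Return True only if the rules allow this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS)
    for url in ("https://example.com/articles/1",
                "https://example.com/private/report"):
        verdict = "fetch" if may_fetch(rules, url) else "skip"
        print(f"{verdict}: {url}")
```

    The point is that the check happens before any request goes out, and the UA used for the check is the same one the crawler actually sends.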