Comment by hartator

2 days ago

There are already “infinite” websites like these on the Internet.

Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

Unknown websites will get very few crawls per day, whereas popular sites get millions.

Source: I am the CEO of SerpApi.

Looking at the logs for all of my sites, this isn't a global truth. I see multiple AI crawlers hammering away, requesting the same pages many, many times. Perplexity and Facebook are basically nonstop.

  • I just looked at the logs for a site, and I saw that PerplexityBot is fetching the robots.txt and then ignoring it. They don't provide a list of IPs to verify whether it is actually them. Anyway, anyone with PerplexityBot in their user agent can get increasingly bad responses until the abuse stops (a rough sketch of what that could look like is below this sub-thread).

    • Perplexity is exceptionally bad because they say they respect robots.txt but clearly don't. When pressed on it, they basically shrug and say too bad, don't put stuff in public if you don't want it crawled. They got a UA block in Cloudflare, and it seems like that did the trick.

      2 replies →
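    Since the "increasingly bad responses" idea comes up here, a minimal sketch of what it could look like, assuming a Python WSGI stack; the UA substring, the thresholds, and the in-memory counter are all illustrative, not anything the commenters describe:

      import time
      from collections import defaultdict

      BAD_UA = "PerplexityBot"   # substring matched against the User-Agent header
      hits = defaultdict(int)    # per-IP hit counter; use Redis or similar in production

      class DegradeBadBots:
          """WSGI middleware that slows down and eventually blocks a misbehaving UA."""
          def __init__(self, app):
              self.app = app

          def __call__(self, environ, start_response):
              ua = environ.get("HTTP_USER_AGENT", "")
              ip = environ.get("REMOTE_ADDR", "unknown")
              if BAD_UA in ua:
                  hits[ip] += 1
                  n = hits[ip]
                  if n > 1000:
                      # Sustained abuse: hard stop with empty 403s.
                      start_response("403 Forbidden", [("Content-Type", "text/plain")])
                      return [b""]
                  if n > 100:
                      # Escalating delay plus 429s before cutting the crawler off.
                      time.sleep(min(10, n / 100))
                      start_response("429 Too Many Requests", [("Retry-After", "3600")])
                      return [b"slow down"]
              return self.app(environ, start_response)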

Even a brand new site will get hit heavily by crawlers: Amazonbot, Applebot, LLM bots, scrapers abusing FB's link-preview bot, SEO-metric bots, and more than a few crawlers out of China. The desirable, well-behaved crawlers are the only ones who might lose interest.

The typical entry point is a sitemap or RSS feed.

Overall I think the author is misguided in using the tarpit approach. Slow sites get fewer crawls. I would suggest using easily GZIP'd content and deeply nested tags instead. There are also tricks with XSL, but I doubt many mature crawlers will fall for that one.
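As a rough illustration of the compressible-content idea (the nesting depth and the markup are arbitrary choices for the sketch, not something the comment prescribes): deeply nested, repetitive HTML costs almost nothing to serve gzip'd, but is disproportionately expensive for a crawler to parse.

    import gzip

    def nested_page(depth=10_000):
        # 10k nested <div>s compress to a few KB on the wire but balloon in any DOM parser.
        body = "<div>" * depth + "deeper and deeper" + "</div>" * depth
        html = f"<!doctype html><html><body>{body}</body></html>"
        return gzip.compress(html.encode("utf-8"))

    payload = nested_page()
    print(f"compressed size: {len(payload)} bytes")  # serve with Content-Encoding: gzip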

> Unknown websites will get very few crawls per day whereas popular sites millions.

We're hosting some pretty unknown, very domain-specific sites and are getting hammered by Claude and others which, compared to old-school search engine bots, also get caught up in the weeds and request the same pages over and over.

They also don't seem to care about the response time of the pages they fetch: when they get caught in the weeds and hit some badly performing edge cases, they don't throttle at all and keep requesting at 30+ requests per second, even when a page takes more than a second to return.

We can of course handle this and make them go away, but in the end this behavior will only hurt them: they will face more and more opposition from webmasters, and they are wasting their own resources.

For decades, our solution for search engine bots was basically an empty robots.txt and letting the bots deal with our sites. Bots behaved reasonably and intelligently enough that this was a working strategy.

Now, in light of the current AI bots, which from an outside observer's viewpoint look like they were cobbled together with the least effort possible, this strategy is no longer viable, and we would have to resort to providing a meticulously crafted robots.txt to help each hacked-up AI bot individually not get lost in the weeds.

Or, you know, we just blanket ban them.
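For reference, a blanket ban along those lines might look roughly like the following; the user-agent tokens are ones the major AI vendors document today, but the list goes stale quickly, so treat it as illustrative rather than complete. It also only helps against bots that actually honor robots.txt, so it is usually paired with user-agent or IP blocks at the edge.

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: PerplexityBot
    User-agent: CCBot
    User-agent: Google-Extended
    User-agent: Bytespider
    User-agent: meta-externalagent
    Disallow: /

    User-agent: *
    Disallow: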

  • The fact that AI bots seem like they were cobbled together with the least effort possible might be related. The people responsible for these bots might have zero experience writing an old-school search engine bot and no idea of the kind of edge cases that would be encountered. They might just turn to LLMs to write their bot code, which is not exactly a recipe for success.

Yeah, I agree with this. These types of roach motels have been around for decades and are at this point well understood and not much of a problem for anyone. You basically need to be able to deal with them to do any sort of large-scale crawling.

The reality of web crawling is that the web is already extremely adversarial, and any crawler will get every imaginable kind of nonsense thrown at it: TCP tarpits, compression and XML bombs, and so on. Really, there's no end to what people will put online.

A more resource-effective technique for blocking misbehaving crawlers is to put a hidden link on each page, pointing to some path forbidden via robots.txt, perhaps randomly generated so the links are always unique. When that link is fetched, the server immediately drops the connection and blocks the IP for some time period.
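A minimal sketch of that honeypot approach, using Flask and an in-memory ban list purely for illustration; the trap path, the ban duration, and the 403 response standing in for "drop the connection" are all assumptions, and a real setup would push the block down to the firewall or reverse proxy:

    import secrets
    import time

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = {}                  # ip -> timestamp when the ban expires
    BAN_SECONDS = 24 * 3600
    TRAP_PREFIX = "/trap/"       # also disallowed in robots.txt below

    @app.before_request
    def reject_banned():
        # Refuse everything from an IP that has tripped the trap recently.
        if time.time() < banned.get(request.remote_addr, 0):
            abort(403)

    @app.route("/robots.txt")
    def robots():
        return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

    @app.route(TRAP_PREFIX + "<token>")
    def trap(token):
        # Only a client ignoring robots.txt ever requests this path.
        banned[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)

    @app.route("/")
    def page():
        # A unique, invisible trap link on every page, so the URL is never cached or shared.
        token = secrets.token_hex(8)
        return (f'<html><body>actual content'
                f'<a href="{TRAP_PREFIX}{token}" style="display:none"></a></body></html>')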

This may be true for large, established crawlers from Google, Bing, et al. I don’t see how you can make this a blanket statement for all crawlers, and my own personal experience tells me this isn’t correct.

  • These things are so common that having some way of dealing with them is basically mandatory if you plan on doing any sort of large-scale crawling.

    That said, crawlers are fairly bug-prone, so misbehaving crawlers are also a relatively common sight. It's genuinely difficult to properly test a crawler, and useless to build it from specs, since the realities of the web are so far off the charted territory that any test you build is testing against something far removed from what you'll actually encounter. With real web data, the corner cases have corner cases, and the HTTP and HTML specs are but vague suggestions.

    • I am aware of all of the things you mention (I've built crawlers before).

      My point was only that there are plenty of crawlers that don't operate in the way the parent post described. If you want to call them buggy, that's fine.

> There are already “infinite” websites like these on the Internet.

Cool. And how much of the software driving these websites is FOSS that I can download and run for my own (popular enough to be crawled more than daily by multiple scrapers) website?

A brand new site with no users gets 1k requests a month from bots; the CO2 cost must be atrocious.

  • > A brand new site with no users gets 1k requests a month from bots; the CO2 cost must be atrocious.

    Yep: https://www.energy.gov/articles/doe-releases-new-report-eval...:

    > The report finds that data centers consumed about 4.4% of total U.S. electricity in 2023 and are expected to consume approximately 6.7 to 12% of total U.S. electricity by 2028. The report indicates that total data center electricity usage climbed from 58 TWh in 2014 to 176 TWh in 2023 and estimates an increase between 325 to 580 TWh by 2028.

    A graph in the report says data centers used 1.9% in 2018.