Comment by tommek4077

25 days ago

How do they get overloaded? Is the website too slow? I have a quite big wiki online and barely see any impact from bots.

16 comments

tommek4077

stinky613 25 days ago

A year or two ago I personally encountered scraping bots that were scraping every possible resultant page from a given starting point. So if it scraped a search results page it would also scrape every single distinct combination of facets on that search (including nonsensical combinations e.g. products that match the filter "products where weight<2lbs AND weight>2lbs")

We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)

chao- 25 days ago

I have encountered this same issue with faceted search results and individual inventory listings.

switz 25 days ago

If you have a lot of pages, AI bots will scrape every single one on a loop - wiki's generally don't have anywhere near the number of pages as an incremented entity primary id. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance and they're basically just scraping garbage (statistics pages of historical matches or user pages that have essentially no content).

Many of them don't even self-identify and end up scraping with shrouded user-agents or via bot-farms. I've had to block entire ASNs just to tone it down. It also hurts good-faith actors who genuinely want to build on top of our APIs because I have to block some cloud providers.

I would guess that I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request - and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0]. Keep in mind that they're hitting deeply cold links so caching doesn't do a whole lot here.

[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate

tommek4077 25 days ago
How many requests per second do you get? I also see a lot of bot traffic but nowhere near to hit the servers significantly, and i render most stuff on the server directly.
- switz 23 days ago
  
  Around a hundred per second at peak. Even though my server can handle it just fine, it muddies up the logs and observability for something I genuinely do not care about at all. I only care about seeing real users' experience. It's just noise.
account42 24 days ago

Even moderately sized wikis have a huge number of different page versions which can all be accessed individually.

roblh 25 days ago

There’s a lot of factors. Depends how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast. Especially if you weren’t fortunate enough to inherit a project using a framework that makes server side rendering easy.

blell 25 days ago

In these discussions no one will admit this, but the answer is generally yes. Websites written in python and stuff like that.

Qwertious 25 days ago

It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.
tclancy 25 days ago
Yes yes, definitely people don’t know what they’re doing and not that they’re operating on a scale or problem you are not. Metabrainz cannot cache all of these links as most of them are hardly ever hit. Try to assume good intent.
- tommek4077 25 days ago
  
  But serving HTML is unbelievably cheap, isn't it?
  
  2 replies →

j16sdiz 25 days ago

The worse thing is calendar/schedule. Many crawler tries to load every single day, with day view, week view and month view. Those pages are dynamically generated and virtually limitless

kpcyrd 25 days ago

The API seems to be written in Perl: https://github.com/metabrainz/musicbrainz-server

jjgreen 25 days ago

Time for a vinyl-style Perl revival ...