Comment by drcongo

24 days ago

That site doesn't seem to support pages loading either.

edit: I feel their pain - I've spent the past week fighting AI scrapers on multiple sites hitting routes that somehow bypass Cloudflare's cache. Thousands of requests per minute, often to URLs that have never even existed. Baidu and OpenAI, I'm looking at you.

18 comments

drcongo

comrade1234 24 days ago

Are they hitting non-existent pages? I had ip addresses scanning my personal server including hitting pages that don't exist. I had fail2ban running already so I just turned on the nginx filters (and had to modify the regexs a bit to get them working). I turned on the recididiv jail too. It's been working great.

trollbridge 24 days ago

There is currently some AI scraper that uses residential IP addresses and a variety of techniques to conceal itself that likes downloading Swagger generated docs over… and over… and over.

Plus hitting the endpoints for authentication that return 403 over and over.

mystraline 24 days ago

IP blocking Asia took my abusive scans down 95%.

I also do not have a robots.txt so google doesnt index.

Got some scanners who left a message how to index or dei dex, but was like 3 lines total in my log (thats not abusive).

But yeah, blocking the whole of Asia stopped soooo much of the net-shit.

blenderob 24 days ago
> I also do not have a robots.txt so google doesnt index.
That doesn't sound right. I don't have robots.txt too but Google indexes everything for me.
- mystraline 24 days ago
  
  https://news.ycombinator.com/item?id=46681454
  I think this is a recent change.
  
  1 reply →
Citizen_Lame 24 days ago
How did you block Asia, cloudflare or something else?
- kjs3 23 days ago
  
  You can block at your gateway/router. Lots of places have country IP ranges[1], and there are even more or less frequently updated lists of 'malicious' IP ranges[2]. Some gateway providers include 'block by country' and/or 'download blocklists automatically' as a feature.
  [1] e.g. https://github.com/ipverse/geo-ip-blocks
  [2] e.g. https://github.com/bitwire-it/ipblocklist
- mystraline 23 days ago
  
  You can download weekly IP blocks of regions.
  I import them into iptables and wholesale block them all.
  I dont deal with eastdakota's pile of shit.
allarm 21 days ago

If you block the rest you get to 100%.

ndriscoll 24 days ago

My n100 minipc can serve over 20k requests per second with nginx (well, it could, if not for the gigabit NIC limiting it). Actually IIRC it can (again, modulo uplink) do more like 40k rps for 404 or 304s.

storystarling 24 days ago

Might be worth checking if they are appending random query strings to force cache misses. Usually you can normalize the request at the edge to strip those out and protect the origin.

jen729w 24 days ago

> often to URLs that have never even existed

Oh you're so deterministic.

tommek4077 24 days ago

Why are "thousands" of requests noticable in any way? Webservers are so powerful nowadays.

SoftTalker 24 days ago
Small, cheap VPSs that are ideal for running a small niche-interest blog or forum will easily fall over if they suddenly get thousands of requests in a short time.
Look at how many sites still get "HN hugged" (formerly known as "slashdotted").
- ronsor 24 days ago
  
  I remember my first project posted to HN was hosted on a router with 32MB of RAM and a puny MIPS CPU; despite hitting the front page, it did not crash.
  At this point, I have to assume that most software is too inefficient to be exposed to the Internet, and that becomes obvious with any real load.
  
  1 reply →
drcongo 24 days ago

It's not just one scraper.