Messing with scraper bots

16 hours ago (herman.bearblog.dev)

The more things change, the more they stay the same.

About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity.

Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests.

The scraping stopped within two days and never came back.

--

[0] Random but deterministic based on post ID, so the injected text stayed consistent.

[1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites.

[2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent or static strings that didn't exactly match what the real browser would send.
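
A minimal sketch of the filter described above, assuming a Python rendering path. The brand list, the `inject_brands` name and the `rate` parameter are all hypothetical; the only property taken from the description is that the output is random-looking but fixed per post ID.

```python
import hashlib
import random

# Hypothetical brand list; the original used ~100 consumer brands plus the
# monitoring services' own advertised clients.
BRANDS = ["Acme Cola", "Globex Shoes", "Initech Phones"]


def inject_brands(post_id: int, text: str, rate: float = 0.2) -> str:
    """Insert brand names between words, deterministically per post ID."""
    # Seed a private RNG from the post ID so the same post always gets the
    # same injected text, no matter when or how often it is rendered.
    digest = hashlib.sha256(str(post_id).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(BRANDS))
    return " ".join(out)
```

The filter would only be applied to requests already identified as the bot; everyone else gets the untouched post.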

  • In the movie The Imitation Game, the Alan Turing character recognizes that acting 100% of the time gives away to the opposition that you identified them and sets off the next iteration of “cat and mouse”. He comes up with a specific percentage of the time that the Allies should sit on the intelligence and not warn their own people.

    If, instead, you only act on a percentage of requests, you can add noise in an insidious way without signaling that you caught them. It will make their job of troubleshooting and crafting the next iteration much harder. Also, making the response less predictable is a good idea - throw different HTTP error codes, respond with somewhat inaccurate content, etc.
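
    One way to sketch this idea, assuming Python and an upstream check that already flags suspected bots. The action names and weights are illustrative assumptions, not tuned values.

```python
import random

# Hypothetical response mix for suspected bots: most requests still pass,
# the rest get a varied, hard-to-diagnose failure.
RESPONSES = [("pass", 0.6), ("error_500", 0.1), ("error_404", 0.1), ("garbled", 0.2)]


def choose_action(is_suspected_bot: bool, rng: random.Random) -> str:
    """Pick what to do with a request; unsuspected traffic always passes."""
    if not is_suspected_bot:
        return "pass"
    actions = [action for action, _ in RESPONSES]
    weights = [weight for _, weight in RESPONSES]
    return rng.choices(actions, weights=weights, k=1)[0]
```

    Because suspected bots still succeed most of the time, the failures look like flakiness rather than a block.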

  • Thank you very much for the observation about headers. I just looked closer at the bot traffic I'm currently receiving on my small fediverse server and noticed that it uses user agents of old Chrome versions, and also that the Accept-Language header is never set, which is indeed something that no real Chromium browser would do. So I added a rule to my nginx config to return a 403 to these requests. The number of these requests per second seems to have started declining.

    • It's been a few hours. These particular bots have completely stopped. There are still some bot-looking requests in the log, with a newer-version Chrome UA on both Mac and Windows, but there aren't nearly as many of them.

      Config snippet for anyone interested:

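          # Start from an empty flag so $block is always defined
          set $block "";
          # Claims to be Chrome...
          if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
            set $block 1;
          }
          # ...but sends no Accept-Language header
          if ($http_accept_language = "") {
            set $block "${block}1";
          }
          # Both conditions matched: reject the request
          if ($block = "11") {
            return 403;
          }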

    • That's a simple and effective way to block a lot of bots, gonna implement that on my sites. Thanks!

  • Why do the company names chase away bots? Is it just that you’re destroying their signal because they’re looking for mentions of those brands?

    • It’s both a destruction of signal and an injection of noise. Imagine you worked for Adidas and you started getting a stream of notifications about your brand, and they were all nonsense. This would be an annoyance and harm the reputation of that monitoring service.

      They would have received multiple complaints about it from customers, performed an investigation, and ultimately performed a manual excision of the junk data from their system: both the raw scrapes and anywhere it was ingested and processed. This was probably a simple operation, but might not have been if their architecture didn’t account for this vulnerability.

    • I also didn't follow that part. Their step 2 seems to be a general-purpose bot detection strategy that works independently of their step 1 ("randomly mention companies").

  • The vast majority of bots are still failing the header test - we organically arrived at the exact same filtering in 2025. The bots followed the exact same progression too. One IP, lie about the user agent, one ASN, multiple ASNs, then lie about everything and use residential IPs, but still botch the headers.
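
    A rough sketch of such a header test in Python. It relies on the assumption made in this thread that a real Chromium browser always sends Accept-Language and Accept-Encoding; the function name is hypothetical.

```python
def looks_like_fake_browser(headers: dict[str, str]) -> bool:
    """Flag requests that claim a Chrome UA but lack headers Chrome sends."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    if "Chrome/" not in h.get("user-agent", ""):
        return False  # this sketch only covers clients claiming to be Chrome
    # Assumption from the thread: real Chromium browsers always send these.
    required = ("accept-language", "accept-encoding")
    return any(not h.get(name, "").strip() for name in required)
```

    A production filter would compare many more headers, and their exact values, against what each claimed browser version really sends.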

This is a fundamental misunderstanding of what those bots are requesting. They aren’t parsing those PHP files, they are using their existence for fingerprinting: they are trying to determine the existence of known vulnerabilities. They probably stop reading immediately after receiving an HTTP response code and discard the rest of the response.

  • You're right, something like fail2ban or crowdsec would probably be more effective here. Crowdsec has made it apparent to me how much vulnerability probing is done; it's a bit shocking for a low-traffic host.

    • And you'd ban the IP, their one-day lease on the VM+IP would expire, and someone else would get the same IP on a new VM and be blocked from everywhere.

      It would be workable to ban the IP for just a few hours, to have the bot cool down for a bit and move on to the next domain.
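
      A minimal sketch of such a short-lived ban in Python. The class name and the three-hour default are assumptions; fail2ban and crowdsec have their own ban-time settings that do this for real.

```python
import time


class TempBanList:
    """Bans IPs for a fixed duration, then forgets them."""

    def __init__(self, ban_seconds: float = 3 * 3600, clock=time.monotonic):
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._expires = {}  # ip -> time at which the ban ends

    def ban(self, ip: str) -> None:
        self._expires[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip: str) -> bool:
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Ban lapsed: whoever holds the IP next is not penalised.
            del self._expires[ip]
            return False
        return True
```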

  • It would be such a terrible thing if some LLM scrapers were using those responses to learn more about PHP, especially because of that recent paper pointing out it doesn't take that many data points to poison LLMs.

Neat! Most of the offensive scrapers I've met try to exploit WordPress sites (hence the focus on PHP). They don't want to see the PHP files themselves, but their output.

What you have here is quite close to a honeypot. Sadly, I don't see an easy way to counter-abuse such bots. If the attack is not following their script, they move on.

  • Yeah, I bet they run a regex on the output and if there's no admin logon thingie where they can run exploits or stuff credentials they'll just skip.

    So as battles of efficiency go, generating 4 KB of bullshit PHP is more work than running a regex.

I remember when you used to get scolded on HN for preventing scrapers or bots. "How I access your site is irrelevant".

  • It's different. I'm fine with someone scraping my website as a good citizen, by identifying themselves in their user-agent string and preferably respecting robots.txt. I'm not, however, fine with tens of requests per second to every possible URL from random IPs I'm receiving right now, all pretending to be different old versions of Chrome.

  • There's this and that. "How I [i.e. an individual human looking for myself] access your site is irrelevant." and "How I [i.e. an AI company DDOSing (which is illegal in some places btw) trying to maximize profit and offloading cost to you] access your site is irrelevant."

    When you get paid big bucks to make the world worse for everyone, it's really easy to forget "little details".

  • As an academic, I have a side project that scrapes a couple of academic job sites in my field and then serves the results as a static HTML page. It runs on GitHub Actions and makes exactly one request every 24 hours. It is useful for me and a couple of people in my circle. I would consider this to be fine and within reasonable expectations. Many projects rely on such scenarios and people share them all the time.

    It is completely different if I am hitting it looking for WordPress vulnerabilities or scraping content every minute for LLM training material.

  • To me that's one of the most depressing developments about AI (which is chock-full of depressing developments): that its mere existence is eroding long-held ethics, not even necessarily out of a lack of commitment but out of practical necessity.

    The tech people are all turning against scraping, independent artists are now clamoring for brutal IP crackdowns and Disney-style copyright maximalism (which I never would've predicted just 5 years ago, that crowd used to be staunchly against such things), people everywhere want more attestation and elimination of anonymity now that it's effectively free to make a swarm of convincingly-human misinformation agents, etc.

    It's making people worse.

If you control your own Apache server and just want to shortcut to "go away" instead of feeding scrapers, the RewriteEngine is your friend, for example:

      RewriteEngine On

      # Block requests that reference .php anywhere (path, query, or encoded)
      RewriteCond %{REQUEST_URI} (\.php|%2ephp|%2e%70%68%70) [NC,OR]
      RewriteCond %{QUERY_STRING} \.php [NC,OR]
      RewriteCond %{THE_REQUEST} \.php [NC]
      RewriteRule .* - [F,L]

Notes: there's no PHP on my servers, so if someone asks for it, they are one of the "bad boys" IMHO. Your mileage may differ.
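
For anyone not on Apache, the same idea can be sketched in application code. This is a rough Python analogue, not a translation of the rules above: the function name is hypothetical, and repeatedly decoding percent-escapes is my own assumption about how to catch encoded variants.

```python
from urllib.parse import unquote


def references_php(request_target: str) -> bool:
    """Return True if the path or query mentions .php, even percent-encoded."""
    decoded = request_target
    # Decode a few rounds so double-encoded forms such as %252ephp are caught.
    for _ in range(3):
        step = unquote(decoded)
        if step == decoded:
            break
        decoded = step
    return ".php" in decoded.lower()
```

A handler would answer 403 whenever this returns True, which only makes sense on a server that genuinely serves no PHP.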

I had to revisit my strategy after posting about my zipbombs on HN [0]. My server traffic went from tens of thousands to ~100k daily, hosted on a $6 vps. It was not sustainable.

Now I target only the most aggressive bots with zipbombs and the rest get a 403. My new spam strategy seems to work, but I don't know if I should post it on HN again...

[0]: https://news.ycombinator.com/item?id=43826798

This reminds me of a recent discussion about using a tarpit for A.I. and other scrapers. I've kept a tab alive with a reference to a neat tool and approach called Nepenthes that VERY SLOWLY drip feeds endless generated data into the connection. I've not had an opportunity to experiment with it as yet:

https://zadzmo.org/code/nepenthes/

I always had fail2ban but a while back I wanted to set up something juicier...

.htaccess diverts suspicious paths (e.g., /.git, /wp-login) to decoy.php and forces decoy.zip downloads (10GB), so scanners hitting common “secret” files never touch real content and get stuck downloading a huge dummy archive.

decoy.php mimics whatever sensitive file was requested by endless streaming of fake config/log/SQL data, keeping bots busy while revealing nothing.

They’re not scraping for php files, they’re probing for known vulns in popular frameworks, and then using them as entry points for pwning.

This is done very efficiently. If you return anything unexpected, they’ll just drop you and move on.

What about using zip bombs?

https://idiallo.com/blog/zipbomb-protection

I wonder if the abuse bots could be somehow made to mine some crypto to give back to the bills they cause

  • You could try to get them to run JavaScript, but I'm sure many of them have countermeasures.

Interesting! It's nice to see people experimenting with these, and I wonder if this kind of junk data generator will become its own product. Or maybe at least a feature/integration in existing software. I could see it going there.

These aren't scraper bots; they're vulnerability scanners. They don't expect PHP source code and probably don't even read the response body at all.

I don't know why people would assume these are AI/LLM scrapers seeking PHP source code on random servers(!) short of it being related to this brainless "AI is stealing all the data" nonsense that has infected the minds of many people here.

Hm.. why not use dumbed-down, small, self-hosted LLM networks to feed the big scrapers with bullshit?

I'd sacrifice two CPU cores for this just to make their life awful.

  • You don't need an LLM for that. There is a link in the article to an approach using Markov chains created from real-world books, but then you'd let the scrapers' LLMs reinforce their training on those books and not on random garbage.

    I would make a list of words from each word class, and a list of sentence structures where each item is a word class. Pick a pseudo-random sentence; for each word class in the sentence, pick a pseudo-random word; output; repeat. That should be pretty simple and fast.

    I'd think the most important thing though is to add delays to serving the requests. The purpose is to slow the scrapers down, not to induce demand on your garbage well.
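
    The word-class idea above could be sketched like this in Python. The vocabulary, the sentence structures and the function names are all made up for illustration; real lists would be far larger.

```python
import random

# Tiny illustrative vocabulary, one list per word class.
WORDS = {
    "DET": ["the", "a", "every"],
    "ADJ": ["green", "silent", "broken"],
    "NOUN": ["server", "river", "lantern"],
    "VERB": ["eats", "follows", "paints"],
}

# Each sentence structure is a sequence of word classes.
STRUCTURES = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
]


def garbage_sentence(rng: random.Random) -> str:
    """Pick a structure, then fill each slot with a word of that class."""
    structure = rng.choice(STRUCTURES)
    words = [rng.choice(WORDS[word_class]) for word_class in structure]
    return " ".join(words).capitalize() + "."


def garbage_text(seed: int, sentences: int = 5) -> str:
    """Generate repeatable nonsense; the seed could be derived from the URL."""
    rng = random.Random(seed)
    return " ".join(garbage_sentence(rng) for _ in range(sentences))
```

    The delay is left out of the sketch; it would be added where the response is written, for example by sleeping between chunks.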

Don’t get me wrong, but what’s the problem with scrapers? People invest in SEO to become more visible, yet at the same time they fight against “scraper bots.” I’ve always thought the whole point of publicly available information is to be visible. If you want to make money, just put it behind a paywall. Isn’t that the idea?

  • There's a difference between, on one hand, putting information online where your customers or people in general can easily find it (e.g. as a hobby) and working in concert with scraping for greater visibility via search, and, on the other, giving that work away, or at a cost, to companies who at best don't care and who may be competitors, see themselves as replacing you, or be otherwise adversarial.

    The line is between "I am technically able to do this" and "I am engaging with a system in good faith".

    Public parks are just there and I can technically drive up and dump rubbish there and if they didn't want me to they should have installed a gate and sold tickets.

    Many scrapers these days are sort of equivalent in that analogy to people starting entire fleets of waste disposal vehicles that all drive to parks to unload, putting strain on park operations and making the parks a less tenable service in general.

    • > The line is between "I am technically able to do this" and "I am engaging with a system in good faith".

      This is where the line should be, always. But in practice this criterion is applied very selectively here on HN and elsewhere.

      After all: What is ad blocking, other than direct subversion of the site owner's clear intention to make money from the viewer's attention?

      Applying your criterion here gives a very simple conclusion: If you don't want to watch the ads, don't visit the site.

      Right?

  • The old scrapers indexed your site so you may get traffic. This benefits you.

    AI scrapers will plagiarise your work and bring you zero traffic.

  • You are correct, and the hard reality is that content producers don't get to pick and choose who gets to index their public content because the bad bots don't play by the rules of robots.txt or user-agent strings. In my experience, bad bots do everything they can to identify as regular users: fake IPs, fake agent strings...so it's hard to sort them from regular traffic.

  • Did you read TFA?

    These scrapers drown people's servers in requests, taking up literally all the resources and driving up cost.