> We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.
+1000 I feel like so much bot detection (and fraud prevention against human actors, too) is so emotionally driven. Some people hate these things so much, they're willing to cut off their nose to spite their face.
My view on this is simple:
If you're a bot that will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.
No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.
That's all, thanks.
I think the problem is that despite the effort, you will still end up in the dataset. So it's futile.
How can you tell a bot will ignore all your content licenses?
I think it’s better viewed through a lens of effort. Implementing systems that try harder to not challenge humans takes more work than just throwing up a catch-all challenge wall.
The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.
Really? If I’m an unsophisticated blog not using a CDN, and I get a $1000 bill for bandwidth overage or something, I’m gonna google a solution and slap it on there, because I don’t want to pay another $1000 for Big Basilisk. I don’t think that’s an emotional response; it’s common sense.
Seems like you've made profoundly questionable hosting or design choices for that to happen. Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.
Misbehaving crawlers are a huge problem but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries due to the rapidly mutating nature of their datasets.
Git forges, like the one TFA is discussing, are also fairly expensive, especially as crawlers traverse historical states. When a crawler is poorly implemented, it'll get stuck doing this basically forever. Detecting and dealing with git hosts is an absolute must for any web crawler because of this.
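(Illustrative sketch, not any particular crawler's logic: one way a crawler could recognize and skip the per-revision forge URLs described above. The path patterns are assumptions based on common forge layouts, not an authoritative list.)

```python
import re

# Path patterns that tend to explode into per-commit / per-revision views on
# git forges (Gitea/cgit/GitLab-style layouts). Guesses, not a complete list.
FORGE_HISTORY_PATTERNS = [
    re.compile(r"/commit/[0-9a-f]{7,40}"),   # individual commit pages
    re.compile(r"/(blame|blob|raw|tree)/"),  # per-revision file views
    re.compile(r"[?&]rev=[0-9a-f]{7,40}"),   # revision query parameters
]

def looks_like_git_history(url_path: str) -> bool:
    """True if a URL looks like a deep-history forge page that a polite
    crawler should skip or heavily rate-limit instead of fetching."""
    return any(p.search(url_path) for p in FORGE_HISTORY_PATTERNS)

assert looks_like_git_history("/some-repo/commit/3f2a1bc9") is True
assert looks_like_git_history("/blog/2024/05/a-post.html") is False
```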
Wouldn't it be easier to put the unsophisticated blog behind Cloudflare?
I've been wondering about how to make a challenge that AI won't do. Some possibilities:
* Type this sentence, taken from a famous copyrighted work.
* Type Tiananmen protests.
* Type this list of swear words or sexual organs.
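(Illustrative sketch of how the ideas above could be checked server-side; the phrases and the HMAC token scheme are placeholders, not a vetted anti-bot design.)

```python
import hmac, hashlib, secrets

# Hypothetical challenge phrases an instruction-tuned scraper might refuse to
# reproduce verbatim; pick your own (copyrighted excerpts have their own issues).
PHRASES = [
    "Tiananmen Square protests of 1989",
    "an impolite string of swear words goes here",
]
SECRET = secrets.token_bytes(32)  # per-deployment signing key

def issue_challenge():
    """Return (phrase to display, signed token to embed in the form)."""
    idx = secrets.randbelow(len(PHRASES))
    tag = hmac.new(SECRET, str(idx).encode(), hashlib.sha256).hexdigest()
    return PHRASES[idx], f"{idx}:{tag}"

def verify_challenge(token: str, typed: str) -> bool:
    """Check the token's signature and that the visitor retyped the phrase."""
    idx_str, _, tag = token.partition(":")
    if not idx_str.isdigit():
        return False
    expected = hmac.new(SECRET, idx_str.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected) and typed.strip() == PHRASES[int(idx_str)]

phrase, token = issue_challenge()
assert verify_challenge(token, phrase)
assert not verify_challenge(token, "something else")
```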
> Type this list of swear words
1998: I swear at the computer until the page loads
2025: I swear at the computer until the page loads
it's sad we've gotten to the point where mitigations against this have to be such a consideration when hosting a site
They don't really have to be. I don't have many mitigations and the AI bots crawl my site and it's fine. The robots.txt is pretty simple too and is really just set up to help the robot not get stuck in loops (I use MediaWiki as the CMS, and it has a lot of GET paths that a normal person wouldn't choose). In my case, a machine near my desk hosts everything and it's fine.
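(Illustrative only: a minimal sketch of the kind of loop-avoiding robots.txt rules being described, assuming a stock MediaWiki layout with pages under /wiki/ and scripts under /w/; not the poster's actual file, and only polite crawlers honor it at all.)

```
User-agent: *
Allow: /wiki/
Disallow: /w/
Disallow: /index.php
```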
For some reason I thought this would be about dealing with very large insects
I made my pages static HTML with no images, used a fast server, and BunnyCDN (see profile domain). Ten thousand hits a day from bots costs a penny or something. When I'm using images, I link to image hosting sites. It might get more challenging if I try to squeeze meme images in between every other paragraph to make my sites more beautiful.
As far as Ted's article goes, the first thing that popped into my head is that most AI crawlers hitting my sites come from big datacenter cities: Dallas, Dublin, etc. I wonder if I could easily geo-block those cities or redirect them to pages with more checks built in. I just haven't looked into that on my CDN or in general in a long time.
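(Illustrative sketch of that geo-routing idea using MaxMind's GeoLite2 city database via the geoip2 Python package; the city list and database path are placeholders, and city-level geolocation is coarse, so it works better as a "send to extra checks" signal than a hard block.)

```python
import geoip2.database  # pip install geoip2; needs a GeoLite2 City database file
import geoip2.errors

# Hypothetical list of datacenter-heavy cities to treat with extra suspicion.
SUSPECT_CITIES = {"Dallas", "Dublin", "Ashburn", "Council Bluffs"}

reader = geoip2.database.Reader("/path/to/GeoLite2-City.mmdb")  # placeholder path

def should_challenge(ip: str) -> bool:
    """Route requests from datacenter cities to the page with extra checks."""
    try:
        city = reader.city(ip).city.name
    except geoip2.errors.AddressNotFoundError:
        return False  # unknown location: serve normally
    return city in SUSPECT_CITIES

# e.g. in a request handler:
#     if should_challenge(request.remote_addr):
#         return redirect("/challenge")
```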
They also usually request files from popular PHP frameworks and other things like that. If you don't use PHP, you could autoban on the first request for a PHP page. Likewise for anything else you don't need.
Of the two, looking for .php is probably lightning quick with low CPU/RAM utilization in comparison.
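(Illustrative sketch of the autoban idea; the probe paths are common examples rather than a full list, and the in-memory ban set is just to show the shape of it.)

```python
import re

# On a site that serves no PHP, requests for PHP entry points are a reliable
# bot signature. The specific probe names below are illustrative examples.
PHP_PROBE = re.compile(r"\.php(\?|$)|/(wp-login|wp-admin|xmlrpc)", re.IGNORECASE)

banned_ips: set[str] = set()

def check_request(ip: str, path: str) -> bool:
    """Return True if the request should be served, False if the IP is banned."""
    if ip in banned_ips:
        return False
    if PHP_PROBE.search(path):
        banned_ips.add(ip)  # first .php request earns an immediate ban
        return False
    return True

assert check_request("203.0.113.7", "/index.php?page=1") is False
assert check_request("203.0.113.7", "/about.html") is False  # already banned
assert check_request("198.51.100.2", "/blog/post.html") is True
```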
The point of a blog is whatever the author would like it to be. It doesn't have to follow a structure or expectations. We just happen to be consuming it.
> Sorry, what's the point of this blog?
Being a blog the way the author dreamed of it.
> I hope people would write a quick abstract/summary in the first few lines and then go on elaborating.
I hope people continue doing what makes them happy. It's their site; they owe nothing to anyone (except maybe hosting/network fees, but that's not my business, either).
> Or at least put that summary at the end, in old-fashioned way.
Or maybe people can spend a couple of minutes to read and understand it, with the MSI (MeatSpaceIntelligence) which comes bundled with all human beings.
It's free, too!
Maybe, maybe... you get some pleasure from forcing people to read every bit of what you write just to get what the heck it is about. But unfortunately it is the age of AI summaries and short attention spans, not the times when you read half-foot-thick novels end to end multiple times. TL;DR!
What a totally bizarre perspective.
All of the stuff being complained about is absolute, 100% table-stakes stuff that every HTTP server on the public internet has needed to deal with for, man, I dunno, a minimum of 15 years now?
As a result, literally nobody self-hosts their own HTTP content anymore, unless they enjoy the challenge in a problem-solving sense.
If you are even self-hosting some kind of captcha system you've already made a mistake, but apparently this guy is not just self-hosting but building a bespoke one? Which is like, my dude, respect, but this is light years off the beaten path.
The author whinges about Google not doing its own internal rate limiting in some presumed distributed system before any node in that system makes an HTTP request over the open internet. That's fair, and not doing so is maybe bad behavior, but on the open internet it's the responsibility of the server to protect itself as it needs to, not the other way around.
Everything this dude is yelling about is immediately solved by hosting through a hosting provider, like everyone else does, and has done, since like 2005.