Comment by bobbiechen

10 hours ago

>We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.

+1000 I feel like so much bot detection (and fraud prevention against human actors, too) is so emotionally driven. Some people hate these things so much, they're willing to cut off their nose to spite their face.

My view on this is simple:

If you're a bot which will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.

No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.

That's all, thanks.

  • I think the problem is that despite the effort, you will still end up in the dataset. So it's futile

  • How can you tell a bot will ignore all your content licenses?

    • Currently all AI companies argue that the content they use falls under fair use, and disregard all licenses. This means any future crawler that does respect these licenses needs to be whitelisted.


I think it’s better viewed through a lens of effort. Implementing systems that try harder to not challenge humans takes more work than just throwing up a catch-all challenge wall.

The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.

Really? If I’m an unsophisticated blog not using a CDN, and I get a $1000 bill for bandwidth overage or something, I’m gonna google a solve and slap it on there because I don’t want to pay another $1000 for Big Basilisk. I don’t think that’s an emotional response; it’s common sense.

  • Seems like you've made profoundly questionable hosting or design choices for that to happen. Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

    Misbehaving crawlers are a huge problem but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries due to the rapidly mutating nature of their datasets.

    Git forges, like the one TFA is discussing, are also fairly expensive, especially as crawlers traverse historical states. When the crawler is poorly implemented they'll get stuck doing this basically forever. Detecting and dealing with git hosts is an absolute must for any web crawler due to this.
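    As a rough illustration of what "dealing with git hosts" can mean on the crawler side, here is a minimal sketch that filters out URLs which look like per-commit history views before they reach the fetch queue. The path patterns (/commit/<sha>, /blame/, /raw/, /tree/<sha>) are assumptions for the example, not any particular forge's actual routes.

      import re
      from urllib.parse import urlparse

      # Illustrative patterns for git-forge "historical state" pages
      # (per-commit views, blame, raw snapshots). These are assumptions
      # for this sketch, not an exhaustive or forge-specific list.
      HISTORY_PATTERNS = [
          re.compile(r"/commit/[0-9a-f]{7,40}"),
          re.compile(r"/blame/"),
          re.compile(r"/raw/"),
          re.compile(r"/tree/[0-9a-f]{7,40}"),
      ]

      def should_crawl(url: str) -> bool:
          """Skip URLs that look like per-commit history views, which can
          turn one repository into a near-endless set of expensive pages."""
          path = urlparse(url).path
          return not any(p.search(path) for p in HISTORY_PATTERNS)

      # Example: the repo front page is fine, a commit-pinned blame view is not.
      print(should_crawl("https://example.org/repo/src"))                      # True
      print(should_crawl("https://example.org/repo/blame/abc123def0/main.c"))  # False

    A real crawler would likely combine this kind of URL filtering with per-host budgets and robots.txt handling, but the point stands: without something like it, a git forge can keep a naive crawler busy forever.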

    • >Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

      I actually find this surprisingly difficult to find.

      I just want static hosting (like Netlify or Firebase Hosting), but there aren't many hosts that offer that.

      There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.


  • Wouldn't it be easier to put the unsophisticated blog behind Cloudflare?

    • As much as I like to shit on Cloudflare at every opportunity, it would obviously be easier to put it behind CF than to install bot-detection plugins.