Comment by zzzeek

1 day ago

Citation needed

I use iocaine[0] to generate a tarpit. Yesterday it served ~278k "pages" consisting of ~500MB of gibberish (and that's despite banning most AI scrapers in robots.txt.)

[0] https://iocaine.madhouse-project.org

  • Unfortunately, you kind of have to count this as the cost of the Internet. You've wasted 500MB of bandwidth.

    I've had colocation for eight-plus years. My monthly b/w cost is now around 20-30GB a month given to scrapers, where I was only using 1-2GB a month in years prior.

    I pay for premium bandwidth (it's a thing) and only get 2TB of usable data. Do I go offline or let it continue?

    • > You've wasted 500MB of bandwidth.

      Yep, it sucks, but on the positive side, I'm feeding 500MB of garbage into them every day, and that feels like enough of a small win for me.

      > My monthly b/w cost is now around 20-30GB a month given to scrapers [...] 1-2GB a month

      That definitely sucks.

      > Do I go offline or let it continue?

      Might be time to start blocking entire IP ranges and ASNs and see if that helps.

  • I have no idea what this does, because the site is rejecting my ordinary Firefox browser with "Error code: 418 I'm a teapot". Even from a private window.

    If I hit it with Chrome, now I can see a site.

    Seems not quite ready for prime time, as a lot of my viewers use Firefox.
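
For readers wondering what "blocking entire IP ranges" from the reply above might look like in practice, here is a minimal sketch using Python's standard `ipaddress` module. The `BLOCKED_RANGES` list and the `is_blocked` helper are hypothetical names for illustration, and the CIDR blocks are reserved documentation ranges, not real scraper networks. Mapping an ASN to its prefixes needs an external data source and is not shown.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges
# (RFC 5737 / RFC 3849), used here only as placeholders.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside any blocked CIDR range."""
    ip = ipaddress.ip_address(addr)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-version comparisons simply evaluate to False.
    return any(ip in net for net in BLOCKED_RANGES)


if __name__ == "__main__":
    for candidate in ("192.0.2.77", "203.0.113.5", "2001:db8::1"):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

In a real deployment this check would more likely live in the firewall or reverse proxy than in application code; the sketch only shows the matching logic.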

One of the most popular ones is Anubis. It uses a proof of work and can even do poisoning: https://news.ycombinator.com/item?id=44378127

  • Anubis is the only tool that claims to have heuristics to identify a bot, but my understanding is that it does this by presenting obnoxious challenges to all users. Not really feasible. Old-school approaches like IP blocking or even ASN blocking are obsolete: these crawlers purposely spam from thousands of IPs, and if you block them on a common ASN, they come back a few days later from thousands of unique ASNs. So this is not really a "roll your own" situation, especially if you are running off-the-shelf software that doesn't have some straightforward means of building in these various approaches, such as endless page mazes (which I would still have to serve anyway).
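
Since "proof of work" comes up above without explanation, here is a generic hashcash-style sketch of the idea: the server hands out a challenge string, and the client must find a nonce whose SHA-256 hash has a given number of leading zero bits. This is an illustration of the general technique only, not Anubis's actual implementation; the function names, the `challenge:nonce` encoding, and the difficulty value are all assumptions.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    # Hypothetical challenge string and difficulty.
    nonce = solve("example-challenge", 12)
    print("nonce:", nonce, "valid:", verify("example-challenge", nonce, 12))
```

The asymmetry is the point: verifying costs one hash, while solving costs thousands, which is negligible for one human page view but adds up for a crawler fetching hundreds of thousands of pages.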