I use iocaine[0] to generate a tarpit. Yesterday it served ~278k "pages" consisting of ~500MB of gibberish (and that's despite banning most AI scrapers in robots.txt).
[0] https://iocaine.madhouse-project.org
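For anyone curious, "banning most AI scrapers in robots.txt" looks roughly like this. The user-agent tokens below are real crawler names, but the list is illustrative rather than my exact file, and of course only the better-behaved crawlers respect it:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Bytespider
    Disallow: /

The ones that ignore it are what end up in the tarpit.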
Can't seem to access this.
It flashes some text briefly, then gives me a 418 TEAPOT response. I wonder if it's because I'm on Linux?
EDIT: Begrudgingly checked Chrome, and it loads. I guess it doesn't like Firefox?
Doesn't work on my Firefox either.
Friendly fire, I suppose.
Nor Safari on iOS.
Unfortunately, you kind of have to count this as the cost of the Internet. You've wasted 500MB of bandwidth.
I've had colocation for eight-plus years. My monthly bandwidth is now around 20-30GB a month given to scrapers, where I was only using 1-2GB a month in years prior.
I pay for premium bandwidth (it's a thing) and only get 2TB of usable data. Do I go offline or let it continue?
> You've wasted 500MB of bandwidth.
Yep, it sucks, but on the positive side, I'm feeding 500MB of garbage back to them every day, and that feels like enough of a small win for me.
> My monthly bandwidth is now around 20-30GB a month given to scrapers [...] 1-2GB a month
That definitely sucks.
> Do I go offline or let it continue?
Might be time to start blocking entire IP ranges and ASNs and see if that helps.
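If you go that route, the check itself is cheap to script. Here's a minimal sketch; the prefixes are placeholders, and in practice you'd feed in the offending ASN's announced prefixes from a whois/RADb dump (or whatever your firewall already consumes):

    # Minimal sketch: reject requests whose source IP falls inside a blocked
    # prefix. The prefixes below are documentation ranges, not real targets.
    import ipaddress

    BLOCKED_PREFIXES = [
        ipaddress.ip_network("203.0.113.0/24"),   # placeholder prefix
        ipaddress.ip_network("198.51.100.0/24"),  # placeholder prefix
    ]

    def is_blocked(remote_addr: str) -> bool:
        ip = ipaddress.ip_address(remote_addr)
        return any(ip in net for net in BLOCKED_PREFIXES)

    if __name__ == "__main__":
        print(is_blocked("203.0.113.42"))  # True
        print(is_blocked("192.0.2.1"))     # False

The same lookup works as app middleware or as a generator for firewall rules; the hard part is keeping the prefix list current, not the matching.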
I have no idea what this does because the site is rejecting my ordinary Firefox browser with "Error code: 418 I'm a teapot", even from a private window.
If I hit it with Chrome, I can see the site.
Seems not ready for prime time, as a lot of my visitors use Firefox.
One of the most popular ones is Anubis. It uses a proof of work and can even do poisoning: https://news.ycombinator.com/item?id=44378127
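For anyone who hasn't seen it: the idea is that the browser burns a bit of CPU finding a nonce whose hash clears a difficulty target before it gets a pass. A toy version of that check (not Anubis's actual code or parameters):

    # Toy proof-of-work in the Anubis style: find a nonce such that
    # sha256(challenge + nonce) starts with N zero hex digits. Real
    # deployments differ in encoding, difficulty, and how the result
    # is bound to a cookie.
    import hashlib
    from itertools import count

    def solve(challenge: str, difficulty: int = 4) -> int:
        target = "0" * difficulty
        for nonce in count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce

    def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * difficulty)

    if __name__ == "__main__":
        n = solve("example-challenge")
        print(n, verify("example-challenge", n))  # some nonce, True

The cost is negligible for one human page load but adds up for a crawler hammering thousands of URLs.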
Anubis is the only tool that claims to have heuristics to identify a bot, but my understanding is that it does this by presenting obnoxious challenges to all users, which isn't really feasible. Old-school approaches like IP blocking or even ASN blocking are obsolete: these crawlers deliberately spam from thousands of IPs, and if you block them on a common ASN, they come back a few days later from thousands of unique ASNs. So this is not really a "roll your own" situation, especially if you are running off-the-shelf software that doesn't have a straightforward way to build in these various approaches, like endless page mazes (which I would still have to serve anyway).
https://forge.hackers.town/hackers.town/nepenthes
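Roughly, these maze generators serve procedurally generated text on every URL, with links that lead to yet more generated URLs, so a crawler that follows links never runs out of pages. A toy sketch of the idea (nepenthes and iocaine are far more sophisticated, with Markov-chain text, throttling, and so on):

    # Toy link-maze tarpit: every URL returns cheap generated "content" plus
    # links to more URLs inside the maze, so link-following crawlers loop
    # forever. Illustration only, not any real tool's implementation.
    import random
    from http.server import BaseHTTPRequestHandler, HTTPServer

    WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "tarpit", "crawler"]

    class MazeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Seed on the path so the same URL always serves the same page.
            rng = random.Random(self.path)
            text = " ".join(rng.choice(WORDS) for _ in range(200))
            links = " ".join(
                '<a href="/%08x">more</a>' % rng.getrandbits(32) for _ in range(5)
            )
            body = f"<html><body><p>{text}</p>{links}</body></html>".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()

You'd normally hide something like this behind your reverse proxy and only route suspect user agents into it, so real visitors never see it.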
> Citation needed
This reply kinda sucks :)