Comment by mschuster91

5 months ago

A global tarpit is the solution. It makes sense anyway, even without taking AI crawlers into account. Back when I had to implement that, I went the semi-manual route: parse the access log, and any IP address averaging more than X hits a second on /api gets a -j TARPIT with iptables [1].
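A rough sketch of that semi-manual route, for illustration only. It assumes an access log in the common/combined format, and the names `find_offenders` and `tarpit_rules` are hypothetical. The TARPIT target is not part of stock iptables (it ships with xtables-addons) and only applies to TCP, hence the `-p tcp`. The script only builds the rule strings; it does not run them.

```python
import re
from collections import defaultdict
from datetime import datetime

# Common/combined log format: IP ident user [timestamp] "METHOD path HTTP/x" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+)'
)
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def find_offenders(lines, max_hits_per_sec, prefix="/api"):
    """Return the IPs whose average hit rate on `prefix` exceeds the threshold."""
    hits = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or not m.group("path").startswith(prefix):
            continue  # unparseable line, or a request outside the watched path
        hits[m.group("ip")].append(datetime.strptime(m.group("ts"), TS_FORMAT))

    offenders = []
    for ip, stamps in hits.items():
        # Average over the time between first and last hit, clamped to one
        # second so a burst inside a single second does not divide by zero.
        span = max((max(stamps) - min(stamps)).total_seconds(), 1.0)
        if len(stamps) / span > max_hits_per_sec:
            offenders.append(ip)
    return sorted(offenders)


def tarpit_rules(ips):
    """Build (but do not execute) one iptables TARPIT rule per offending IP."""
    return [f"iptables -A INPUT -s {ip} -p tcp -j TARPIT" for ip in ips]
```

Feeding it the lines of the log and a threshold, e.g. `tarpit_rules(find_offenders(open("access.log"), 5))`, gives a list of commands to review before applying them as root. The log path and the threshold of 5 are placeholders.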

Not sure how to implement it in the cloud though, never had the need for that there yet.

[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...

6 comments


jks 5 months ago

One such tarpit (Nepenthes) was just recently mentioned on Hacker News: https://web.archive.org/web/20250117030633/https://zadzmo.or...

marcus0x62 5 months ago

Quixotic[0] (my content obfuscator) includes a tarpit component, but for something like this, I think the main quixotic tool would be better - you run it against your content once, and it generates a pre-obfuscated version of it. It takes a lot less of your resources to serve than dynamically generating the tarpit links and content.

[0] https://marcusb.org/hacks/quixotic.html

kazinator 5 months ago

How do you know their site is down? You probably just hit their tarpit. :)

bwfan123 5 months ago

I would think public outcry by influencers on social media (such as this thread) is a better deterrent, since it is hard to scale the tarpit. It also establishes a public data point and exhibit for future reference.

idlewords 5 months ago

This doesn't work with the kind of highly distributed crawling that is the problem now.

seethenerdz 5 months ago

Don't we have intellectual property law for this tho?