
Comment by to11mtm

14 hours ago

Not a crawler writer but have FAFOd with data structures in the past to large career success.

...

The closest you could get to having any meaningful influence is option C, with these general observations:

1. You'd need to 'randomize' the generated output link

2. You'd also want to maximize cacheability of the replayed content to minimize work.

3. Add layers of obfuscation on the frontend side, for instance a 'hidden' link (maybe with some prompt fuckery if you are brave) inside the HTML of your normal pages, pointing at a random bad link.

4. Randomize parts of the honeypot link pattern. At some point someone monitoring logs etc. will see that it's a loop and blacklist the path.

5. Keep at step 4 and hopefully they'll eventually stop crawling.
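A rough sketch of points 1, 3 and 4, not anything battle-tested: trap paths carry a random nonce plus a truncated HMAC, so the server can recognize its own honeypot links without storing them, and the visible prefix can be rotated freely. `SECRET_KEY`, `PREFIXES`, and all three function names are made up for illustration.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"replace-me"  # hypothetical server-side secret
PREFIXES = ["archive", "docs", "notes", "tags"]  # hypothetical decoy prefixes, rotate at will


def _sign(nonce: str) -> str:
    # Truncated HMAC so the link stays short; 16 hex chars = 64 bits.
    return hmac.new(SECRET_KEY, nonce.encode(), hashlib.sha256).hexdigest()[:16]


def make_trap_path() -> str:
    """Build a fresh, random-looking honeypot path."""
    nonce = secrets.token_hex(8)  # 16 hex chars
    prefix = secrets.choice(PREFIXES)
    return f"/{prefix}/{nonce}{_sign(nonce)}"


def is_trap_path(path: str) -> bool:
    """Recognize our own trap links by signature; the prefix is deliberately ignored."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or len(parts[1]) != 32:
        return False
    nonce, sig = parts[1][:16], parts[1][16:]
    return hmac.compare_digest(sig, _sign(nonce))


def hidden_link_html(path: str) -> str:
    """An anchor that human visitors should never see or tab into."""
    return (
        f'<a href="{path}" rel="nofollow" aria-hidden="true" '
        f'tabindex="-1" style="display:none">archive</a>'
    )
```

Because the check ignores the prefix, old links stay recognizable after you swap out `PREFIXES`, which is the whole point of step 4.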

---

On the lighter side...

1. Do some combination of the above, but have all honeypot links contain the right words, the kind an LLM will just nope out of for regulatory reasons.

That said, all the above will do is minimize pain (except, perhaps ironically, the joke response, which is more likely to get you blacklisted but could also get you on a list or a TLA visit)...

... Most pragmatically, I'd start by suggesting the best option is a combination of nonlinear rate limiting, both on the ramp-up and the ramp-down. That is, the faster requests come in, the more you increment their `valueToCheckAgainstLimit`. The longer it's been since the last request, the more you decrement.
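One way that could look, as a minimal sketch under my own assumptions: the ramp-down is an exponential decay (half-life in seconds) and the ramp-up is a penalty that grows quadratically as the gap between requests shrinks. The class name, parameter names, and default numbers are all invented here; `score` plays the role of `valueToCheckAgainstLimit`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientState:
    score: float = 0.0  # the "valueToCheckAgainstLimit"
    last_seen: Optional[float] = None


class NonlinearLimiter:
    def __init__(self, limit=20.0, half_life=30.0, burst_window=1.0, max_penalty=4.0):
        self.limit = limit
        self.half_life = half_life        # seconds for an idle score to halve
        self.burst_window = burst_window  # gaps shorter than this earn a penalty
        self.max_penalty = max_penalty
        self.clients = {}

    def allow(self, client: str, now: float) -> bool:
        state = self.clients.setdefault(client, ClientState())
        increment = 1.0
        if state.last_seen is not None:
            gap = max(now - state.last_seen, 0.0)
            # Ramp-down: the longer the idle gap, the more of the score is forgiven.
            state.score *= 0.5 ** (gap / self.half_life)
            # Ramp-up: the shorter the gap, the larger the increment (quadratic).
            closeness = max(0.0, 1.0 - gap / self.burst_window)
            increment += self.max_penalty * closeness ** 2
        state.score += increment
        state.last_seen = now
        return state.score <= self.limit
```

With these defaults a client hammering every 0.1 s trips the limit within a handful of requests, while one request every 10 s settles at a score of about 5 and never gets close. Rejected requests still add to the score, so a client that keeps pushing stays blocked until it actually backs off.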

Also pragmatically: if you can put together even semi-sloppy code to extend that, watch for the case where a request to a junk link gets one IP banned and another IP immediately tries the same request. Ban that second IP as soon as you see it, at least for a while.
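A semi-sloppy sketch of that follow-up ban, assuming each junk link is unique enough that a second IP requesting it almost certainly got it from the first. `TrapCorrelator`, `window`, and `ban_seconds` are hypothetical names and values.

```python
class TrapCorrelator:
    def __init__(self, window=300.0, ban_seconds=3600.0):
        self.window = window            # how long a banned path stays "hot"
        self.ban_seconds = ban_seconds  # "at least for a while"
        self.hot_paths = {}             # path -> time of the last ban-causing hit
        self.banned_until = {}          # ip -> ban expiry time

    def record_ban(self, ip: str, path: str, now: float) -> None:
        """Call when a request to a junk link gets an IP banned."""
        self.banned_until[ip] = now + self.ban_seconds
        self.hot_paths[path] = now

    def should_reject(self, ip: str, path: str, now: float) -> bool:
        if self.banned_until.get(ip, 0.0) > now:
            return True
        hit = self.hot_paths.get(path)
        if hit is not None and now - hit <= self.window:
            # A different IP picked up where the banned one left off.
            # Banning it also refreshes the window for the next follower.
            self.record_ban(ip, path, now)
            return True
        return False
```

For example, after `record_ban("192.0.2.1", "/t/abc", now=0)`, a request from `198.51.100.2` for `/t/abc` at `now=20` is rejected and that address is banned too, while the same request arriving after the window has passed goes through.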

With the right sort of lookup table, IP bans can be fairly simple to handle at the software level, although the 'first-time' elbow grease can be a challenge.
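As one guess at what "the right sort of lookup table" might be: a plain hash map from network to expiry time, with expired entries dropped lazily on lookup. Grouping IPv6 by /64 is my assumption (a single client often controls a whole /64), and `BanTable` and its parameters are hypothetical.

```python
import ipaddress


class BanTable:
    def __init__(self, prefix_v4=32, prefix_v6=64):
        self.prefix_v4 = prefix_v4
        self.prefix_v6 = prefix_v6
        self._expiry = {}  # network object -> ban expiry time

    def _key(self, ip: str):
        addr = ipaddress.ip_address(ip)
        prefix = self.prefix_v4 if addr.version == 4 else self.prefix_v6
        # strict=False masks off the host bits, so neighbours share one key.
        return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)

    def ban(self, ip: str, seconds: float, now: float) -> None:
        key = self._key(ip)
        # Never shorten an existing ban.
        self._expiry[key] = max(self._expiry.get(key, 0.0), now + seconds)

    def is_banned(self, ip: str, now: float) -> bool:
        key = self._key(ip)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if expiry <= now:
            del self._expiry[key]  # lazy cleanup of expired bans
            return False
        return True
```

Each lookup is a single dictionary access, so the check is cheap enough to run on every request; the elbow grease is mostly in wiring it in front of everything else.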