Comment by johntash
8 days ago
I think there was some sort of fake webserver that did something like this already. Basically just linked endlessly to more llm-generated pages of nonsense.
There are several!
Some focus on generating content that can be served to waste crawler time: crates.io/crates/iocaine/2.1.0
Some focus on generating linked pages: https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in...
Some of them play the long game and try to poison models' data: https://codeberg.org/konterfai/konterfai
There are lots more as well; those are just a few of the ones that recently made the rounds.
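The core trick is the same across most of them: every URL resolves to a generated page that links to more generated URLs, so a crawler that follows links never runs out of pages. A minimal sketch in Python, not taken from any of those specific projects (all names and numbers here are invented):

```python
# Minimal sketch of an endless link-maze tarpit, in the spirit of the tools
# above but not based on any of them; all names and numbers are invented.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "elit"]

def page_for(path: str) -> bytes:
    # Seed the RNG from the path so each URL renders the same page every
    # time; the maze looks like a stable static site, not random noise.
    rng = random.Random(hashlib.sha256(path.encode()).hexdigest())
    filler = " ".join(rng.choice(WORDS) for _ in range(300))
    links = " ".join(
        f'<a href="/{rng.getrandbits(64):016x}">more</a>' for _ in range(10)
    )
    return f"<html><body><p>{filler}</p>{links}</body></html>".encode()

class Tarpit(BaseHTTPRequestHandler):
    def do_GET(self):
        body = page_for(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve the maze on a port real users never see linked from navigation.
    HTTPServer(("", 8080), Tarpit).serve_forever()
```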
I suspect that combining approaches will be a tractable way to waste crawlers' time:
- Anubis-esque systems to defeat or delay easily-deterred or cut-rate crawlers,
- CloudFlare or similar for more invasive-to-real-humans crawler deterrence (perhaps only served to a fraction of traffic or traffic that crosses a suspicion threshold?),
- Junk content rings like Nepenthes as honeypots or "A/B tests" for whether particular traffic is an AI or not: if it keeps following nonsense-content links endlessly, it's not a human; if it gives up pretty quickly, it might be. This costs/pisses off real users who wander in, but the results can train traffic-analysis rules that trigger the other approaches on this list against detected likely-crawler traffic (see the sketch after this list).
- Model poisoners out of sheer pettiness, if it brings you joy.
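For the honeypot-as-classifier item above, a rough sketch of the routing logic. The maze path, the depth threshold, and the per-IP counters are all assumptions for illustration, not anyone's real implementation:

```python
# Hypothetical sketch: humans bail out of junk content quickly, so any
# client that follows maze links past a threshold gets flagged and routed
# to the harsher defenses above. MAZE_PREFIX and FLAG_DEPTH are invented.
from collections import defaultdict

MAZE_PREFIX = "/maze/"   # where the junk-content ring is mounted (assumed)
FLAG_DEPTH = 5           # humans rarely click this many nonsense links

maze_hits: defaultdict[str, int] = defaultdict(int)
flagged: set[str] = set()

def classify(client_ip: str, path: str) -> str:
    """Return a routing decision ("tarpit" or "normal") for one request."""
    if client_ip in flagged:
        return "tarpit"              # already caught: keep wasting its time
    if path.startswith(MAZE_PREFIX):
        maze_hits[client_ip] += 1
        if maze_hits[client_ip] >= FLAG_DEPTH:
            flagged.add(client_ip)   # endless following => not a human
            return "tarpit"
    else:
        maze_hits[client_ip] = 0     # left the maze quickly: likely human
    return "normal"
```

A real deployment would decay the counters over time and key on more than the IP, but the shape is the same: the maze doubles as both the trap and the detector.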
I also wonder if serving taboo content (e.g. legal but beyond-the-pale-for-most-commercial-applications porn/erotica) would deter some AI crawlers. There might be front-side content filters that blacklist or de-prioritize sites whose main content appears (to the crawler) to be inappropriate or prohibited, and not in enough demand for model output to be worth ingesting.