
Comment by chmod775

8 days ago

Fight fire with fire by serving these guys LLM output of made-up news. Wish them good luck noticing that in their dataset.

I think there was already some sort of fake webserver that did something like this. It basically just linked endlessly to more LLM-generated pages of nonsense.
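A minimal sketch of that idea (the word list and link format are invented here, not taken from any real tool): generate each page deterministically from its path, so a crawler sees stable-looking content on revisits, while every page links only to more generated pages with no exit:

```python
import hashlib
import random

# Invented vocabulary for the nonsense text; any word list works.
WORDS = ["lorem", "ipsum", "dolor", "quantum", "synergy",
         "vortex", "nimbus", "fractal"]

def page_for(path: str, n_links: int = 5) -> str:
    # Seed an RNG from the path so the same URL always yields the
    # same page -- repeat visits look like a real, static site.
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    # A paragraph of nonsense prose...
    body = " ".join(rng.choices(WORDS, k=50))
    # ...and links that only lead deeper into the maze.
    links = "".join(
        f'<a href="/{rng.choice(WORDS)}-{rng.randrange(10**6)}">more</a>'
        for _ in range(n_links)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Hooking `page_for` up to every route of any web framework gives the endless tarpit: each followed link calls it again with a fresh path.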

  • There are several!

    Some focus on generating content that can be served to waste crawler time: crates.io/crates/iocaine/2.1.0

    Some focus on generating linked pages: https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in...

    Some of them play the long game and try to poison models' data: https://codeberg.org/konterfai/konterfai

    There are lots more as well; those are just a few of the ones that recently made the rounds.

    I suspect that combining approaches will be a tractable way to waste crawlers' time:

    - Anubis-esque systems to defeat or delay easily-deterred or cut-rate crawlers,

    - CloudFlare or similar for more invasive-to-real-humans crawler deterrence (perhaps only served to a fraction of traffic or traffic that crosses a suspicion threshold?),

    - Junk-content rings like Nepenthes as honeypots or "A/B tests" for whether a given traffic source is an AI crawler: if it keeps following nonsense-content links endlessly, it's not a human; if it gives up quickly, it might be. This costs (and annoys) real users, but it can be used to train the traffic-analysis rules that trigger the other approaches on this list against likely-crawler traffic,

    - Model poisoners out of sheer pettiness, if it brings you joy.
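    As a rough sketch of how those layers might be wired together (the request fields, signals, and thresholds below are invented for illustration, not taken from any of the tools above): score each request's suspicion, then route it to plain content, a challenge, or the junk-content honeypot.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user_agent: str
    req_per_minute: int
    followed_honeypot_links: int  # nonsense links this client chased

def suspicion(req: Request) -> float:
    score = 0.0
    # Missing or self-identifying bot UA is an easy signal.
    if not req.user_agent or "bot" in req.user_agent.lower():
        score += 0.4
    # Sustained high request rates look like a crawler.
    if req.req_per_minute > 60:
        score += 0.3
    # Endlessly following honeypot links is near-conclusive.
    score += min(req.followed_honeypot_links * 0.1, 0.3)
    return score

def route(req: Request) -> str:
    s = suspicion(req)
    if s >= 0.7:
        return "honeypot"   # junk-content ring / model poisoner
    if s >= 0.4:
        return "challenge"  # Anubis/CloudFlare-style check
    return "content"        # serve the real site
```

    The point is less the specific weights than the feedback loop: honeypot behavior feeds back into the score, which then gates the cheaper deterrents.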

    I also wonder if serving taboo content (e.g. porn/erotica that is legal but beyond the pale for most commercial applications) would deter some AI crawlers. There might be front-side content filters that blacklist or de-prioritize sites whose main content appears (to the crawler) to sit at some intersection of inappropriate, prohibited, and not in enough demand as model output to be worth collecting.