
Comment by chmod775

8 days ago

Fight fire with fire by serving these guys LLM output of made-up news. Wish them good luck noticing that in their dataset.

I think there was already some sort of fake webserver that did something like this. It basically just linked endlessly to more LLM-generated pages of nonsense.
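A minimal sketch of that idea (the word list and link format are invented here, not taken from any real tool): generate each page deterministically from its path, so a crawler sees stable-looking content on revisits, while every page links only to more generated pages with no exit:

```python
import hashlib
import random

# Invented vocabulary for the nonsense text; any word list works.
WORDS = ["lorem", "ipsum", "dolor", "quantum", "synergy",
         "vortex", "nimbus", "fractal"]

def page_for(path: str, n_links: int = 5) -> str:
    # Seed an RNG from the path so the same URL always yields the
    # same page -- repeat visits look like a real, static site.
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    # A paragraph of nonsense prose...
    body = " ".join(rng.choices(WORDS, k=50))
    # ...and links that only lead deeper into the maze.
    links = "".join(
        f'<a href="/{rng.choice(WORDS)}-{rng.randrange(10**6)}">more</a>'
        for _ in range(n_links)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Hooking `page_for` up to every route of any web framework gives the endless tarpit: each followed link calls it again with a fresh path.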

  • There are several!

    Some focus on generating content that can be served to waste crawler time: crates.io/crates/iocaine/2.1.0

    Some focus on generating linked pages: https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in...

    Some of them play the long game and try to poison models' data: https://codeberg.org/konterfai/konterfai

    There are lots more as well; those are just a few of the ones that recently made the rounds.

    I suspect that combining approaches will be a tractable way to waste crawlers' time:

    - Anubis-esque systems to defeat or delay easily-deterred or cut-rate crawlers,

    - CloudFlare or similar for more invasive-to-real-humans crawler deterrence (perhaps only served to a fraction of traffic or traffic that crosses a suspicion threshold?),

    - Junk-content rings like Nepenthes as honeypots or "A/B tests" for whether a given traffic source is an AI crawler: if it keeps following nonsense-content links endlessly, it's not a human; if it gives up quickly, it might be. This costs (and annoys) real users, but it can be used to train the traffic-analysis rules that trigger the other approaches on this list against likely-crawler traffic,

    - Model poisoners out of sheer pettiness, if it brings you joy.
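    As a rough sketch of how those layers might be wired together (the request fields, signals, and thresholds below are invented for illustration, not taken from any of the tools above): score each request's suspicion, then route it to plain content, a challenge, or the junk-content honeypot.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user_agent: str
    req_per_minute: int
    followed_honeypot_links: int  # nonsense links this client chased

def suspicion(req: Request) -> float:
    score = 0.0
    # Missing or self-identifying bot UA is an easy signal.
    if not req.user_agent or "bot" in req.user_agent.lower():
        score += 0.4
    # Sustained high request rates look like a crawler.
    if req.req_per_minute > 60:
        score += 0.3
    # Endlessly following honeypot links is near-conclusive.
    score += min(req.followed_honeypot_links * 0.1, 0.3)
    return score

def route(req: Request) -> str:
    s = suspicion(req)
    if s >= 0.7:
        return "honeypot"   # junk-content ring / model poisoner
    if s >= 0.4:
        return "challenge"  # Anubis/CloudFlare-style check
    return "content"        # serve the real site
```

    The point is less the specific weights than the feedback loop: honeypot behavior feeds back into the score, which then gates the cheaper deterrents.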

    I also wonder if serving taboo content (e.g. porn/erotica that is legal but beyond the pale for most commercial applications) would deter some AI crawlers. There might be front-side content filters that blacklist or de-prioritize sites whose main content appears (to the crawler) to sit at some intersection of inappropriate, prohibited, and not in enough demand as model output to be worth collecting.