Comment by m3047

19 days ago

(I was going to post "run a bot motel" as a topline, but I get tired of sounding like a broken record.)

To generate garbage data I've had good success using Markov chains in the past. These days I think I'd try an LLM and turn up the "heat".
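For anyone who hasn't built one, here is a minimal sketch of a word-level Markov chain text generator in Python. It is only an illustration of the general technique, not the code I used; the names `build_chain` and `generate` and the `order` and `seed` parameters are made up for this example.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return dict(chain)


def generate(chain, length=50, seed=None):
    """Walk the chain to produce `length` words of plausible-looking junk."""
    rng = random.Random(seed)
    keys = sorted(chain)          # sorted so a fixed seed gives a fixed result
    order = len(keys[0])
    out = list(rng.choice(keys))  # start from a random state
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            # Dead end (state never seen with a follower): jump to a new state.
            out.extend(rng.choice(keys))
    return " ".join(out[:length])
```

Feed `build_chain` any large body of text you have lying around, then call `generate` as many times as you need documents. A higher `order` gives more coherent output but copies longer stretches of the source verbatim.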

Wouldn't your own LLM be overkill? Ideally one would generate decoy junk much more efficiently than these abusive/hostile attackers can steal it.

  • I still think this could be worthwhile, though, for these reasons.

    - One "quality" poisoned document may be able to do more damage
    - Many crawlers will be getting this poison, so this multiplies the effect by a lot
    - The cost of generation seems to be much below market value at the moment

  • I didn't run the text generator in real time (that would defeat the point of shifting cost to the adversary, wouldn't it?). I created and cached a corpus, and then selectively made small edits (primarily URL rewriting) on the way out.
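
A rough sketch of the "cached corpus plus small edits on the way out" idea, in Python. Everything here is assumed for illustration: the `/motel/` prefix, the function names, and the hash-based way of choosing a document and minting links are hypothetical, not a description of the actual setup.

```python
import hashlib
import re

# Matches any href attribute in a cached document (assumes double-quoted values).
LINK_RE = re.compile(r'href="[^"]*"')


def pick_document(corpus, path):
    """Choose a cached document deterministically from the requested path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return corpus[int.from_bytes(digest[:4], "big") % len(corpus)]


def rewrite_links(doc, path, prefix="/motel/"):
    """Replace every link with a fresh URL that leads back into the trap."""
    counter = 0

    def repl(match):
        nonlocal counter
        token = hashlib.sha256(f"{path}:{counter}".encode("utf-8")).hexdigest()[:12]
        counter += 1
        return f'href="{prefix}{token}"'

    return LINK_RE.sub(repl, doc)


def serve(corpus, path):
    """Per-request work is one hash lookup and one regex pass, no generation."""
    return rewrite_links(pick_document(corpus, path), path)
```

The expensive step (generating the corpus) happens once, offline. Each request only costs a hash and a substitution, while the crawler sees what looks like an endless supply of new pages and links.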