Comment by m3047
20 days ago
(I was going to post "run a bot motel" as a topline, but I get tired of sounding like a broken record.)
To generate garbage data I've had good success using Markov chains in the past. These days I think I'd try an LLM and turn up the "heat" (the sampling temperature).
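For anyone who hasn't built one, here is a minimal sketch of a word-level Markov chain text generator. It is not the code from the comment above, just an illustration; the names `build_chain` and `generate` and the `order` parameter are made up for the example, and a real setup would train on a much larger source text.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return dict(chain)


def generate(chain, length=50, seed=None):
    """Walk the chain to produce `length` words of plausible-looking junk."""
    if not chain:
        raise ValueError("chain is empty; source text was too short")
    rng = random.Random(seed)
    keys = sorted(chain)          # sorted so a fixed seed gives a fixed result
    order = len(keys[0])
    state = rng.choice(keys)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if followers is None:
            # Dead end (state only occurs at the end of the source): restart.
            state = rng.choice(keys)
            out.extend(state)
            continue
        out.append(rng.choice(followers))
        state = tuple(out[-order:])
    return " ".join(out[:length])
```

Building the chain is one pass over the source text and each generated word is a dictionary lookup plus a random pick, which is why this approach is so cheap compared with running a model.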
Wouldn't your own LLM be overkill? Ideally one would generate decoy junk much more efficiently than these abusive/hostile attackers can steal it.
I still think this could be worthwhile, though, for these reasons:
- One "quality" poisoned document may be able to do more damage
- Many crawlers will be getting this poison, so this multiplies the effect by a lot
- The cost of generation seems to be much below market value at the moment
I didn't run the text generator in real time (that would defeat the point of shifting cost to the adversary, wouldn't it?). I created and cached a corpus, and then selectively made small edits (primarily URL rewriting) on the way out.
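As a rough sketch of that "generate once, rewrite on the way out" idea (my guess at the shape, not the actual code described above): pages are pre-generated with a placeholder in their links, and at serve time the request path picks a cached page and the placeholders are swapped for fresh URLs that lead deeper into the trap. `CORPUS`, the `LINK` placeholder, `serve`, and the `/motel/` prefix are all hypothetical names for illustration.

```python
import hashlib
import itertools
import re

# Hypothetical pre-generated corpus; in practice this would be loaded from disk.
CORPUS = [
    'Junk page one. <a href="LINK">more</a> and <a href="LINK">still more</a>',
    'Junk page two. <a href="LINK">continue</a>',
    'Junk page three. <a href="LINK">next</a> or <a href="LINK">previous</a>',
]

LINK_RE = re.compile(r'href="LINK"')


def _digest(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def serve(path, prefix="/motel/"):
    """Return a cached page for `path`, with its links rewritten on the way out."""
    # The same path always maps to the same cached page, so no generation
    # happens at request time.
    page = CORPUS[int(_digest(path)[:8], 16) % len(CORPUS)]
    counter = itertools.count()

    def rewrite(_match):
        # Derive each child URL from the parent path plus the link's position.
        child = _digest(f"{path}#{next(counter)}")[:12]
        return f'href="{prefix}{child}"'

    return LINK_RE.sub(rewrite, page)
```

The per-request work is one hash and one regex substitution, so the cost stays on the crawler's side, and because everything is derived from the path, a revisit to the same URL gets the same page and the same outgoing links.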