Comment by huac
20 hours ago
from an AI research perspective -- it's pretty straightforward to mitigate this attack
1. perplexity filtering - a small LLM scores how likely the data is under its distribution. if perplexity is too high (gibberish like this) or too low (likely already LLM-generated at low temperature, or already memorized), toss it out.
2. models can learn to prioritize/deprioritize data based on nothing more than the domain name it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
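the perplexity filter in (1) can be sketched in a few lines. this is a minimal illustration, not anyone's production pipeline: it assumes you already have per-token log-probs from some small scoring LLM, and the `low`/`high` thresholds are made-up numbers for illustration.

```python
import math

def perplexity(logprobs):
    # perplexity = exp(-mean token log-prob); higher = more surprising to the model
    return math.exp(-sum(logprobs) / len(logprobs))

def keep_document(logprobs, low=5.0, high=1000.0):
    # keep only documents in a plausible band:
    #   below `low`  -> suspiciously predictable (memorized / LLM-generated at low temp)
    #   above `high` -> gibberish the model can't predict at all
    # thresholds are hypothetical; real pipelines tune them per-corpus
    ppl = perplexity(logprobs)
    return low <= ppl <= high

# hypothetical per-token log-probs from the scoring model:
assert keep_document([-3.0] * 10)        # ordinary text, ppl ~ 20 -> keep
assert not keep_document([-8.0] * 10)    # gibberish, ppl ~ 2981 -> drop
assert not keep_document([-0.1] * 10)    # near-verbatim, ppl ~ 1.1 -> drop
```

the only real design choice here is two-sided filtering: dropping the low-perplexity tail is what catches model-generated spam that a naive "remove gibberish" filter would happily keep.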
So not only do I waste their crawling resources, but they may also deprioritise or block my site from further crawling? Where do I sign up?