
Comment by darknoon

1 month ago

Here's the problem: you're still going to get scraped, and the LLM will understand it anyway. Maybe at best you'll get filtered out of the dataset because it's high-perplexity text?
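
For concreteness, "filtered out because it's high perplexity" would mean something like the sketch below. This is a minimal illustration, not anyone's actual pipeline: it assumes a small GPT-2 checkpoint from Hugging Face stands in for the quality model, and PPL_CUTOFF is a made-up threshold.

```python
# Sketch of perplexity-based quality filtering, the kind of cheap check a
# dataset pipeline might run instead of decoding everything with a large LLM.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy loss over the tokens; exponentiate to get perplexity.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

# Hypothetical cutoff: obfuscated or garbled documents score far above
# normal prose and get dropped before training.
PPL_CUTOFF = 1000.0

def keep_for_training(text: str) -> bool:
    return perplexity(text) < PPL_CUTOFF
```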

Do training scrapers really feed all their input through an LLM to decode it? That sounds expensive, and most content probably doesn't need it. If they don't, then this method probably works to keep your stuff out of the training datasets.

Is the problem with scraping the bandwidth usage, or the stealing of the content? The point here doesn't seem to be obfuscation against direct LLM inference (I mean, I use Shottr on my MacBook to immediately OCR my screenshots), but rather stopping you from ending up in the dataset.

Is there a reason you believe getting filtered out is only a "maybe"? Not getting filtered out would seem to imply that LLM training can naturally extract meaning from obfuscated tokens. If that's the case, LLMs are more impressive than I thought.