
Comment by darknoon

1 month ago

Here's the problem: you're still going to get scraped, and the LLM will understand it anyway. Maybe at best you'll get filtered out of the dataset because it's high-perplexity text?
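
For concreteness, "filtered out because it's high perplexity" would mean something like the sketch below. This is a minimal illustration, not anyone's actual pipeline: it assumes a small GPT-2 checkpoint from Hugging Face stands in for the quality model, and PPL_CUTOFF is a made-up threshold.

```python
# Sketch of perplexity-based quality filtering, the kind of cheap check a
# dataset pipeline might run instead of decoding everything with a large LLM.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy loss over the tokens; exponentiate to get perplexity.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

# Hypothetical cutoff: obfuscated or garbled documents score far above
# normal prose and get dropped before training.
PPL_CUTOFF = 1000.0

def keep_for_training(text: str) -> bool:
    return perplexity(text) < PPL_CUTOFF
```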

Do training scrapers really feed all their input through an LLM to decode it? That sounds expensive, and most content probably doesn't need it. If they don't, then this method probably works to keep your stuff out of the training datasets.

Is the problem with scraping the bandwidth usage, or the stealing of the content? The point here doesn't seem to be obfuscation against direct LLM inference (I mean, I use Shottr on my MacBook to immediately OCR my screenshots), but rather stopping you from ending up in the dataset.

Is there a reason you believe getting filtered out is only a "maybe"? Not getting filtered out would seem to imply that LLM training can naturally extract meaning from obfuscated tokens. If that's the case, LLMs are more impressive than I thought.