
Comment by walterbell

19 days ago

OpenAI publishes IP ranges for their bots, https://github.com/greyhat-academy/lists.d/blob/main/scraper...

For antisocial scrapers, there's a WordPress plugin, https://kevinfreitas.net/tools-experiments/

> The words you write and publish on your website are yours. Instead of blocking AI/LLM scraper bots from stealing your stuff why not poison them with garbage content instead? This plugin scrambles the words in the content on blog post and pages on your site when one of these bots slithers by.
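As a sketch of what such a plugin might do (an illustration in Python, not the plugin's actual PHP): check the request's user agent against known AI-crawler markers, and if one matches, shuffle the word order of the page text so it stays superficially text-like while the prose becomes worthless as training data. The marker list below is illustrative, not authoritative.

```python
import random

# User-agent substrings published by some AI crawlers (illustrative list).
AI_BOT_MARKERS = ("GPTBot", "CCBot", "anthropic-ai", "Google-Extended")

def is_ai_scraper(user_agent):
    """True if the user agent contains a known AI-crawler marker."""
    ua = user_agent.lower()
    return any(marker.lower() in ua for marker in AI_BOT_MARKERS)

def scramble_words(text, seed=None):
    """Shuffle word order: same vocabulary, destroyed meaning."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def serve(content, user_agent):
    """Serve scrambled text to AI scrapers, the real thing to everyone else."""
    return scramble_words(content) if is_ai_scraper(user_agent) else content
```

Humans and ordinary browsers get the original text; only requests matching the marker list are poisoned.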

I have zero faith that OpenAI respects attempts to block their scrapers.

  • That’s what makes this clever.

    They aren’t blocking them; they’re serving them different content instead.

The latter is clever but unlikely to do any harm. These companies spend a fortune on pre-training and doubtless have filters to remove garbage text. There are enough SEO spam pages that just list nonsense words that they would have to.

  • 1. It is a moral victory: at least they won't use your own text.

    2. As a sibling proposes, this will probably become a perpetual arms race (even if a very small one in volume) between tech-savvy content creators of many kinds and AI companies' scrapers.

  • Obfuscators can evolve alongside other LLM arms races.

    • Yes, but the attacker has the advantage, because defeating the obfuscation directly improves their own product even absent this specific motivation: any Completely Automated Public Turing test to tell Computers and Humans Apart can be used to improve the output of an AI by requiring the AI to pass that test.

      And indeed, this has been part of the training process for at least some of OpenAI's models since before most people had heard of them.

  • Seems like an effective technique for preventing your content from being included in the training data then!

  • It will do harm to their own site, considering it's now un-indexable on platforms used by hundreds of millions of people and growing. Anyone using this is just guaranteeing that their content will be lost to history at worst, or inaccessible to most search engines/users at best. Congrats on beating the robots; now every time someone searches for your site they'll be taken straight to competitors.

    • > now every time someone searches for your site they will be taken straight to competitors

      There are non-LLM forms of distribution, including traditional web search and human word of mouth. For some niche websites, a reduction in LLM-search users could be considered a positive community filter. If LLM scraper bots agree to follow longstanding robots.txt protocols, they can join the community of civilized internet participants.


    • You can still fine-tune though. I often run User-agent: *, Disallow: / together with User-agent: Googlebot, Allow: /, because I just don't care for Yandex or Baidu to crawl me for the one user per year they'll send (of course this depends on the region you're serving).

      That other thing is only a more extreme form of the same thing for those who don't behave. And when there's a clear value proposition in letting OpenAI ingest your content you can just allow them to.
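The allow-list described above can be written as a robots.txt at the site root. A minimal sketch, honored only by compliant crawlers; per the robots.txt convention, a crawler obeys the most specific User-agent group that matches it:

```
# Allow Google, disallow everyone else
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```

OpenAI's GPTBot could be allowed the same way, with its own User-agent group, if letting it ingest the site has a clear value proposition.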

  • Rather than garbage, perhaps just serve up something irrelevant and banal? Or splice together sentences from various random Project Gutenberg books? And add in a tarpit for good measure.

    At least in the end it gives the programmer one last hoorah before the AI makes us irrelevant :)
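The splicing idea above can be sketched as follows (a hypothetical helper, not anything from the thread): draw sentences at random from a pool of public-domain text and stitch them into a grammatical but meaningless paragraph to serve to scrapers. The corpus here is a stand-in; in practice it could be sentences pulled from Project Gutenberg books.

```python
import random

# Stand-in corpus of public-domain sentences (illustrative).
CORPUS = [
    "Call me Ishmael.",
    "It was the best of times, it was the worst of times.",
    "It is a truth universally acknowledged.",
    "All happy families are alike.",
]

def splice_paragraph(corpus, n_sentences=3, seed=None):
    """Return a banal but grammatical paragraph of randomly spliced sentences."""
    rng = random.Random(seed)
    return " ".join(rng.choice(corpus) for _ in range(n_sentences))

decoy = splice_paragraph(CORPUS, n_sentences=3, seed=42)
```

Unlike word-scrambling, the output is made of real sentences, so it is harder to filter out as obvious garbage.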

> OpenAI publishes IP ranges for their bots...

If blocking them becomes standard practice, how long do you think it'd be before they started employing third-party crawling contractors to get data sets?

  • Maybe they want sites that don't want to be crawled to block them, since it probably saves them a lawsuit down the road.

I imagine these companies today are curating their data with LLMs; this stuff isn't going to do anything.

  • That opens up the opposite attack though: what do you need to do to get your content discarded by the AI?

    I doubt you'd have much trouble passing LLM-generated text through their checks, and of course the requirements for you would be vastly different. You wouldn't need (near) real-time, on-demand work, or arbitrary input. You'd only need to (once) generate fake doppelganger content for each thing you publish.

    If you wanted to, you could even write this fake content yourself if you don't mind the work. Feed OpenAI all those rambling comments you had the clarity not to send.

  • You're right, this approach is too easy to spot. Instead, pass all your blog posts through an LLM to automatically inject grammatically sound inaccuracies.

    • Are you going to use the OpenAI API, or maybe set up a Meta model on an NVIDIA GPU? Haha

      Edit: I find it funny to buy hardware/compute only to fund the very thing you are trying to stop.


  • > I imagine these companies today are curating their data with LLMs, this stuff isn't going to do anything

    The same LLMs that are terrible at detecting AI-generated content? Randomly mangling words may be a trivially detectable strategy, so one should serve AI-scraper bots LLM-generated doppelganger content instead. Even OpenAI gave up on its AI-detection product.

  • Attackers don't have a monopoly on LLM expertise, defenders can also use LLMs for obfuscation.

    Technology arms races are well understood.

    • I hate LLM companies, so I guess I'm going to use the OpenAI API to "obfuscate" the content, or maybe buy an NVIDIA GPU to run a Llama model, hm, maybe on a GPU cloud.


Instead of nonsense you can serve a page explaining how you can ride a bicycle to the moon. I think we had a story about that attack on LLMs a few months ago, but I can't find it quickly enough.