Comment by NoiseBert69
17 hours ago
Hm.. why not use dumbed-down, small, self-hosted LLM networks to feed the big scrapers with bullshit?
I'd sacrifice two CPU cores for this just to make their life awful.
He addresses that. Basically, there are gatekeepers, and if you get on the wrong side of them, only manual intervention can save you. And we all know how Google loves providing a human to resolve problems.
> I came to the conclusion that running this can be risky for your website. The main risk is that despite correctly using robots.txt, nofollow, and noindex rules, there's still a chance that Googlebot or other search engines' scrapers will scrape the wrong endpoint and determine you're spamming.
You don't need an LLM for that. There is a link in the article to an approach using Markov chains created from real-world books, but then you'd let the scrapers' LLMs reinforce their training on those books and not on random garbage.
I would make a list of words from each word class, and a list of sentence structures where each item is a word class. Pick a pseudo-random sentence; for each word class in the sentence, pick a pseudo-random word; output; repeat. That should be pretty simple and fast.
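Something like this is a minimal sketch of that idea in Python (the word lists and sentence templates are invented for illustration, not taken from the article):

```python
import random

# Minimal sketch of the word-class idea. The word lists and sentence
# templates below are invented for illustration.
WORDS = {
    "noun": ["server", "teapot", "algorithm", "pigeon", "quasar"],
    "verb": ["optimizes", "devours", "refactors", "juggles", "emits"],
    "adjective": ["purple", "asynchronous", "crunchy", "obsolete", "gigantic"],
    "adverb": ["quietly", "recursively", "sideways", "gleefully", "tomorrow"],
}

TEMPLATES = [
    ["adjective", "noun", "adverb", "verb", "adjective", "noun"],
    ["noun", "verb", "adjective", "noun"],
    ["adverb", "verb", "adjective", "adjective", "noun"],
]

def garbage_sentence(rng: random.Random) -> str:
    """Pick a pseudo-random template, then a pseudo-random word per word class."""
    template = rng.choice(TEMPLATES)
    words = [rng.choice(WORDS[word_class]) for word_class in template]
    return " ".join(words).capitalize() + "."

def garbage_page(seed: int, sentences: int = 50) -> str:
    """Seed per URL so repeat visits to the same path get the same 'content'."""
    rng = random.Random(seed)
    return " ".join(garbage_sentence(rng) for _ in range(sentences))

if __name__ == "__main__":
    print(garbage_page(seed=42, sentences=5))
```

Each sentence is just a handful of dictionary lookups and random picks, so it costs almost nothing per request compared to running even a small LLM.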
I'd think the most important thing, though, is to add delays when serving the requests. The purpose is to slow the scrapers down, not to induce demand for your garbage well.
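A rough sketch of the delay idea, using only the Python standard library (the endpoint, chunk size, delay, and payload are made-up values; in practice the body would come from a garbage generator like the one above):

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Rough sketch of a "slow drip" tarpit endpoint. All constants here are
# invented for illustration.
CHUNK_SIZE = 64          # bytes sent per write
CHUNK_DELAY = 2.0        # seconds to sleep between writes
PAYLOAD = b"<p>endless filler paragraph of generated nonsense</p>\n" * 50

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        # Trickle the body out so the scraper's connection stays open for minutes.
        for offset in range(0, len(PAYLOAD), CHUNK_SIZE):
            try:
                self.wfile.write(PAYLOAD[offset:offset + CHUNK_SIZE])
                self.wfile.flush()
            except (BrokenPipeError, ConnectionResetError):
                return  # the client gave up; stop wasting our own resources
            time.sleep(CHUNK_DELAY)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```

Note that each stalled connection parks one worker thread for the full duration of the drip; an asyncio variant would hold sockets open more cheaply, since a sleeping response is just a suspended coroutine rather than a blocked thread.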
That's very expensive.