Comment by CharlieDigital

1 year ago

The most straightforward use is as training data for an LLM.

Kind of makes me want to build a dynamic web server that spews plausible garbage to poison their training set. Probably a bit like peeing in the ocean though.

  • That's not too complicated. I did this with random text years ago, telling those spiders that honor robots.txt not to visit the starting URL. The dynamic URL led to "twisty little passages, all different", i.e. a tarpit.
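
    A minimal sketch of that kind of tarpit, using only Python's standard library (the /maze/ path, port, and word list here are made up for illustration):

      import http.server
      import random

      WORDS = ["twisty", "little", "passages", "all", "different", "maze"]

      class Tarpit(http.server.BaseHTTPRequestHandler):
          def do_GET(self):
              # Well-behaved spiders read this and stay out of the maze.
              if self.path == "/robots.txt":
                  self.send_response(200)
                  self.end_headers()
                  self.wfile.write(b"User-agent: *\nDisallow: /maze/\n")
                  return
              # Every URL yields random text plus links to more random
              # URLs, so a crawler that ignores robots.txt never runs
              # out of "pages" to fetch.
              rng = random.Random(self.path)  # same URL -> same page
              body = " ".join(rng.choices(WORDS, k=200))
              links = " ".join(
                  f'<a href="/maze/{rng.randrange(10**9)}">more</a>'
                  for _ in range(5)
              )
              page = f"<html><body><p>{body}</p>{links}</body></html>"
              self.send_response(200)
              self.send_header("Content-Type", "text/html")
              self.end_headers()
              self.wfile.write(page.encode())

      http.server.HTTPServer(("", 8000), Tarpit).serve_forever()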

    And it's really easy to generate random images with ImageMagick, even with random text on top to feed their OCR needs.
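
    For instance (a sketch assuming the ImageMagick "convert" binary is on the PATH; on ImageMagick 7 the binary is "magick"):

      import random
      import subprocess

      WORDS = ["twisty", "little", "passages", "all", "different"]
      caption = " ".join(random.choices(WORDS, k=4))

      # A 320x240 canvas of random noise with random words drawn on top.
      subprocess.run([
          "convert", "-size", "320x240", "xc:", "+noise", "Random",
          "-gravity", "center", "-fill", "white", "-pointsize", "24",
          "-annotate", "0", caption,
          "noise.png",
      ], check=True)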

  • Indeed. The dynamics created by Google do not favor this approach.

    If you spew garbage, your page gets de-ranked. If your page is de-ranked, it effectively doesn't "exist": unfindable by most of the world.

    So it's the classic rock-and-a-hard-place dilemma. For my part, I still strive to create high-quality, useful content because I hope it helps others. I've given up worrying about scrapers feeding LLMs and AI.