Comment by heywire

1 year ago

Kind of makes me want to build a dynamic web server that spews plausible garbage to poison their training set. Probably a bit like peeing in the ocean though.

That's not too complicated. I did this with random text years ago, using robots.txt to tell the spiders that honor it not to visit the starting URL. The dynamic URL led to "twisty little passages, all different", i.e. a tarpit.
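For anyone curious what that looks like, here is a rough sketch of the idea, not the setup I actually ran. The `/maze/` prefix, the word list, and the names `ROBOTS_TXT` and `page_for` are all made up for illustration. Each page is seeded from its own URL, so a given address always returns the same garbage and nothing needs to be stored.

```python
import hashlib
import random

# Hypothetical layout: everything under /maze/ is the tarpit.
# Spiders that honor robots.txt are told to stay out of it.
ROBOTS_TXT = "User-agent: *\nDisallow: /maze/\n"

# Placeholder vocabulary; any word list would do.
WORDS = ["lamp", "grate", "cavern", "passage", "twisty", "little",
         "maze", "brook", "forest", "keys", "bottle", "rod"]


def page_for(path: str, n_words: int = 40, n_links: int = 5) -> str:
    """Return an HTML page of random words plus links deeper into the maze."""
    # Seed the generator from the path so the same URL always gives the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    # Every link points at another generated URL, so the crawl never ends.
    links = "".join(
        f'<a href="/maze/{rng.getrandbits(48):012x}">{rng.choice(WORDS)}</a>\n'
        for _ in range(n_links)
    )
    return f"<html><body><p>{text}</p>\n{links}</body></html>"
```

Hook `page_for(request_path)` up to whatever web server you already run and serve `ROBOTS_TXT` at `/robots.txt`. Only the crawlers that ignore robots.txt end up wandering the maze.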

And it's really easy to generate random images with ImageMagick, even with random text on top to feed their OCR needs.
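Something along these lines should do it, as a sketch only. The helper name `noise_image_command` and its defaults are invented here, and it assumes ImageMagick 7's `magick` binary (on ImageMagick 6 pass `magick="convert"`). It only builds the command line, so running it is up to you.

```python
def noise_image_command(out_path, text, width=320, height=120, magick="magick"):
    """Build an ImageMagick command that writes a noise image with text on top."""
    return [
        magick,
        "-size", f"{width}x{height}",
        "xc:gray",              # plain canvas to start from
        "+noise", "Random",     # fill the canvas with random noise
        "-gravity", "center",
        "-pointsize", "24",
        "-fill", "black",
        "-annotate", "+0+0", text,  # draw the text in the middle
        out_path,
    ]


cmd = noise_image_command("garbage.png", "twisty little passages")
# To actually render the image (requires ImageMagick to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Feed it the same random text as the pages and each image comes out different.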

Indeed. The dynamics created by Google do not favor this approach.

If you spew garbage, your page gets de-ranked. If your page is de-ranked, the page doesn't "exist" (effectively un-findable by most of the world).

So it's the classic rock and a hard place. For me, I still strive to create high quality, useful content because I hope it helps others. I've given up concerning myself with scrapers feeding LLMs and AI.