Comment by CharlieDigital

1 year ago

The most straightforward use is as training data for an LLM.

Kind of makes me want to build a dynamic web server that spews plausible garbage to poison their training set. Probably a bit like peeing in the ocean though.

  • That's not too complicated. I did this with random text years ago, telling those spiders that honor robots.txt not to visit the starting URL. The dynamic URL led to "twisty little passages, all different", i.e. a tarpit.
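
    A minimal sketch of that kind of tarpit, using only Python's standard library (the /maze/ path, port, and word list here are made up for illustration):

      import http.server
      import random

      WORDS = ["twisty", "little", "passages", "all", "different", "maze"]

      class Tarpit(http.server.BaseHTTPRequestHandler):
          def do_GET(self):
              # Well-behaved spiders read this and stay out of the maze.
              if self.path == "/robots.txt":
                  self.send_response(200)
                  self.end_headers()
                  self.wfile.write(b"User-agent: *\nDisallow: /maze/\n")
                  return
              # Every URL yields random text plus links to more random
              # URLs, so a crawler that ignores robots.txt never runs
              # out of "pages" to fetch.
              rng = random.Random(self.path)  # same URL -> same page
              body = " ".join(rng.choices(WORDS, k=200))
              links = " ".join(
                  f'<a href="/maze/{rng.randrange(10**9)}">more</a>'
                  for _ in range(5)
              )
              page = f"<html><body><p>{body}</p>{links}</body></html>"
              self.send_response(200)
              self.send_header("Content-Type", "text/html")
              self.end_headers()
              self.wfile.write(page.encode())

      http.server.HTTPServer(("", 8000), Tarpit).serve_forever()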

    And it's really easy to generate random images with ImageMagick, even with random text on top to feed their OCR needs.
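
    For instance (a sketch assuming the ImageMagick "convert" binary is on the PATH; on ImageMagick 7 the binary is "magick"):

      import random
      import subprocess

      WORDS = ["twisty", "little", "passages", "all", "different"]
      caption = " ".join(random.choices(WORDS, k=4))

      # A 320x240 canvas of random noise with random words drawn on top.
      subprocess.run([
          "convert", "-size", "320x240", "xc:", "+noise", "Random",
          "-gravity", "center", "-fill", "white", "-pointsize", "24",
          "-annotate", "0", caption,
          "noise.png",
      ], check=True)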

  • Indeed. The dynamics created by Google do not favor this approach.

    If you spew garbage, your page gets de-ranked. If your page is de-ranked, it effectively doesn't "exist": unfindable by most of the world.

    So it's the classic rock-and-a-hard-place dilemma. For my part, I still strive to create high-quality, useful content because I hope it helps others. I've given up worrying about scrapers feeding LLMs and AI.