Comment by schmookeeg

6 months ago

I'm not as allergic to AI content as some (although I'm sure I'll get there) -- but I admire this analogy to low-background steel. Brilliant.

I am not allergic to it either (and I created the site). The idea was to keep track of stuff that we know humans made.

> I'm not as allergic to AI content as some

I suspect it's less about phobia, more about avoiding training AI on its own output.

This is actually something I'd been discussing with colleagues recently. Pre-AI content is only ever going to become more precious because it's one thing we can never make more of.

Ideally we'd have been cryptographically timestamping all data available in ~2015, but we are where we are now.
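A minimal sketch of what that timestamping could look like, assuming the simplest possible scheme: hash the content and record the digest alongside a UTC time (the function names and record format here are illustrative, not any real service's API):

```python
import hashlib
import json
from datetime import datetime, timezone

def timestamp_record(content: bytes) -> dict:
    """Hash the content and pair the digest with the current UTC time.

    In a real scheme the (digest, time) pair would be countersigned by a
    trusted timestamping authority or anchored in a public ledger; here
    we only build the record itself.
    """
    digest = hashlib.sha256(content).hexdigest()
    return {
        "sha256": digest,
        "timestamped_at": datetime.now(timezone.utc).isoformat(),
    }

def verify(content: bytes, record: dict) -> bool:
    """Anyone holding the content can later check it matches the record."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]

record = timestamp_record(b"Some 2015-era human-written text")
print(json.dumps(record, indent=2))
print(verify(b"Some 2015-era human-written text", record))  # True
print(verify(b"tampered text", record))                     # False
```

The point of the record is that a matching digest proves the exact bytes existed no later than the recorded time, provided the timestamp itself can be trusted, which is why real deployments involve a third party.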

  • One surprising thing to me is that using model outputs to train other/smaller models (distillation) is standard fare and seems to work quite well.

    So it seems to be less about not training AI on its own outputs and more about maintaining some overall quality bar for the content, AI-generated or otherwise.

    • Back in the early 2000s, when I was doing email filtering with naive Bayes in my POPFile email filter, one of the surprising results was that taking the filter's output as correct and retraining on a message as if a human had labelled it worked well.

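That self-training loop can be sketched in miniature. Below is a toy multinomial naive Bayes spam filter (the class, the tiny corpus, and all names are illustrative, not POPFile's actual code): it is first trained on human-labelled mail, then retrains on its own verdict for an unlabelled message as if a human had supplied the label:

```python
import math
from collections import Counter

class NaiveBayes:
    """Toy multinomial naive Bayes over bag-of-words features."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def score(self, text, label):
        # Log prior from document counts
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        n = sum(self.word_counts[label].values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the product
            logp += math.log((self.word_counts[label][w] + 1) / (n + vocab))
        return logp

    def classify(self, text):
        return max(("spam", "ham"), key=lambda lb: self.score(text, lb))

nb = NaiveBayes()
# Phase 1: train on human-labelled messages
for text, label in [
    ("cheap pills buy now", "spam"),
    ("win a free prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team", "ham"),
]:
    nb.train(text, label)

# Phase 2: self-training. Treat the filter's own verdict on a new,
# unlabelled message as if a human had labelled it, and retrain on it.
unlabelled = "free pills prize"
predicted = nb.classify(unlabelled)
nb.train(unlabelled, predicted)
print(predicted)  # spam
```

The surprising part the comment describes is phase 2: as long as the filter's verdicts are mostly right, feeding them back in as training labels reinforces the model rather than degrading it.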

  • >more about avoiding training AI on its own output.

    Exactly. The analogy I've been thinking of is applying some sort of image-processing filter over and over again to the point that it overpowers the whole image and all you see is the noise generated by the filter. I used to do this sometimes with Irfanview and its sharpen and blur filters.

    And I believe I've seen TikTok videos showing AI repeatedly iterating over an image, re-processing its own output with the same instructions, and seeming to converge on a style resembling a 1920s black-and-white cartoon.

    And I feel like there might be such a thing as a linguistic version of that. Even a conceptual version.
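The repeated-filter intuition above can be shown in miniature: iterating a simple blur (a 3-tap moving average) over a 1-D signal, the original detail washes out and every starting signal flattens toward the same constant. This is only a loose analogue of models converging on their own output, not a model of it:

```python
def blur(signal):
    """One pass of a 3-tap moving average with edge clamping."""
    n = len(signal)
    return [
        (signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]

signal = [0, 0, 10, 0, 0, 5, 0, 0]  # arbitrary "content"
for step in range(200):  # feed the filter its own output, repeatedly
    signal = blur(signal)

# After enough iterations the detail is gone: every value is close to
# the same constant and the original structure is unrecoverable.
print([round(x, 3) for x in signal])
```

Each pass keeps the output in the same format as the input, which is exactly what lets it be fed back in; the structure that survives many passes is the filter's own fixed point, not the original content.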

  • I'm worried about humans training on AI output. For example, a viral AI-generated image was made of a rare fish. The image is completely fake, yet when you search for that fish, that image is what comes up, repeatedly. It's hard to tell it's fake; it looks real. Content fabrication at scale has a lot of second-order impacts.

  • It’s about keeping distinct corpora of written material that was created by humans, for research purposes. You wouldn’t want to contaminate your human-language word-frequency databases with AI slop; the linguists of this world won’t like it.