
Comment by ACCount36

6 months ago

Currently, there is no reason to believe that "AI contamination" is a practical issue for AI training runs.

AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.

Yeah, the thinking behind the "low-background steel" concept is that AI training on synthetic data could lead to a "model collapse" that renders the AIs completely mad and useless. Either that didn't happen, or all the AI companies internally have working filters to sieve out AI data. I'd bet on the former. I still think there's a chance of model collapse happening to humans after too much exposure to AI-generated data, but that's just my anecdotal observation and gut feeling.

> AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.

This is really bad reasoning for a few reasons:

1) We've gotten much better at training LLMs since 2022. The negative impact of AI slop in the training data certainly doesn't outweigh the benefits of orders of magnitude more parameters and better training techniques, but that doesn't mean the slop has no impact at all.

2) "Outperform" is a very loose term, and we still have no good way to measure it meaningfully. We can all tell that Gemini 2.5 outperforms GPT-4o. What's trickier is distinguishing between Gemini 2.5 and Claude 4. The expected effect size of slop at this stage would be on that smaller scale of differences between same-generation models.

Given that we're looking for a small enough effect size that we know we're going to have a hard time proving anything with data, I think it's reasonable to operate from first principles in this case. First principles say very clearly that avoiding training on AI-generated content is a good idea.

  • No, I mean "model" AIs, created explicitly for dataset testing purposes.

    You take small AIs of the same size and architecture, with the same pretraining dataset size. Pretrain some solely on skims from "2019 only", "2020 only", "2021 only" scraped datasets, and the others on skims from "2023 only" and "2024 only". Then you run RLHF and test the resulting AIs on benchmarks. (Rough sketch of that kind of setup below.)

    The latter AIs tend to perform slightly better. It's a small but noticeable effect. Plenty of hypotheses on why, none confirmed outright.

    You're right that performance of frontier AIs keeps improving, which is a weak strike against the idea of AI contamination hurting AI training runs. Like-for-like testing is a strong strike.
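
    Roughly this kind of harness, as an illustrative sketch only: the file layout and field names ("text", "crawl_date") are made up rather than anyone's actual pipeline, and the pretraining/RLHF/eval steps themselves are only indicated in comments.

      import json

      def build_year_slice(corpus_path: str, year: int, out_path: str) -> int:
          """Keep only documents crawled in a given year, so each small
          pretraining set differs only in vintage."""
          kept = 0
          with open(corpus_path, encoding="utf-8") as src, \
               open(out_path, "w", encoding="utf-8") as dst:
              for line in src:
                  doc = json.loads(line)
                  # assumed schema: {"text": ..., "crawl_date": "YYYY-MM-DD"}
                  if int(doc["crawl_date"][:4]) == year:
                      dst.write(json.dumps({"text": doc["text"]}) + "\n")
                      kept += 1
          return kept

      # One slice per vintage; downstream, pretrain identical small models
      # (same architecture, same token budget) on each slice, run the same
      # RLHF and eval recipe, and compare benchmark scores across years.
      for year in (2019, 2020, 2021, 2023, 2024):
          n = build_year_slice("scrape.jsonl", year, f"slice_{year}.jsonl")
          print(year, n, "docs kept")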

    • I can understand that data from years before ChatGPT would not contain any LLM-generated text, but how strongly does the year actually correlate with how much LLM text is in the dataset? Wouldn't special-purpose datasets with varying ratios of human and LLM text be better for testing the effects of "AI contamination"? Something like the sketch below.

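      (Purely illustrative: the file names, the "text" field, and the assumption that you already have cleanly separated human and LLM-generated pools are all hypothetical.)

        import json
        import random

        def load_texts(path: str) -> list[str]:
            # assumed: one JSON object per line with a "text" field
            with open(path, encoding="utf-8") as f:
                return [json.loads(line)["text"] for line in f]

        def mix_corpus(human: list[str], llm: list[str], llm_fraction: float,
                       total: int, seed: int = 0) -> list[str]:
            """Fixed-size training set with a known share of LLM-generated text,
            so the contamination ratio is the only variable between runs."""
            rng = random.Random(seed)
            n_llm = round(total * llm_fraction)
            docs = rng.sample(llm, n_llm) + rng.sample(human, total - n_llm)
            rng.shuffle(docs)
            return docs

        human = load_texts("human_only.jsonl")   # e.g. a pre-2022 scrape
        llm = load_texts("llm_generated.jsonl")  # known model output
        for frac in (0.0, 0.1, 0.25, 0.5):
            mixed = mix_corpus(human, llm, frac, total=100_000)
            print(f"{frac:.0%} LLM text -> {len(mixed)} docs")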

I don't think people have really gotten started on generating slop yet; I expect it to increase by a lot.