Comment by ACCount36

6 months ago

No, I mean "model" AIs, created explicitly for dataset testing purposes.

You take small AIs of the same size and architecture, with the same pretraining dataset size. Pretrain some solely on samples from "2019 only", "2020 only", "2021 only" scraped datasets, and the others on samples from "2023 only" and "2024 only" datasets. Then you run RLHF and test the resulting AIs on benchmarks.
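
A rough sketch of what that like-for-like setup could look like (all names and numbers here are hypothetical placeholders, not the actual experiment):

```python
# Hypothetical sketch of the year-sliced ablation (pretrain/rlhf_finetune/
# run_benchmarks are stand-ins for a real training stack, not a real API).
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    scrape_year: int   # the only variable between runs
    n_params: int      # held constant: same architecture and size
    n_tokens: int      # held constant: same pretraining token budget
    seed: int          # a few seeds per year to average out run-to-run noise

def pretrain(cfg: RunConfig):
    raise NotImplementedError  # placeholder for the actual pretraining run

def rlhf_finetune(model):
    raise NotImplementedError  # placeholder for the RLHF stage

def run_benchmarks(model) -> dict[str, float]:
    raise NotImplementedError  # placeholder for the benchmark harness

def experiment_matrix() -> list[RunConfig]:
    # "2019 only" ... "2024 only" scrapes, same size/architecture/token budget
    return [
        RunConfig(scrape_year=year, n_params=125_000_000,
                  n_tokens=2_500_000_000, seed=seed)
        for year in (2019, 2020, 2021, 2023, 2024)
        for seed in range(3)
    ]

if __name__ == "__main__":
    for cfg in experiment_matrix():
        print(cfg)  # the training and eval calls above are only stubs
```

The scrape year is the only axis that varies; model size, token budget, the RLHF stage, and the benchmark suite stay fixed across runs.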

The latter AIs tend to perform slightly better. It's a small but noticeable effect. Plenty of hypotheses on why, none confirmed outright.

You're right that the performance of frontier AIs keeps improving, which is a weak strike against the idea that AI contamination hurts AI training runs. Like-for-like testing is a strong strike.

I can understand that years before ChatGPT wouldn't have any LLM-generated text, but how strongly does the year actually correlate with how much LLM text is in the dataset? Wouldn't special-purpose datasets with varying ratios of human and LLM text be better for testing the effects of "AI contamination"?
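
To make that concrete, I'm imagining something like this (purely illustrative; the helper and the ratios are hypothetical, not an existing setup):

```python
# Hypothetical sketch: hold the scrape year fixed and vary only the fraction
# of LLM-generated text mixed into the pretraining corpus.
import random

def build_mixture(human_docs: list[str], llm_docs: list[str],
                  llm_fraction: float, n_docs: int, seed: int = 0) -> list[str]:
    # Sample a corpus of n_docs documents with a controlled LLM-text ratio.
    rng = random.Random(seed)
    n_llm = round(n_docs * llm_fraction)
    corpus = rng.sample(llm_docs, n_llm) + rng.sample(human_docs, n_docs - n_llm)
    rng.shuffle(corpus)
    return corpus

# One otherwise-identical pretraining run per contamination level:
LLM_FRACTIONS = [0.0, 0.1, 0.25, 0.5]
```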

  • Not if the goal is to test the quality of real datasets, and that was the goal.

    Getting this weird finding, that newer datasets generally outperform older ones, was more of a side effect of having a dataset evaluation system.

    If you're trying to examine AI contamination specifically? There are many variables, and trying to capture them all in a laboratory dataset is rather involved.

    For one, AI data out in the wild is "enriched": it's very likely to be selected by users before being published (a human picking the best of 4 generations?), it can gather human interaction like likes and comments, and it's more likely to get spread around if it's novel, amusing, or high quality than if it's low quality, generic, and bland. How do you replicate that in a lab setup? On a tight budget?
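
    As a toy illustration of just the first of those selection effects (a pure simulation, not real data): if users only publish their favourite of 4 generations, the published text is already quality-shifted relative to what the model actually emits.

    ```python
    # Toy simulation of "best of 4" user selection (hypothetical, no real data).
    # "Quality" is an abstract score; the point is only the selection shift.
    import random
    import statistics

    random.seed(0)

    def raw_quality() -> float:
        # stand-in for the quality of one raw model output
        return random.gauss(0.0, 1.0)

    def published_quality(best_of: int = 4) -> float:
        # the user generates `best_of` candidates and publishes their favourite
        return max(raw_quality() for _ in range(best_of))

    raw = [raw_quality() for _ in range(100_000)]
    published = [published_quality() for _ in range(100_000)]

    print(f"mean raw quality:       {statistics.mean(raw):+.3f}")
    print(f"mean published quality: {statistics.mean(published):+.3f}")
    # For best-of-4 over a standard normal, the published mean lands around +1.03,
    # a selection shift that a naive lab mixture of raw generations wouldn't have.
    ```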