At least in some cases, there seems to be a move toward training on more synthetic data and strictly curated data, especially for smaller models where knowledge can't be extremely broad, because there just isn't enough room to store the world in tens or hundreds of gigabytes of model weights. So, to achieve higher-quality reasoning, the training has to be focused and the data has to be very high-quality and high-density.
With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.
Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.
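For a rough sense of the "tens or hundreds of gigabytes" figure, here's a back-of-the-envelope sketch. The parameter counts and bit widths are made-up illustrations, not the specs of any particular model, and `weights_size_gb` is just a hypothetical helper that counts raw weights and nothing else.

```python
def weights_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes (in parameters) and storage precisions (in bits).
    for num_params in (8e9, 70e9, 400e9):
        for bits in (16, 4):
            size = weights_size_gb(num_params, bits)
            print(f"{num_params / 1e9:.0f}B params at {bits}-bit: ~{size:.0f} GB")
```

Whatever the exact numbers, that budget is tiny next to the size of the web, which is the point about needing dense, curated data.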
> it maybe doesn't even matter that the models are using older data.
This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.
Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness — until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?
That's a different problem than I thought you were worried about. I wasn't considering the marketing angle, though that is certainly relevant and a risk to consider, especially when it comes to Google, whose primary businesses are ads and surveillance.
Can you explain what you mean?
Pre-trained LLMs risk becoming impossible to update with data from after 2025, since much of it is contaminated with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what gets included.
Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.
It may not be mainly or solely due to LLM pollution, but rather the fact that every publisher, (social) media company, newspaper, etc. clammed up and started charging (licensing) fees sometime in the last couple of years.
So maybe there's just not much openly available and new content worth training on that wasn't available prior to 2025.
Considering all models can use search engines, is this really relevant?
But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.
If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.
The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.
Might it indicate that core model training and pre-training are really slowing down?
Also, parsing is harder, and much more of the new data is being generated by AI itself.
Still, the cutoff is very concerning and inconvenient.
I thought that was a choice that Google made?
You really shouldn't have them pulling facts from their weights; they need grounding from real data sources.
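To make "grounding" concrete, here's a minimal sketch of the idea, assuming you already have search results in hand. `Snippet` and `build_grounded_prompt` are hypothetical names made up for illustration, not any vendor's API, and the example source is invented. It only assembles the prompt; the search and the model call are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snippet:
    source: str  # where the text came from (hypothetical field)
    date: str    # publication date, so post-cutoff material is visible as such
    text: str


def build_grounded_prompt(question: str, snippets: List[Snippet]) -> str:
    """Put retrieved text in front of the question so the model answers from it
    instead of from whatever its weights remember."""
    lines = [
        "Answer using only the numbered sources below.",
        "If they do not cover the question, say so.",
        "",
    ]
    for i, s in enumerate(snippets, start=1):
        lines.append(f"[{i}] {s.source} ({s.date}): {s.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented example data, standing in for real search results.
    results = [Snippet("example.com/changelog", "2025-11-02", "Version 2.0 adds feature X.")]
    print(build_grounded_prompt("What changed in version 2.0?", results))
```

This doesn't solve the discovery problem raised upthread (the search still has to surface your product), but it does take the training cutoff out of the answer itself.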