Comment by hosel

10 hours ago

Can you explain what you mean?

Pre-trained LLMs risk becoming impossible to update with data from after 2025, as much of it is contaminated with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what gets included.

Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.

  • It may not be mainly or solely due to LLM pollution, but rather the fact that every publisher, (social) media company, newspaper, etc. clammed up and started charging (licensing) fees sometime in the last couple of years.

    So maybe there's just not much openly available and new content worth training on that wasn't available prior to 2025.

  • Considering all models can use search engines, is this really relevant?

    • This is not meant as an insult, but have you actually LLM/vibe coded anything that used a fast(-ish) moving library or framework? Try asking your favorite LLM with, say, a Jan 2025 knowledge cutoff (or pretraining data cutoff, whatever you want to call it) to work on something using a framework that had a big rewrite later that year (which would make it one year old now, which is like ages in the LLM coding era)... It's a nightmare full of wrestling with the LLM. You tell it the version of the framework and that it changed a lot from the previous version, and yadda yadda, long story short: further down the thread, when context runs out and/or is compressed, it begins to forget detailed instructions and just falls back to pulling out old patterns it "remembers" from pretraining. So you need to constantly remind it what you're working with: "oh hey, this doesn't work because we're working with React Router v7 in framework mode, remember? Not React Router v6". Or try to use the latest non-LTS/breaking version of a library: at first it looks it up online, but again, as you get deeper into the weeds and little details, the struggle begins.

      So, as far as I'm concerned, training cutoff is still a big deal.

    • Until they prefer not to search. Let me explain using the example of the open-source security framework (1) our team is working on.

      If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.

      The answer is: without being in the training data, LLMs basically don't understand what they're searching for.

      1. https://github.com/tirrenotechnologies/tirreno

  • But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.

    If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.

    The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.

    • Taking token usage at places like OpenRouter as a proxy for overall production, we're looking at exponential growth in AI-created content. Weekly token usage there has tripled in just the past 3 months.

Might it indicate that core model training and pre-training are really slowing down?

  • Also, parsing is harder, and so much more of the new data is being generated by AI itself.

    Still, the cutoff is very concerning and inconvenient.