Comment by alansaber

3 days ago

I think not, if only because the quantity of old data isn't enough to train anywhere near a SOTA model, at least not until we change some fundamentals of LLM architecture.

Are you saying it wouldn't be able to converse using the English of the time?

  • Machine learning today requires an obscene quantity of examples to learn anything.

    SOTA LLMs show quite a lot of skill, but only after reading a significant fraction of all published writing (and perhaps images and videos, I'm not sure) across all languages, in a world whose population is about 5 times higher than at the link's cut-off date, and where global literacy has gone from roughly 20% to about 90% since then.

    Computers can only make up for this by being really, really fast: what would take a human a million or so years to read, a server room can pump through a model's training stage in a matter of months (a rough back-of-the-envelope check of that figure is sketched at the end of this thread).

    When the data isn't there, reading what it does have really quickly isn't enough.

  • That's not what they are saying. SOTA models are trained on much more than just language, and the scale of the training data is related to their "intelligence". Restricting the corpus in time => less training data => less intelligence => less ability to "discover" new concepts not in the training data.

    • Could always train them on data up to 2015ish and then see if you can rediscover LLMs. There's plenty of data.

I mean, humans didn't need to read billions of books back then to think of quantum mechanics.

  • Which is why I said it's not impossible, but current LLM architecture is just not good enough to achieve this.
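
As a sanity check on the "million or so years" figure mentioned above, here is a rough back-of-the-envelope sketch. All the numbers in it (a corpus of ~15 trillion tokens, ~0.75 English words per token, a reading speed of 250 words per minute) are assumptions chosen for illustration, not figures from the thread.

```python
# Back-of-the-envelope check of the "million years of reading" claim.
# Every constant below is an assumption for illustration only.

CORPUS_TOKENS = 15e12    # assumed ~15 trillion tokens, roughly the scale of recent LLM pretraining corpora
WORDS_PER_TOKEN = 0.75   # assumed rough conversion from tokens to English words
READING_WPM = 250        # assumed adult reading speed, words per minute

corpus_words = CORPUS_TOKENS * WORDS_PER_TOKEN
minutes_to_read = corpus_words / READING_WPM
years_to_read = minutes_to_read / (60 * 24 * 365)

print(f"Corpus size: {corpus_words:.2e} words")
print(f"Non-stop human reading time: {years_to_read:,.0f} years")
```

Under these assumptions that comes out to roughly 86,000 years of round-the-clock reading; at a few hours of reading a day it lands in the hundreds of thousands of years, the same ballpark as the comment's estimate.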