Comment by uncircle

5 days ago

> Does this mean that eventually in a world where we all use this stuff, no new language/framework/library will ever be able to emerge?

That's a very good question.

Rephrased: as good training data diminishes exponentially with the Internet being inundated by LLM regurgitations, will "AI savvy" coders prefer old, boring languages and tech because there's more low-radiation training data from the pre-LLM era?

The most popular language/framework combination of the early 2020s is JavaScript/React. It'll be the new COBOL, but you won't need an expensive consultant to maintain it in the 2100s, because LLMs can do it for you.

Corollary: to escape the AI craze, let's keep inventing new languages. Lisps with pervasive macro usage and custom DSLs will be safe until actual AGIs arrive that can macroexpand better than you can.

> Rephrased: as good training data diminishes exponentially with the Internet being inundated by LLM regurgitations

I don't think the premise is accurate in this specific case.

First, if anything, training data for newer libs can only increase. Presumably code reaches GitHub in an "at least it compiles" state. So you have lots of people fighting the AIs and pushing code that at least compiles. You can then filter for the newer libs and train on that.
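That filtering step can be sketched in a few lines. The corpus, file names, and `newlib` library here are all hypothetical stand-ins for illustration, not a real pipeline:

```python
# Sketch of the filtering idea: from a corpus of "at least it compiles"
# files, keep only those that import a target newer library, so they can
# be used as training data for that library. All names are illustrative.
import re

corpus = {
    "a.py": "import newlib\nnewlib.run()",
    "b.py": "import os\nprint(os.getcwd())",
    "c.py": "from newlib import widgets",
}

def uses_lib(src: str, lib: str) -> bool:
    """Crude check: does any line import the target library?"""
    pattern = rf"^\s*(import\s+{lib}\b|from\s+{lib}\b)"
    return any(re.match(pattern, line) for line in src.splitlines())

# Keep only the files that exercise the newer library.
training_set = {name: src for name, src in corpus.items()
                if uses_lib(src, "newlib")}
print(sorted(training_set))  # ['a.py', 'c.py']
```

A real pipeline would of course parse imports properly and also verify the code compiles, but the shape of the idea is the same.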

Second, pre-training is already mostly solved. The pudding now seems to be in post-training. And for coding, a lot of post-training is done with RL and other unsupervised techniques. You get enough signal from generate -> check loops to do that reliably.
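A generate -> check loop can be sketched minimally. The `candidates` list below stands in for a hypothetical model's sampled outputs, and the check is just "does it compile"; real post-training pipelines use richer verifiers (tests, type checks), but the reward structure is the same:

```python
# Minimal sketch of a generate -> check loop: candidates that pass an
# automated check become the reward signal (or curated training data).
candidates = [
    "def add(a, b): return a + b",       # compiles
    "def add(a, b) return a + b",        # syntax error: missing colon
    "def mul(a, b):\n    return a * b",  # compiles
]

def check(src: str) -> bool:
    """Cheap automated verifier: does the snippet even compile?"""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

# In RL post-training, these pass/fail outcomes are the training signal.
accepted = [src for src in candidates if check(src)]
print(len(accepted))  # 2 of the 3 snippets compile
```

The point is that the signal is generated automatically, with no human labels needed, which is why "running out of human data" matters less here.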

The idea that "we're running out of data" is way overblown IMO, especially considering the advances we've seen in the last ~6mo-1y. Keep in mind that the better your "generation" pipeline becomes, the better later models will be. And the current "agentic" loop-based systems are getting pretty darn good.

  • > First, if anything, training data for newer libs can only increase.

    How?

    Presumably, in the "every coder is using AI assistants" future, there will be an incredible amount of friction in getting people to adopt languages that AI assistants don't know anything about.

    So how does the training data for a new language get made, if no programmers are using the language, because the AI tools that all programmers rely on aren't trained on the language?

    The snake eating its own tail

    • You can code today with new libs; you just need to tell the model what to use. Things like context7 work, as does downloading docs, llms.txt, or any other approach that pops up in the future. The idea that LLMs can only generate what they were trained on is like 3 years old. They can do pretty neat things with stuff in context today.

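The "tell the model what to use" approach amounts to prepending docs to the prompt. A hedged sketch, where the `frobnicate` API, the file contents, and the prompt shape are all invented for illustration:

```python
# Sketch: an LLM can use a library absent from its training data if the
# library's docs are placed in its context window. Everything named here
# (frobnicate, the prompt wording) is a hypothetical example.

def build_prompt(docs: str, request: str) -> str:
    """Prepend library docs (e.g. a downloaded llms.txt) to the user request."""
    return (
        "You are a coding assistant. Use ONLY the API described below.\n\n"
        f"--- library docs ---\n{docs}\n--- end docs ---\n\n"
        f"Task: {request}\n"
    )

docs = "frobnicate(x: int) -> int  # doubles x (brand-new, post-cutoff lib)"
prompt = build_prompt(docs, "write a function that frobnicates every list item")
print("frobnicate" in prompt)  # True: the model sees the new API in context
```

Tools like context7 automate the retrieval step, but the principle is just this: the docs ride along in context, so the model never needed to be trained on the library.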