Comment by NitpickLawyer

5 days ago

> Rephrased: as good training data will diminish exponentially with the Internet being inundated by LLM regurgitations

I don't think the premise is accurate in this specific case.

First, if anything, training data for newer libs can only increase. Presumably code reaches GitHub in an "at least it compiles" state. So you have lots of people fighting the AIs and pushing code that at least compiles. You can then filter for the newer libs and train on that.
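A minimal sketch of that filtering idea, assuming a hypothetical new library name and using byte-compilation as the "at least it compiles" check (the real data pipelines would be far more involved; this is only illustrative):

```python
import py_compile
import tempfile
from pathlib import Path

NEW_LIB = "shinynewlib"  # hypothetical library name


def compiles(source: str) -> bool:
    """Cheap 'at least it compiles' check: does the file byte-compile?"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False
    finally:
        Path(path).unlink(missing_ok=True)


def uses_new_lib(source: str) -> bool:
    """Crude filter: keep files that import the new library."""
    return f"import {NEW_LIB}" in source or f"from {NEW_LIB}" in source


def build_training_set(files: list[Path]) -> list[str]:
    """Keep only sources that both use the new lib and compile."""
    return [
        src for path in files
        if uses_new_lib(src := path.read_text(errors="ignore")) and compiles(src)
    ]
```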

Second, pre-training is already mostly solved. The pudding now seems to be in post-training. And for coding, a lot of post-training is done with RL and other techniques that don't rely on human-curated data: generate -> check loops give you enough signal to do that reliably.
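A rough sketch of what such a generate -> check loop can produce, assuming a hypothetical `model.generate()` call and using compile/test results as the reward signal (actual post-training pipelines are not public, so treat this as illustrative only):

```python
import subprocess


def check(code: str) -> float:
    """Reward signal: does the generated code compile and pass the tests?"""
    with open("candidate.py", "w") as f:
        f.write(code)
    # Compile check
    if subprocess.run(["python", "-m", "py_compile", "candidate.py"]).returncode != 0:
        return 0.0
    # Test check (assumes a pytest suite exists in the working directory)
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.5  # partial credit for compiling


def collect_rl_signal(model, prompts, samples_per_prompt=4):
    """Generate several candidates per prompt and score each with the checker.
    The (prompt, candidate, reward) triples are what a post-training step
    (policy-gradient update, rejection sampling, etc.) would consume."""
    triples = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            candidate = model.generate(prompt)  # hypothetical API
            triples.append((prompt, candidate, check(candidate)))
    return triples
```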

The idea that "we're running out of data" is way overblown IMO, especially given the advances of the last ~6-12 months. Keep in mind that the better your generation pipeline becomes, the better later models will be. And the current agentic, loop-based systems are getting pretty darn good.

> First, if anything, training data for newer libs can only increase.

How?

Presumably, in the "every coder is using AI assistants" future, there will be an incredible amount of friction to getting people to adopt languages that AI assistants don't know anything about.

So how does the training data for a new language get made, if no programmers are using the language because the AI tools that all programmers rely on aren't trained on it?

The snake eating its own tail.

  • You can code today with new libs; you just need to tell the model what to use. Things like context7 work, as does downloading the docs, an llms.txt, or whatever else pops up in the future (see the sketch after this sub-thread). The idea that LLMs can only generate what they were trained on is about three years out of date. They can do pretty neat things with stuff in context today.

    • The context would have to be massive in order to ingest an entire new programming language plus its associated design patterns, best practices and so on, wouldn't it?

      I'm not an expert here by any means, but I'm not seeing how this makes much sense versus just using languages the LLM is already trained on.
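On the "tell the model what to use" point above: a minimal sketch of the idea, using a hypothetical OpenAI-style chat call. The model name, docs path, and library are placeholders, and this is not a real context7 integration; it just shows that feeding the new library's docs in-context is a plain prompt-construction step, not a retraining step:

```python
from pathlib import Path

from openai import OpenAI  # any chat-completion style client works the same way

client = OpenAI()


def ask_with_docs(question: str, docs_path: str) -> str:
    """Prepend the new library's docs (e.g. a downloaded llms.txt) to the prompt
    so the model can use APIs it never saw during training."""
    docs = Path(docs_path).read_text()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the library documentation below.\n\n" + docs},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# e.g. ask_with_docs("Write a hello-world using shinynewlib", "llms.txt")
```

Whether that scales to a whole new language with its own idioms, rather than a single new library, is exactly the open question raised in the reply above.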