Comment by SoKamil

3 months ago

How can it compete with the vast amount of code on GitHub that models have already been trained on? For LLMs, more data equals better results, so people will naturally be drawn to the better completions available for already-established frameworks and languages. It would be hard to produce organic data covering all the ways your technology can be (ab)used.

Allegedly, one of the ways they've been training LLMs to get better at logic, reasoning, and factual accuracy is to use LLMs themselves to generate synthetic training data. The idea here would be similar: generate synthetic training data for the new language or framework. The LLM itself could aid that generation, perhaps with a "playground" of some sort where it could compile / run / render various things, to help sort out what works and what doesn't (as well as, if you see error X, what the problem might be).
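
As a rough sketch of what that filtering loop could look like, here's a minimal Python illustration. Note that llm_generate is a hypothetical stand-in for whatever model API you'd actually call, and the "playground" here is just a subprocess sandbox; the point is only that candidate snippets get run, and both the working examples and the (error, diagnosis) pairs become synthetic training data.

    import subprocess
    import tempfile
    import json

    def llm_generate(prompt: str) -> str:
        """Hypothetical stand-in for a call to whatever model you're using."""
        raise NotImplementedError("plug in your model API here")

    def run_candidate(source: str, timeout: int = 10) -> tuple[bool, str]:
        """Run a candidate snippet in a throwaway file and return
        (succeeded, combined stdout/stderr)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        proc = subprocess.run(
            ["python", path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr

    def build_synthetic_dataset(tasks: list[str], out_path: str) -> None:
        """For each task, generate a candidate, run it, and record both
        working examples and (error -> likely cause) pairs."""
        records = []
        for task in tasks:
            source = llm_generate(f"Write a small program that {task}")
            ok, output = run_candidate(source)
            if ok:
                # Positive example: task description paired with working code.
                records.append({"task": task, "code": source, "label": "works"})
            else:
                # Negative example: code plus the error it produced, so the
                # model can learn "if you see error X, the problem might be Y".
                diagnosis = llm_generate(
                    f"This program failed with:\n{output}\nWhat is the likely problem?"
                )
                records.append({"task": task, "code": source,
                                "error": output, "diagnosis": diagnosis})
        with open(out_path, "w") as f:
            json.dump(records, f, indent=2)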

This is similar to how it works for natural languages. It turns out that if you train the model on, e.g., vast quantities of English, teaching it other languages doesn't require nearly as much data, because it has already internalized all the "shared" parts (and there are a lot more of those than there are surface differences).

But, yes, it does mean that new things that are drastic breaks with old practices are much harder to teach compared to incremental improvements.