Comment by reitzensteinm

14 days ago

But coding is largely trained on synthetic data.

For example, Claude can fluently generate Bevy code as of the training cutoff date, and there's no way there's enough training data on the web to explain this. There's an agent somewhere in a compile test loop generating Bevy examples.

A custom LLM language could have fine grained fuzzing, mocking, concurrent calling, memoization and other features that allow LLMs to generate and debug synthetic code more effectively.

If that works, there's a pathway to a novel language having higher quality training data than even Python.

1 comment

reitzensteinm

mbreese 14 days ago

I recently had Codex convert an script of mine from bash to a custom, Make inspired language for HPC work (think nextflow, but an actual language). The bash script submitted a bunch of jobs based on some inputs. I wanted this converted to use my pipeline language instead.

I wrote this custom language. It's on Github, but the example code that would have been available would be very limited.

I gave it two inputs -- the original bash script and an example of my pipeline language (unrelated jobs).

The code it gave me was syntactically correct, and was really close to the final version. I didn't have to edit very much to get the code exactly where I wanted it.

This is to say -- if a novel language is somewhat similar to an existing syntax, the LLM will be surprisingly good at writing it.