Comment by koushikn

3 months ago

If we had a tokeniser that worked on ELF (or PE/COFF) binaries, would it be feasible to train LLMs on existing binaries and have them generate binary code directly, skipping the need for programming languages?
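
For concreteness, something like the minimal sketch below is roughly what I mean: pull the raw .text section out of an ELF file and treat each byte as a token (this assumes pyelftools; a real tokeniser would of course merge bytes into a learned vocabulary).

```python
# Hypothetical sketch: extract raw machine code from an ELF binary as a token
# stream. pyelftools is an assumed dependency; the "tokeniser" here is just
# one token per byte (vocabulary size 256), the simplest possible scheme.
from elftools.elf.elffile import ELFFile

def text_section_tokens(path: str) -> list[int]:
    """Return the .text section of an ELF file as a list of byte tokens."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        text = elf.get_section_by_name(".text")
        if text is None:
            return []
        return list(text.data())  # each byte becomes one integer token

if __name__ == "__main__":
    tokens = text_section_tokens("/bin/ls")
    print(f"{len(tokens)} byte tokens in .text")
```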

I've thought about this a lot, and it ultimately comes down to context size. Programming languages themselves are, in a sense, a "compression technique" for assembly code. Current models, even at the high end (1M-token context windows), do not have nearly enough workable context to be effective at writing even trivial programs in binary or assembly. For simple instruction sequences, sure, but for now the compression that languages (or DSLs) provide is a context-efficiency win.
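
A rough way to see the compression point is to compare a trivial C function with its own disassembly listing. This is only a sketch under stated assumptions (gcc and objdump on PATH, character counts as a crude stand-in for token counts):

```python
# Rough illustration: compile a one-line C function and compare the size of
# the source text with the size of the disassembly listing objdump produces.
import os
import subprocess
import tempfile

SRC = "int add(int a, int b) { return a + b; }\n"

with tempfile.TemporaryDirectory() as d:
    c_path = os.path.join(d, "add.c")
    o_path = os.path.join(d, "add.o")
    with open(c_path, "w") as f:
        f.write(SRC)
    # Compile to an object file, then dump the disassembly as text.
    subprocess.run(["gcc", "-O2", "-c", c_path, "-o", o_path], check=True)
    disasm = subprocess.run(["objdump", "-d", o_path],
                            capture_output=True, text=True, check=True).stdout
    print(f"source: {len(SRC)} chars, disassembly listing: {len(disasm)} chars")
```

Even a one-line function typically expands several-fold in the listing, and the gap widens quickly for real programs once control flow, register allocation, and ABI boilerplate show up.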

So it's possible, but how precise it is depends on your use case. LLM compilers would suffer from the same sort of propensity for bugs that human programmers do.