Comment by koushikn
3 months ago
Is it feasible that, given a tokeniser that works on ELF (or PE/COFF) binaries, we could train LLMs on existing binaries and have them generate binary code directly, skipping the need for programming languages?
I've thought about this a lot, and it ultimately comes down to context size. Programming languages are themselves a sort of "compression technique" for assembly code. Current models, even at the high end (1M-token context windows), do not have nearly enough workable context to write even trivial programs in binary or assembly. For simple instructions, sure, but for now the compression that languages (or DSLs) provide is a context-efficiency win.
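To make that context arithmetic concrete, here is a rough sketch that treats every byte of a binary as one token and compares the count against a 1M-token window. This is purely illustrative: a real tokeniser would be BPE over bytes or instruction-aware and would compress better, and the file path is just an example.

```python
# Rough back-of-envelope: byte-level "tokenization" of a binary, one token per byte.
# Illustrative only -- a real tokeniser would compress raw bytes considerably.
from pathlib import Path

CONTEXT_WINDOW = 1_000_000  # ~1M tokens, roughly the current high end

def tokenize_bytes(path: str) -> list[int]:
    """Map every byte of the file to a token ID in 0..255."""
    return list(Path(path).read_bytes())

tokens = tokenize_bytes("/bin/ls")  # any ELF binary on a Linux system will do
print(f"{len(tokens):,} byte-level tokens "
      f"({len(tokens) / CONTEXT_WINDOW:.1%} of a 1M-token window)")
```

Even a small utility eats a noticeable fraction of the window this way, before the model has produced a single output token, which is the point about source languages being a compression layer.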
Wouldn't all the binaries be in the training data, rather than in the context? And the output could be generated in pieces, with something concatenating the pieces into a working binary?
ChatGPT claims it's possible, but not allowed due to OpenAI safety rules: https://chatgpt.com/share/68fb0a76-6bf8-800c-82f7-605ff9ca22...
Possible, but not precise, depending on your use case. LLM compilers would suffer from the same propensity for bugs as humans.
I can attest that existing LLMs work surprisingly well for disassembly.
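For anyone wanting to try this, a simple way to grade an LLM's disassembly is to check it against a conventional disassembler. Below is a sketch using Capstone; the byte string is just an illustrative x86-64 stub, not taken from any particular binary.

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

# A tiny x86-64 function: push rbp; mov rbp, rsp; mov eax, 0x2a; pop rbp; ret
code = bytes.fromhex("554889e5b82a0000005dc3")

# The hex dump is what you would paste into the LLM prompt...
print("prompt bytes:", code.hex(" "))

# ...and Capstone gives the ground truth to compare the LLM's answer against.
md = Cs(CS_ARCH_X86, CS_MODE_64)
for insn in md.disasm(code, 0x1000):
    print(f"0x{insn.address:x}: {insn.mnemonic} {insn.op_str}")
```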