← Back to context

Comment by kesor

2 months ago

Diffusion models for images are already pretty much binary code generators. And we don't need to treat each bit individually, even in binary code there are whole segments that can be tokenized into a single token.

Regarding training, we have many binaries all around us, for many of them we also have the source code in whichever language. As a first step we can use the original source code and ask a third party model to explain what it does in English. Then use this English to train the binary programmer model. Eventually the binary programmer model can understand binaries directly and translate them to English for its own use, so with time, we might not even need binaries that have source code, we could narrate binaries directly.