Comment by AnthonyMouse

1 day ago

Now I'm kind of curious: if you give an LLM the disassembly of a proprietary firmware blob and tell it to turn it into human-readable source code, how good is it at that?

You could probably even train one to do that in particular. Take existing open source code and its assembly representations as training data, then treat it like a language translation task. Use the context to guess what the variable names were before the compiler discarded them, etc.
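A rough sketch of how such training pairs could be collected, assuming `gcc` is on the PATH and that each `.c` file compiles on its own. The helper names (`compile_to_asm`, `make_pair`, `build_dataset`) and the JSONL record shape are made up for illustration, not taken from any existing pipeline:

```python
import json
import subprocess
from pathlib import Path


def compile_to_asm(c_file: Path, opt_level: str = "-O2") -> str:
    """Return the assembly gcc emits for one C file (assumes gcc is on PATH)."""
    result = subprocess.run(
        ["gcc", "-S", opt_level, "-o", "-", str(c_file)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def make_pair(asm: str, source: str) -> dict:
    """Frame one example as translation: assembly in, original source out."""
    return {"input": asm.strip(), "target": source.strip()}


def build_dataset(src_dir: Path, out_path: Path) -> int:
    """Write one JSON line per C file; return the number of pairs written."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for c_file in sorted(src_dir.rglob("*.c")):
            try:
                asm = compile_to_asm(c_file)
            except (subprocess.CalledProcessError, FileNotFoundError):
                # Skip files that do not compile standalone, or a missing gcc.
                continue
            source = c_file.read_text(encoding="utf-8", errors="replace")
            out.write(json.dumps(make_pair(asm, source)) + "\n")
            count += 1
    return count
```

Something like `build_dataset(Path("some_project"), Path("pairs.jsonl"))` would then give one record per file. Compiler output is not a disassembly of a stripped binary, so a realistic setup would probably also need to assemble, strip, and disassemble each file to match what the model sees at inference time.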

The most difficult parts of getting readable code would be dealing with inlined functions and otherwise-duplicated code from macros or similar, and dealing with in-memory structure layouts; both are pretty complicated, very global tasks (never mind naming things, though perhaps LLMs have a good shot at that).
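To make the structure-layout problem concrete, here is a toy sketch of the local half of it: turning observed memory accesses into a guessed struct. The `accesses` input format and both function names are hypothetical; a real tool would have to pull the accesses out of a disassembler first.

```python
from collections import defaultdict


def guess_layouts(accesses):
    """Group (base, offset, size) memory accesses into per-base field lists.

    `accesses` is a hypothetical, already-extracted list such as
    [("rdi", 0, 8), ("rdi", 8, 4)], with sizes in bytes.
    """
    fields = defaultdict(set)
    for base, offset, size in accesses:
        fields[base].add((offset, size))
    return {base: sorted(seen) for base, seen in fields.items()}


def render_struct(name, layout):
    """Render one guessed layout as C-like text with placeholder field names."""
    lines = [f"struct {name} {{"]
    for offset, size in layout:
        lines.append(f"    uint{size * 8}_t field_{offset:#x};")
    lines.append("};")
    return "\n".join(lines)
```

For example, `render_struct("guess", guess_layouts([("rdi", 8, 4), ("rdi", 0, 8)])["rdi"])` produces a two-field struct. This only looks at one base register in isolation. Working out that the same layout shows up behind different registers in different functions, and merging those views, is the global part that makes this hard.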

That said, ChatGPT currently seems to fail even basic things. It completely missed the `thrM` path being possible here: https://chatgpt.com/share/69296a8e-d620-800b-8c25-15f4260c78... https://dzaima.github.io/paste/#0jZJNTsMwEIX3OcWoSFWCqrhN0wb... And that's only basic, bog-standard branching, with no in-memory structures or stack usage. (Such trivial problems could be handled by running an actual proper disassembler before throwing an LLM at that wall, but of course that only solves the easy part.)

Should be possible. A couple of years ago I used an earlier ChatGPT model to understand and debug some ARM assembly, which I'm not personally very familiar with.

I can imagine that a process like what you describe, where a model is trained specifically on .asm / .c file pairs, would be pretty effective.