Comment by stared
3 days ago
One of the authors here.
The tasks here are entry level, so we are impressed that some AI models can detect patterns while looking only at binary code. We didn't take that for granted.
For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5): https://news.ycombinator.com/item?id=46846101
There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.
Have you tried stuffing a whole set of tutorials on how to use Ghidra into the context, especially for models with a 1M-token context like Gemini?
No. To give it a fair test, we didn't tinker with model-specific context engineering. Adding skills, examples, etc. is very likely to improve performance. So is any interactive feedback.
Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...
Why, though? That would make sense if you were just trying to do a comparative analysis of different agents' ability to use specific tools without context, but if your thesis is:
> However, [the approach of using AI agents for malware detection] is not ready for production.
Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."