Comment by Analemma_
10 hours ago
I'm curious: has someone done a lengthy write-up of best practices to get good results out of AI security audits? It seems like it can go very well (as it did here) or be totally useless (all the AI slop submitted to HackerOne), and I assume the difference comes down to the quality of your context engineering and testing harnesses.
This post did a little bit of that but I wish it had gone into more detail.
OpenAI just released “codex security”, worth trying (along with other suggestions) if your org has access https://openai.com/index/codex-security-now-in-research-prev...
The HackerOne slop is because there's a financial incentive (bug bounties) involved, which means people who don't know what they are doing blindly submit anything that an LLM spots for them.
If you're running the security audit yourself you should be in a better position to understand and then confirm the issues that the coding agents highlight. Don't treat something as a security issue until you can confirm that it is indeed a vulnerability. Coding agents can help you put that together but shouldn't be treated as infallible oracles.
That sounds like the same problem (a deluge of slop) with a different interface (eating straight from the trough rather than waiting for someone to put a bow on it and stamp their name to it)?
Seems very similar to turning on compiler warnings. A load of scary nothings, and a few bugs. But you fix the bugs and clarify the false positives, and end up with more robust and maintainable code.
I've found it's pretty good. It's really not that much of a burden to dig through 10 reports and find the 2 that are legitimate.
It's different from Hacker One because those reports tend to come in with all sorts of flowery language added (or prompt-added) by people who don't know what they are doing.
If you're running the prompts yourself against your own coding agents you gain much more control over the process. You can knock each report down to just a couple of sentences which is much faster to review.
1 reply →
The question still is: will enough useful stuff be included, to make it worth to dig through the slop? And how to tune the prompt to get better results.
Best way to figure that out is to try it and see what happens.
3 replies →
That depends on how the tool is used. People who ask for a security vulnerability get slop. People who asked for deeper analysis often get something useful - but it isn't always a vulnerability.
I assume it's just like asking for help refactoring, just targeting specific kinds of errors.
I ran a small python script that I made some years ago through an LLM recently and it pointed out several areas where the code would likely throw an error if certain inputs were received. Not security, but flaws nonetheless.
You're either digging through slop or digging through your whole codebase anyway.
We split our work:
* Specification extraction. We have security.md and policy.md, often per module. Threat model, mechanisms, etc. This is collaborative and gets checked in for ourselves and the AI. Policy is often tricky & malleable product/business/ux decision stuff, while security is technical layers more independent of that or broader threat model.
* Bug mining. It is driven by the above. It is iterative, where we keep running it to surface findings, adverserially analyze them, and prioritize them. We keep repeating until diminishing returns wrt priority levels. Likely leads to policy & security spec refinements. We use this pattern not just for security , but general bugs and other iterative quality & performance improvement flows - it's just a simple skill file with tweaks like parallel subagents to make it fast and reliable.
This lets the AI drive itself more easily and in ways you explicitly care about vs noise
No mention of the quality of the engineers reviewing the result?