This is exactly what I'd want from an 'AI coding companion'.
Don't write or fix the code for me (thanks but I can manage that on my own with much less hassle), but instead tell me which places in the code look suspicious and where I need to have a closer look.
When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
ChatGPT is even less useful since it basically just spends a lot of time telling me 'everything looking great yay good job high-five!'.
So far, traditional static code analysis has been much more helpful in finding actual bugs, but static analysis being clean doesn't mean there are no logic bugs, and this is exactly where LLMs should be able to shine.
If getting useful potential-bugs information from LLMs requires an extensively customized setup, then the whole idea becomes much less useful - it's similar to how static code analysis doesn't get used if it requires extensive setup or manual build-system integration instead of just being a button or menu item in the IDE, or enabled by default for each build.
This is a point I see discussed surprisingly little. Given that many (most?) programmers like designing and writing code (excluding boilerplate), and don't particularly enjoy reviewing code, it certainly feels backwards to make the AI write the code and relegate the programmer to reviewing it. (I know, of course, that the whole thing is being sold to stakeholders as "LoC machine goes brrrr" – code review? what's that?)
Creativity is fun. AIs automate that away. I want an AI that can do my laundry, fold it, and put it away. I don't need an AI to write code for me. I don't mind AI code review, it sometimes has a valid suggestion, and it's easy enough to ignore most of the rest of the time.
To me, it's the natural result of gaining popularity: enough people started using these tools after the hype train rolled through and are now giving honest feedback. Real honest feedback can feel like a slap in the face when all you have had is overwhelming positive feedback from those aboard the hype train.
The writing has been on the wall with so-called hallucinations, where LLMs just make stuff up, that the hype was way out over its skis. Stories like lawyers being fined for presenting unchecked LLM output as fact will continue to take the shine off, and hopefully some of the raw gung-ho nature will slow down a bit.
There are a lot of good AI code reviewers out there that learn project conventions from prior PRs and make rules from them. I've found they definitely save time and catch things I would have missed - things like cubic.dev or greptile etc. Especially helpful for running an open source project, where code quality can have high variance and as a maintainer you may feel hesitant to be direct with someone -- the machine has no feelings, so it is what it is :)
honestly? this but zoom out. machines are supposed to do the grunt work so that people can spend their time being creative and doing intangible, satisfying things but we seem to have built machines to make art, music and literature in order to free ourselves up to stack bricks and shovel manure.
codex can actually do useful reviews on pull requests, as of the last few weeks
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
Here's a technique that often works well for me: When you get unexpectedly poor results, ask the LLM what it thinks an effective prompt would look like, e.g. "How would you prompt Claude Code to create a plan to effectively review code for logic bugs, ignoring things like FIXME and TODO comments?"
The resulting prompt is too long to quote, but you can see the raw result here: https://gist.github.com/CharlesWiltgen/ef21b97fd4ffc2f08560f...
From there, you can make any needed improvements, turn it into an agent, etc.
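If you want to script that two-step flow end-to-end, a minimal sketch might look like this (the openai client and model name here are illustrative assumptions, not part of the original workflow):

```python
# Meta-prompting sketch: ask the model to design the review prompt,
# then run the generated prompt against the code under review.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you review with
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Step 1: have the model write an effective review prompt.
review_prompt = ask(
    "How would you prompt an LLM to create a plan to effectively review "
    "C code for logic bugs, ignoring things like FIXME and TODO comments? "
    "Reply with the prompt text only."
)

# Step 2: feed the generated prompt plus the code back in.
with open("src/module.c") as f:  # placeholder path
    print(ask(review_prompt + "\n\n```c\n" + f.read() + "\n```"))
```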
I've found this a really useful strategy in many situations when working with LLMs. It seems odd that it works, since one would think its ability to give a good reply to such a question means it already "understands" your intent in the first place, but that's just projecting human ability onto LLMs. I would guess this technique is similar to how reasoning modes seem to improve output quality, though I may misunderstand how reasoning modes work.
This is a great idea, and worth doing. Another option in Claude Code that can be worth trying is the planning mode, which you toggle with shift+tab. Have it plan out what it's going to do, and keep iterating until the plan seems sound. Tbh I wish I'd found the planning mode earlier; it's been such a great help.
If you don't edit this prompt, how is it any better than the LLM generating the same context for itself on the fly in "thinking mode"?
I have also had some success with this method
I asked ChatGPT to analyze its weaknesses and give me a pre-prompt to best help mitigate them and it gave me this: https://pastebin.com/raw/yU87FCKp
I've found it very useful to avoid sycophancy and increase skepticism / precision in the replies it gives me
I've "worked" with Claude Code to find a long standing set of complex bugs over the last couple of days, and it can do so much more. It's come up with hypotheses, tested them, used gdb in batch mode when the hypotheses failed in order to trace what happened at the assembly level, and compared with the asm dump of the code in question.
It still needs guidance, but it quashed bugs yesterday that I've previously spent many days on without finding a solution for.
It can be tricky, but they definitely can be significant aid for even very complex bugs.
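For anyone unfamiliar, "gdb in batch mode" means non-interactive runs along these lines (a generic sketch; the binary, breakpoint, and commands are placeholders, not the actual session from this debugging story):

```python
# Sketch: driving gdb non-interactively, the way an agent can.
# --batch runs the -ex commands and exits; output goes to stdout.
import subprocess

result = subprocess.run(
    [
        "gdb", "--batch",
        "-ex", "break parse_header",   # placeholder function name
        "-ex", "run",
        "-ex", "bt",                   # backtrace at the breakpoint
        "-ex", "disassemble",          # asm of the current function
        "./myprog",                    # placeholder binary
    ],
    capture_output=True, text=True,
)
print(result.stdout)
```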
Cursor BugBot is pretty good for this, we did the free trial and it was so popular with our devs that we ended up keeping it. Occasional false positives aside, it's very useful. It saves time for both the PR submitter and the reviewer.
I've had reasonably good success with asking Claude things like: "There's a bug somewhere that is causing slow response times on several endpoints, including <xyz>. Sometimes response times can get to several seconds long, and don't look correlated with CPU or memory usage. Database CPU and memory also don't seem to correlate. What is the issue?" I have to iterate a few times, but it's pointed me at a few really tricky issues that would have probably taken hours to find.
Definitely optimistic for this way to use AI
I found GPT-5 to be much less sycophantic than other models when it comes to this stuff, so your mention of 'everything looking great yay good job high-five' surprises me. Using it via Codex CLI, it often questions things. Gemini 2.5 Pro is also good on this.
In an application I'm working on, I use gpt-oss-20B. In a prompt I dump in the OWASP Top 10 web vulnerabilities, and a note that it should only comment on "definitive vulnerabilities". Has been pretty effective in finding vulnerabilities in the code I write (and it's one of the poorest-rated models if you look at some comments).
Where I still need to extend this is to introduce function calling into the flow: when "it has doubts" during reasoning would be the right time to call a tool that expands the context it's working with (pull in other files, etc.).
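A rough sketch of that setup, assuming gpt-oss-20B is served behind an OpenAI-compatible endpoint (llama.cpp's server, Ollama, etc.); the URL, model name, and truncated OWASP text are placeholders:

```python
# Sketch: local security review against the OWASP Top 10 with gpt-oss-20B.
# Assumes an OpenAI-compatible local endpoint; URL and model name will vary.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

OWASP_TOP_10 = """\
A01 Broken Access Control
A02 Cryptographic Failures
A03 Injection
(...paste the full OWASP Top 10 descriptions here...)
"""

SYSTEM = (
    OWASP_TOP_10
    + "\nReview the user's code against these categories. "
    + "Only comment on definitive vulnerabilities; skip anything speculative."
)

def review(code: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": code},
        ],
    )
    return reply.choices[0].message.content
```

The function-calling extension mentioned above would presumably slot in as a `tools=` parameter on the same call, letting the model request extra files when it has doubts.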
> (and it's one of the poorest-rated models if you look at some comments).
Yeah, don't listen to the "wisdom of the crowd" when it comes to LLMs; there seems to be a ton of FUD going on, especially on subreddits.
GPT-OSS was piled on for being dumb in the first week of release, yet none of the software properly supported it at launch. As soon as it was working properly in llama.cpp, it was clear how strong the model was, but by that point the popular sentiment seemed to have spread and solidified.
Tool calling is the best lever for getting value out of LLMs
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
I explicitly asked it to read all the code (within Cline) and it did so, gave me a dozen action items by the end of it, on a Django project. Most were a bit nitpicky, but two or three issues were more serious. I found it pretty useful!
My thoughts exactly. So many actually useful tools could be built on top of LLMs, but most of the resources go into the no-code space.
I get it, though: non-programmers or weak programmers don't scrutinise the results and are more likely to be happy to pay. Still, a bit of a shame.
Maybe these tools exist, but at least to me, they don't surface among all the noise.
Because most LLMs are just a REST call, it’s a trivial matter to wire them up in a loop over all source files. The fiddly part is finding a good library for enumerating files while adhering to .gitignore and/or “project” file include/exclude rules!
Even very simple prompts can yield very useful outputs.
“Report each bug you spot in this code with a markdown formatted report.” worked better than I expected.
It costs just a couple of dollars to scan through an entire codebase with something like Gemini Flash.
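A minimal sketch of the loop (assuming the `pathspec` package for the .gitignore handling and an OpenAI-style client; the model name is a placeholder):

```python
# Sketch: loop a cheap LLM over every source file, honoring .gitignore.
# `pathspec` handles the "fiddly part" of gitignore-style matching.
from pathlib import Path

import pathspec
from openai import OpenAI

client = OpenAI()

ignored = pathspec.PathSpec.from_lines(
    "gitwildmatch", Path(".gitignore").read_text().splitlines()
)

for path in Path(".").rglob("*.c"):
    if ignored.match_file(str(path)):
        continue
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any cheap model works
        messages=[{
            "role": "user",
            "content": "Report each bug you spot in this code with a "
                       "markdown formatted report.\n\n"
                       + path.read_text(errors="replace"),
        }],
    )
    print(f"## {path}\n{reply.choices[0].message.content}\n")
```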
I use Zed's "Ask" mode for this all the time. It's a read only mode where the LLM focuses on figuring out the codebase instead of modifying it. You can toggle it freely mid conversation.
Indeed, in many machine learning models, classification is easier than generation. Maybe that's consistent with ChatGPT's intelligence level.
Suggestion: run a regex to remove those FIXME comments first, then try the experiment again.
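Something like this rough, line-based pass would do it (run it on a scratch checkout, not your working tree; multi-line block comments that merely contain a marker are left alone):

```python
# Sketch: drop comment lines mentioning FIXME/TODO before handing the
# code to the model, so it can't just parrot them back.
import re
from pathlib import Path

MARKER = re.compile(r"\b(FIXME|TODO)\b")

for path in Path("src").rglob("*.[ch]"):
    kept = [
        line
        for line in path.read_text(errors="replace").splitlines(keepends=True)
        if not (MARKER.search(line)
                and line.lstrip().startswith(("//", "/*", "*")))
    ]
    path.write_text("".join(kept))
```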
I often use Claude/GPT-5/etc to analyze existing repositories while deliberately omitting the tests and documentation folders because I don't want them to influence the answers I'm getting about the code - because if I'm asking a question it's likely the documentation has failed to answer it already!
i've had great success with both chatGPT and claude with the prompt "tell me how this sucks" or "why is this shit". being a bit more crass seems to bump it out of the sycophantic mode, and being more open-ended in the type of problems you want it to find seems to yield better results.
but i've been limiting it to a lot less than 20k LoC, i'm sticking with stuff i can just paste into the chat window.
Really surprised that nobody in this thread mentions using Gemini 2.5 Pro. Its 1m context really shines for code review.
GPT 5 has been disappointing with thinking and without.
I really didn't expect a story about curl and AI to be positive for once.
Some history: https://hn.algolia.com/?q=curl+AI
Yeah this is really fair play to Daniel Stenberg that he still approached these AI generated bug reports with an open mind after all the problems he's had.
I think the big difference is that these aren't AI generated bug reports. They are bugs found with the assistance of AI tools that were then properly vetted and reported in a responsible way by a real person.
Yep, I feel for the guy. He's had to deal with a hell of a lot of frustrating crap from AI slop to crazy end-users. Kudos for staying on top of it.
Always fun to wake up (ok; I didn't wake up, I got off a 10 hour flight) to see my work on the front page of hn.
I'll be doing a retrospective in a few weeks when the dust has settled, as well as new tools I've been made aware of.
I thoroughly enjoyed the post, one of the few lengthier blog posts I read start-to-finish.
Seems like ZeroPath might be worth looking into if the price is reasonable
Thank you, it means a lot.
I’m curious what tools that may be.
Here are 55 closed PRs in the curl repo which credit "sarif data" - I think those are the ones Daniel is talking about here https://github.com/curl/curl/pulls?q=is%3Apr+sarif+is%3Aclos...
This is notable given Daniel Stenberg's reports of being bombarded by total slop AI-generated false security issues in the past: https://www.linkedin.com/posts/danielstenberg_hackerone-curl...
Concerning HackerOne: "We now ban every reporter INSTANTLY who submits reports we deem AI slop. A threshold has been reached. We are effectively being DDoSed. If we could, we would charge them for this waste of our time"
Also this from January 2024: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
Some of those bugs, like using the wrong printf-specifier for a size_t, would be flagged by the compiler with the right warning flags set. An AI oracle which tells me, "your project is missing these important bug-catching compiler warning flags," would be quite useful.
A few of these PRs are dependabot PRs which match on "sarif", I am guessing because the string shows up somewhere in the project's dependency list. "Joshua sarif data" returns a more specific set of closed PRs. https://github.com/curl/curl/pulls?q=is%3Apr+Joshua+sarif+da...
The models have improved quite a bit since then; I guess his change of opinion shows that.
No, he's still dealing with a flood of crap, even in the last few weeks, off more modern models.
It's primarily from people just throwing source code at an LLM, asking it to find a vulnerability, and reporting the result as-is, without any actual understanding of whether it is or isn't a vulnerability.
The difference in this particular case is that it's someone who is: 1) using tools specifically designed for security audits and investigations, and 2) taking the time to read and understand each reported vulnerability, verifying that it is actually a vulnerability before reporting.
Point 2 is the most significant bar, one that people are woefully failing to meet, wasting a terrific amount of his time. The one that got shared a couple of weeks ago https://hackerone.com/reports/3340109 didn't even call curl. It was straight-up hallucination.
I think it's more about how people are using it. An amateur who spams him with GPT-5-Codex produced bug reports is still a waste of his time. Here a professional ran the tools and then applied their own judgement before sending the results to the curl maintainers.
It's probably also the difference between idiots hoping to cash out or get credit for vulnerabilities by just throwing ChatGPT at the wall, and this case, where a somewhat seasoned researcher is trialing more customized tools.
This should probably link to the original blog post by Joshua Rogers:
https://joshua.hu/llm-engineer-review-sast-security-ai-tools... ("Hacking with AI SASTs: An overview of 'AI Security Engineers' / 'LLM Security Scanners' for Penetration Testers and Security Teams")
The PDF slide deck that accompanies that post includes some screenshots of the tools that were used: https://joshua.hu/files/AI_SAST_PRESENTATION.pdf
Thanks—we've added that link to the toptext above.
It wasn't immediately obvious to me what the AI tools were? He mentioned that multiple other tools failed to find anything, so I'm very curious to hear what made this strategy so superior.
there's a blog link https://joshua.hu/llm-engineer-review-sast-security-ai-tools... that has a Products chapter
I guess the Mastodon link is simply confirmation that the bugs were indeed bugs, even with wrong code snippets?
I'd love to know how an 'AI-native' SAST uses AI under the hood, but I suppose that's a trade secret.
Now that is how LLM assistance for coding can be useful. Would be interesting to know which set of tools was used exactly. How might one reproduce this kind of assistance for other code bases?
See Joshua's post for details: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
Tools included ZeroPath, Corgea and Almanax.
Love this take actually; I have been working on this and published on it back in 2023/2024. Recently, inspired by the agentic flow and tool looping of Claude Code and Cline, I experimented with the same approach using tools like file_read and dir_list, throwing in a few SAST tools and security prompts, against the WordPress plugin ecosystem (plugins with 10k-100k active installations). I scanned around ~600 plugins and to my surprise it yielded ~45 critical and ~120 high-severity issues, with about 20% being non-reachable vulns. I spent around $6 and ~40 million tokens with the grok-4 fast reasoning model, and the results were impressive. I gave claude-sonnet a try but was significantly rate-limited despite having $50 in credits from Anthropic for research.
You can read about my experience here: https://codepathfinder.dev/blog/introducing-secureflow-cli-t...
Old post: https://shivasurya.me/security-reviews/sast/2024/06/27/autom...
When I read “we consider nread == 0 as reading a byte and we shouldn’t” I immediately think of all the things that look like bugs but are there because some critical piece of infrastructure relies on that behavior. AI isn’t going to know about that unless you tell it, and the problem is that there’s plenty of folks who have job security precisely because they don’t write that down.
If something is found by Valgrind, we can reproduce it ourselves. Here we get private bug reports found by "his set of AI assisted tools".
The set seems to be:
https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
So he likes ZeroPath. Does that get us any further? No, the regular subscription costs $200 and the free one-time version looks extremely limited and requires yet another login.
Also of course, all low hanging fruit that these tools detect will be found quickly in open source (provided that someone can afford a subscription), similar to the fact that oss-fuzz has diminishing returns.
Presumably the bug reports were private because some of them might relate to curl security.
You can see the fixes that resulted from this in the PRs that mention "sarif" in the curl repository: https://github.com/curl/curl/pulls?q=is%3Apr+sarif+is%3Aclos...
Yes, the AI gave him leads, and a talented programmer still has to follow up on them one by one.
It's like police facial recognition: it can help the police, but there is no way it's "replacing the police".
More interesting to me is how to stop these bugs from occurring in the first place. The example given in the thread is the kind of bug that C (and mutation) excels at creating.
And how many would’ve been avoided by finishing the rust port?
The linked blog post https://joshua.hu/llm-engineer-review-sast-security-ai-tools... shows that most of the used tools can be run in ci and comment on the PRs.
To anyone who thinks about our current situation for more than a few minutes, AI is
* Clearly useful to people who are already competent developers and security researchers
* Utterly useless to people who have no clue what they're doing
But the latter group's incompetence does not make AI useless, in the same way that a fighter jet is not useless because a toddler cannot pilot it.
The metaphor is more apt if you change fighter jets to road vehicles, which are driven by most of the population and whose incompetent use can very much affect you.
Imagine what your doctors will be like two generations down the road.
This is currently true. There was a perfect example of the other side of the coin with curl again a few weeks ago (https://daniel.haxx.se/blog/2025/07/14/death-by-a-thousand-s...)
> Clearly useful to people who are already competent developers
> Utterly useless to people who have no clue what they're doing
> the same way that a fighter jet is not useless
AI is currently like a bicycle, while we were all running hills before.
There's a skill barrier, and it's getting lower each week.
The marketing goal is to say "Push the pedal and it goes!" as if it were a car on a highway, but it is a bicycle: you have to keep pedaling.
The effect on the skilled-in-something-else folks is where this is making a difference.
If you were running, the goal was to strengthen your tendons to handle the pavement, and a 2-hour marathon pace is almost impossible.
A bicycle makes a sub-2-hour marathon distance "easy" for someone who does competitive rowing, while it remains impossible for those who have been training for foot races forever.
That's because the bicycle moves the problem from unsprung weight and energy recovery into a VO2-max problem, and also into a novel aerodynamics problem.
And if you need to walk a rock garden, now you have to lug the bike along with you. It is not without its costs.
This AI thing is a bicycle for the mind, but a lot of people ride it only downhill and with no brakes.
True, AI moves the problem somewhere else. But I’m not sure the new problems are actually easier to solve in the long run.
I’m a reasonable developer with 30+ years of experience. Recently I worked on an API design project and had to generate a mock implementation based on a full OpenAPI spec - exactly what Copilot should be good at. No amount of prompting could make it generate a fully functional Spring Boot project that both served the mock API and presented the spec at a URL at the same time. Yet it did a very neat job of just the mock for a simpler version of the same API a few weeks prior. Go figure.
Solid metaphor. No notes.
Though it might be bad news for the companies that got big by saying they'd be able to make infinite money selling fighter-jets to all the children of the world.
> “A good tool in the hands of a competent person is a powerful combination,” says Daniel Stenberg.
Daniel Stenberg has been vocal about AI generated patches in the past, and it's interesting to see him changing course here.
https://media.ccc.de/v/froscon2025-3407-ai_slop_attacks_on_t...
> * Utterly useless to people who have no clue what they're doing
I disagree.
I'm making a board game of 6 colors of hexes, and I wanted to be able to easily edit the board. The first time around, I used a screenshot of a bunch of hexagons and used paint to color them (tedious, ugly, not transparent, poor quality). This time, I asked ChatGPT to make an SVG of the board and then make a JS script so that clicking on a hex could cycle through the colors. Easier, way higher quality, adjustable size, clean, transparent.
It would've taken me hours to learn and set that up for myself, but ChatGPT did it in 10min with some back and forth. I've made one SVG in my life before this, and never written any DOM-based JS scripts.
Yes, it's a toy example, but you don't have to know what you're doing to get useful things from AI.
> but ChatGPT did it in 10min with some back and forth
You might be underestimating the expertise you applied in these 10 minutes. I know I often do.
> it's a toy example
This technology does exceptionally well on toy examples, I think because there are far fewer constraints on acceptable output than with ‘real’ examples.
> you don't have to knwo what you're doing to get useful things from AI
You do need to know what is useful though, which can be a surprisingly high bar.
Yeah you clearly don't have "no clue" of what you're doing in this example though.
You're someone who knows the difference between a PNG and an SVG, knows enough Javascript to know that "DOM-based" JS is a thing, and has presumably previously worked in software/IT.
You're smart enough to know things, and you're also smart enough to know there's a lot that you don't know.
That's a far cry from the way a lot of laypeople, college kids, and fully nontechnical people try to use LLMs.
It seems to me that you then did not learn what you would otherwise have learned, and so did not receive the critical-thinking and general halo-of-knowledge improvements that come with learning.
You sound at least somewhat experienced. You knew you wanted an SVG and that Javascript could be inserted into it. That's a pretty reasonable design starting point.
I agree AI is not "utterly useless", but its usefulness is extremely limited. If it writes all of the code for you, it tends to get into unmaintainable states very quickly, requiring manual review or guidance to overcome.
I am a bit worried about the abuse of those tools. I wonder if the tools have policies and mechanisms around that (no clue how, but something like forced disclosure if they detect that scanned code is OSS, or free usage for OSS teams). They otherwise seem like great tools for accelerating the discovery of 0-days. Maybe even worse, when building supply chain attacks one could relatively easily test against the detection mechanism before contributing malicious code (I wonder if they could detect and block malicious use/probing). However, I guess long term it will make our software more secure. It is always an arms race.
> I have already landed 22(!) bugfixes thanks to this, and I have over twice that amount of issues left to go through
Sounds like it was a lot more than 22, assuming most are valid.
do you think the "bug" exist in the latent space? i don't think all bugs does.. the bugs exist as long as variant exist in the trained weights. until we have some kinda rl env for verifying bugs.. its never gonna work "well".
Once Claude found a bug in my code, but I had to explain the structure of the data. Then and only then did it find the bug.
Perhaps Anthropic, OpenAI, and Google could compete by auditing and monitoring the top projects?
I work at an ML security R&D startup called Pwno. We've been working specifically on putting LLMs into memory security for the past year; we've spoken at Black Hat, and we worked with GGML (llama.cpp) on providing a continuous memory-security solution driven by multi-agent LLMs.
Something we learned along the way is that in this particular field of security, what we call low-level security (memory safety, etc.), validation and debugging have become more important than vulnerability discovery itself, because of hallucinations.
From our trial and error (trying validator architectures and security-research methodologies, e.g., reverse taint propagation), it seems the only way out of this problem is to design an LLM-native interactive environment in which LLMs validate their own findings through interactions with the environment or the component. The reason web-security-oriented companies like XBOW are doing very well is how easy validation is there. I saw XBOW's LLM trace at Black Hat this year; all the tooling they used, and pretty much all they needed, was curl. For web security, the backend is abstracted to the point that you send a request and it either works or you easily know why it didn't (XSS, SQLi, IDOR). But for low-level security (memory safety), the entropy of dealing with UAFs and OOBs is at another level. There are certain things you just can't tell by looking at the source; you need to look at a particular program state (heap allocation, which depends on the glibc version; stack structure; register states...), and this ReAct-style process of driving debuggers to construct a PoC/exploit is what has been a pain in the ass. (LLMs and tool calling are specifically bad at these strategic, stateful tasks; see DeepMind's Tree-of-Thoughts paper discussing this issue.) The way I've seen Google Project Zero & DeepMind's Big Sleep mitigate this is through GDB scripts, but that only scales to a certain complexity of program state.
When I was working on our integration with GGML, around two weeks of context and tool engineering already led us to very impressive findings (OOBs). But the hallucination problem scales with the number of "runs" of our agentic framework: because we monitor llama.cpp's main-branch commits, every commit triggers an internal multi-agent run on our end, and each usually takes around an hour and hundreds of agent recursions. At the end of some days we would have 30 really convincing, in-depth reports on OOBs and UAFs. But because of how costly it is to validate even one (from understanding to debugging to PoC writing...) and because of hallucinations (each run is also really expensive), we had to pause the project for a bit and focus on solving the agentic validation problem first.
I think as the environment gets more and more complex, interactions with the environment, and learning from those interactions, will matter more and more.
> I think as the environment gets more and more complex, interactions with the environment, and learning from those interactions, will matter more and more
Thanks for sharing your experience! It correlates with this recent interview with Sutton [1]: real intelligence is learning from feedback in a complex and ever-changing environment, whereas an LLM trains on a snapshot of what has been said about that environment and operates only on that snapshot.
[1] https://www.dwarkesh.com/p/richard-sutton
Also, what an intense presentation style.
Red borders around every slide and very flashy images
Link should be updated to this:
https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
I've added that link to the toptext, but I can't quite tell which URL should be the starting point.
Love this one:
https://mastodon.social/@icing@chaos.social/1152440641434357...
>tldr
>The code was correct, the naming was wrong.
So whereas previously, repo owners were getting flooded with AI-generated PRs that were complete slop, now they're going to be flooded with PRs that contain actual bugfixes. IDK which problem is worse!
Oh, so AI usage news can be positive after all. Not to downplay the huge issue of slop-report spam, but I'm happy to see something besides doomerism.
"AI" tools can be very powerful once you approach them as what they are: very good pattern matchers and generators. This ability far surpasses anything a human could do. Detecting potential issues in software is a great application of the technology.
The key word is "potential", though. They're still wildly unpredictable and unreliable, which is why an expert human is required to validate their output.
The big problem is the people overhyping the technology, selling it as "AI", and the millions deluded by the marketing. Amidst the false advertising, uncertainty, and confusion, people are forced to speculate about the positive and negative impacts, with wild claims at both extremes. As usual, the reality is somewhere in the middle.
There are some good SAST scanners and many bad commercial scanners.
Many people advocate for the use of AI technology for SAST testing. There are even people and companies that deliver SAST scanners based on AI technology. However, most are just far from good enough.
In the best case scenario, you’ll only be disappointed. But the risk of a false sense of security is enormous.
Some strong arguments against AI scanners can be found on https://nocomplexity.com/ai-sast-scanners/
Notice it was 'a set of tools'
They're using it correctly. It's a system of tools, not an autopilot.
I did not read it, but this article from the contributor should contain more details: https://joshua.hu/llm-engineer-review-sast-security-ai-tools... (mentioned in https://mastodon.social/@bagder/115241413210606972).
It's weird that the discussion has collapsed down to "autopilots" vs. "abstention". I'm thrilled we're converging on an understanding that it's instead "people who understand what they're trying to do" vs. "vibe coders".
In defense of the cynics, I get the impression we're in a situation where (a) there's so much company marketing hype in such a competitive market that it begs cynicism, and (b) we're constantly learning the boundary of what trained LLMs can and can't actually do, as well as discovering unusual emergent workflows that really do make a difference.
Well, that's how Mr. Stenberg described it, but he wasn't the one using them. I don't know how the contributor feels about his AI tool(s).
I haven't read it yet, but later in the mastodon thread, stenberg says "this is [the contributor's] (long) blog post on his work: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...".
Something sounds fishy here. Have these bugs really been found by AI? (I don't think they were.)
If you read Corgea's (one of the products used) "whitepaper", it seems that AI is not the main show:
> BLAST addresses this problem by using its AI engine to filter out irrelevant findings based on the context of the application.
It seems that AI is being used to post-process the findings of traditional analyzers. It reduces the number of false positives, increasing the yield quality of the more traditional analyzers that were actually used in the scan.
ZeroPath seems to use similar wording, like "AI-Enabled Triage" and expressions like "combining Large Language Models with AST analysis". It also highlights that it achieves fewer false positives.
I would expect someone who developed this kind of thing to set up a feedback loop in which the AI output is somehow used to improve the static analysis tool (writing new rules, tweaking existing ones, ...). It seems like the logical next step. This might be going on in these products as well (lots of in-house rule extensions for more traditional static analysis tools, written or discovered with the help of AI, hence the "built with AI" headline on some of them).
Don't get me wrong, this is cool. Getting an AI to triage a verbose static analysis report makes sense. However, it does not mean that AI found the bugs. In this model, the capability of finding relevant stuff is still capped by the static analyzer tools.
I wonder if we need to pay for it. I mean, now that I know it's possible (at least in my head), it seems tempting to grab open source tools, set them to max verbosity, and figure out which prompts they're using on (likely vanilla) coding models to get them to triage the stuff.
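The DIY version speculated about here might look something like the sketch below: run an open-source analyzer at full verbosity and ask a vanilla model to triage each finding. This is pure guesswork about how such pipelines work; the prompt wording, the model name, and the src/ path are all placeholders:

```python
# diy_triage.py -- hypothetical "free" pipeline: cppcheck at max
# verbosity, then an off-the-shelf model as the triage step.
import subprocess
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Collect verbose findings from cppcheck (real flags; one finding
#    per line formatted as file:line:severity:message, on stderr).
result = subprocess.run(
    ["cppcheck", "--enable=all", "--inconclusive",
     "--template={file}:{line}:{severity}:{message}", "src/"],
    capture_output=True, text=True,
)

def code_context(path: str, line: int, radius: int = 15) -> str:
    """Grab the lines around a finding so the model has code to read."""
    lines = open(path, encoding="utf-8", errors="replace").readlines()
    lo, hi = max(0, line - radius), min(len(lines), line + radius)
    return "".join(lines[lo:hi])

# 2. Ask the model to play the triage role the commercial tools sell.
for finding in result.stderr.splitlines():
    parts = finding.split(":", 3)
    if len(parts) != 4:
        continue  # skip anything that isn't a templated finding
    path, line, severity, message = parts
    prompt = (
        "You are triaging static-analysis findings for a C codebase.\n"
        f"Finding: {message} ({severity}) at {path}:{line}\n"
        f"Code:\n{code_context(path, int(line))}\n"
        'Reply with JSON: {"verdict": "true_positive"|"false_positive",'
        ' "reason": "..."}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(finding, "->", resp.choices[0].message.content)
```

Whether this matches what the commercial products do is exactly the open question; per the vendors' replies below, at least Corgea and ZeroPath say they do not work this way.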
Hi there, I'm Ahmad, CEO at Corgea, and the author of the white paper. We do actually use LLMs to find the vulnerabilities AND triage findings. For the majority of our scanning, we don't use traditional static analysis. At the core of our engine is the LLM reading the lines of code to find CWEs in them.
Hi, I'm Etienne, one of the cofounders @ ZeroPath.
We do not use traditional static analyzers; our engine was built from the ground up to use LLMs as a primitive. The issues ZeroPath identified in Joshua's post were indeed surfaced and triaged by AI.
If you're interested in how it works under the hood, some of the techniques are outlined here: https://zeropath.com/blog/how-zeropath-works
Hi! Thanks for the reply.
Joshua describes it as follows: "ZeroPath takes these rules, and applies (or at least the debug output indicates as such) the rules to every .. function in the codebase. It then uses LLM’s ability to reason about whether the issue is real or not."
Would you say that is a fair assessment of the LLM role in the solution?
Looks like you're reacting to the Hacker News title here, which is currently " Daniel Stenberg on 22 curl bugs found by AI and fixed"
That's an editorialized headline (so it may get fixed by dang and co) - if you click through to what Daniel Stenberg said he was more clear:
> Joshua Rogers sent us a massive list of potential issues in #curl that he found using his set of AI assisted tools.
AI-assisted tools seems right to me here.
If the title changes, it is still a valid critique of the tools, how they might work, and a possible way of getting them for free.
Also, think about it: of course I read Joshua's report. Otherwise, how could I have known the names of the products he used?
It’s clear my attempt to keep the gist of what Daniel said while keeping under the title character count didn’t hit the mark.
How would you have worded it?
I suppose the downvoters all have subscriptions to the tools and know exactly how the tools work while leaving the rest of us in the dark.
Even Joshua's blog post does not clearly state which parts and how much is "AI". Neither does the pdf.
Somehow related:
You did this with an AI and you do not understand what you're doing here: https://news.ycombinator.com/item?id=45330378
Yeah, I'm quite confused. See also https://news.ycombinator.com/item?id=43907376 ("Curl: We still have not seen a valid security report done with AI help")
AI is non-deterministic, as we know.
That makes its results unpredictable.
So don’t have AI create your bugs.
Instead, have your AI look for problems - then have it create deterministic tools, and let the tools catch the issues in a repeatable, understandable, auditable way. Have it build short, easy-to-understand scripts you can commit to your repo, with file names, line numbers, and zero/nonzero exit codes.
It's that key step of transforming AI insights into detection tools that turns your outcomes from probabilistic to deterministic. Ask it to optimize the tools so they run in seconds. You can leave them in the codebase forever as linters, integrate them into your CI, and never have that same bug again.
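For instance, a checker in this spirit might look like the following minimal sketch; the bug class (unbounded strcpy) and the src/ layout are stand-ins for whatever the AI actually flagged, and the point is just file names, line numbers, and a zero/nonzero exit code:

```python
#!/usr/bin/env python3
# check_strcpy.py -- tiny deterministic checker of the kind described
# above: same input, same output, every run; exit code drives CI.
import re
import sys
from pathlib import Path

PATTERN = re.compile(r"\bstrcpy\s*\(")  # flag unbounded string copies

def main() -> int:
    hits = 0
    for path in Path("src").rglob("*.c"):
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if PATTERN.search(line):
                print(f"{path}:{lineno}: use strlcpy/snprintf instead of strcpy")
                hits += 1
    return 1 if hits else 0  # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(main())
```

Once committed, it runs in seconds and re-checks the same invariant on every build, no model in the loop.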
I have to admit, I expected a couple of "You should rewrite it in Rust" hipster posts by now... Maybe they caught on that those types of posts were not having the effect they thought they would? I kid, I kid... mostly
The Canadian government should probably set up a bug bounty program so I can present some of the findings I made using AI, and then tested or mapped out, on some of their public-facing apps on the App Store/Play Store.