Comment by __0x01
11 hours ago
> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.
The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
LLM generated code will eventually contain UB.
EDIT: added "eventually"
It would already help a lot if the C and C++ standards started cleaning up the list of Undefined Behaviour (e.g. there's a lot of nonsense UB currently in the C standard which could easily become Defined Behaviour, like the "file doesn't end in a new-line character" thing):
https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1
The C committee is cleaning up a lot of UB (check https://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_lo... for paper titles like "slaying earthly demons").
But don't misunderstand the goal of that: C and C++ will never get rid of UB. The result of dereferencing an invalid pointer is UB, will always remain UB, and really cannot be anything other than UB.
The easy cases like you cite are also those that don’t cause problems in practice. I’m not sure that would help all that much, other than to slightly reduce internet criticism.
Fixing easy cases makes the list shorter, which frees up focus for the harder cases.
And it also signals that you actually do want to improve; just a little bit of the boy scout rule goes a long way.
Author here.
> The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
Yup. But the point of the article is that even expert humans cannot do this alone. And as I wrote, LLM+junior won't suffice either. We need LLM+senior experts.
And it's a problem that we have way more existing UB than expert capacity.
Now, will LLMs and experts both miss UB in some cases? Of course. There's no 100% solution. But LLMs, I claim, will find orders of magnitude more, with a low false-positive rate, than any expert. Even if these expert humans (like in the OpenBSD case for the two bugs I found, one of which was UB) are given more than three decades to do it.
I didn't even use the best model, a complex code target, or much time. I just wanted to choose a target with a high chance of already having been audited by very good experts.
Our LLM-powered coding assistants are pretty good at doing lots of busywork that doesn't require all that much smarts. So they can supervise running our UB checks, like Valgrind, and keep the linters happy.
> LLM generated code will eventually contain UB.
Yes.
Even in languages other than C (i.e. you will get behaviour that nothing in the input specified).
When LLMs generate code, all languages have UB.
That's a bit silly.
UB means literally no restrictions. So if your standard says 'you have to crash with an error message' that's already no longer UB.
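To make the distinction concrete, here's a minimal C sketch (the names `index_in_bounds` and `checked_get` are hypothetical, made up for illustration). Reading `a[i]` with `i` out of range is UB; a checked accessor whose contract says "print a message and abort" has a fully specified outcome instead, so it isn't UB any more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: true if index i is valid for an array of n elements. */
static bool index_in_bounds(size_t i, size_t n) {
    return i < n;
}

/* Hypothetical checked accessor. An out-of-range index has a specified
   outcome here (message on stderr, then abort), so it is defined behaviour.
   A raw a[i] with i >= n would be UB. Assumes a points to n valid ints. */
static int checked_get(const int *a, size_t n, size_t i) {
    if (!index_in_bounds(i, n)) {
        fprintf(stderr, "index %zu out of range (size %zu)\n", i, n);
        abort();
    }
    return a[i];
}
```

With `int a[3] = {10, 20, 30}`, `checked_get(a, 3, 1)` returns 20, while `checked_get(a, 3, 3)` prints the message and aborts rather than reading past the array.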
> So if your standard says 'you have to crash with an error message' that's already no longer UB.
Sure. For crashes. But when you instruct an LLM to do something, the output is probabilistic, so you may get behaviour that is unexpected and/or unwanted.
Like storing security tokens in code. Or nuking the production database.