Comment by zmmmmm
13 days ago
My experience with using AI tools for code review is that they do find critical bugs (from my retrospective analysis, maybe 80% of the time), but the signal to noise ratio is poor. It's really hard to get it not to tell you 20 highly speculative reasons why the code is problematic along with the one critical error. And in almost all cases, sufficient human attention would also have identified the critical bug - so human attention is the primary bottleneck here. Thus poor signal to noise ratio isn't a side issue, it's one of the core issues.
As a result, I'm mostly using this selectively so far, and I wouldn't want it turned on by default for every PR.
> but the signal to noise ratio is poor
Nail on the head. Every time I've seen it applied, it's awful at this. However, this is the one thing I loathe in human reviews as well, where people leave twenty comments about naming and the actual FUNCTIONAL issue is buried inside all of that mess. A good code reviewer knows how to drop all the things that irk them and hyperfocus on what matters, if there's a functional issue with the code.
I wonder if AI is ever gonna be able to conquer that one, as it's quite nuanced. If it does, though, I feel the industry as it is today is kinda toast for a lot of developers, because outside of agency, this is the one thing we were sorta holding out on as not very automatable.
At my last job, code review was done directly in your editor (with tooling to show you diffs as well).
What this meant was that instead of leaving nitpicky comments, people would just change things that were nitpicky but clear improvements. They'd only leave comments (which blocked release) for stuff that was interesting enough to discuss.
This was typically a big shock for new hires who were used to the "comment for every nitpick" system; I think it can feel insulting when someone changes your feature. But I quickly came to love it and can't imagine doing code review any other way now. It's so much faster!
I'm not sure how to tie this to AI code review tbh. Right now I don't think I'd trust a model's taste for when to change things and when to leave a comment. But maybe that'll change. I agree that if you automated away my taste for code it'd put me in a weird spot!
Was that Jane Street? I remember watching a presentation from someone there about such a system.
If not, any chance this tooling is openly available?
2 replies →
This is the workflow I've always dreamed of. In a lot of ways making a change which is then submitted as patch to their patch isn't really that different from submitting a comment to their patch. The workflow of doing that directly in editor is just wonderful.
If I had to pick, I actually think ONLY being able to submit "counter-patches" would be better than only being able to submit comments. Comments could just be actual programming language style comments submitted as code changes.
What if you have two people with different ideas of how to name a certain variable and they just flip the name back and forth every release?
I like this review method too, though, and I like that some PR review tools have 'suggest changes' and 'apply changes' buttons now.
5 replies →
If minor mistakes are corrected without the PR author's input, do they ever learn to stop making those mistakes themselves? It seems like a system where you just never bother to learn, e.g., style conventions, because reviewers just apply edits as needed.
> What this meant was that instead of leaving nitpicky comments, people would just change things that were nitpicky but clear improvements. They'd only leave comments (which blocked release) for stuff that was interesting enough to discuss.
This is my dream; I've only had a team with little enough ego to actually achieve it once, for an unfortunately short period of time. If it's something where there's a 99% chance the other person is going to say 'oh yeah, duh' or 'sure, whatever', then it's just wasting both of your time not to just do it.
That said, I've had people get upset over merging their changes for them after a LGTM approval when I also find letting it sit to be a meaningless waste of time.
Interesting approach. I think it could make reviewers more serious about their feedback. Comments are a bit too casual and may contain more "unconstructive" information.
I just this morning had someone "nitpick" on a PR I made and ask for a change that would have broken the code.
If the reviewer can make changes without someone reviewing their change, it's just waiting to blow up in your face.
1 reply →
That sounds great. Was that proprietary tooling? I'd be interested in some such thing.
5 replies →
Naming comments can be very useful in code that gets read by a lot of people. It can make the process of understanding the code much quicker.
On the other hand, if it's less important code or the renaming isn't clearly an improvement, it can be quite useless. But I've met some developers who see reviews as pointless and just say "this works, just approve it already", which can be very frustrating in a codebase with a lot of collaboration.
Naming comments are useful when someone catches something like:
1. you are violating a previously agreed upon standard for naming things
2. inconsistent naming, eg some places you use "catalog ID" and other places you use "item ID" (using separate words and spaces here because case is irrelevant).
3. the name you chose makes it easy to conflate two or more concepts in your system
4. the name you chose calls into question whether you correctly understood the problem domain you are addressing
I'm sure there are other good naming comments, but this is a reasonable representation of the kinds of things a good comment will address.
However, most naming comments are just bike shedding.
4 replies →
> Naming comments can be very useful in code that gets read by a lot of people. It can make the process of understanding the code much quicker.
Yes, but it can be severely diminishing returns. Let's step back a second and ask ourselves if:
var itemCount = items.Count;
vs
var numberOfItems = items.Count;
is ever worth spending the time discussing, versus how much of a soft improvement it makes to the code base. I've literally been in a meeting room with three other senior engineers killing 30 minutes discussing this and I just think that's a complete waste of time. They're not wrong, the latter is clearer, but if you have a PR that improves the repo and you're holding it back because of something like this, then I don't think you have your priorities straight.
10 replies →
A lot of these comments are not pointing out actual issues, just "That's not how I would have done it" type comments.
1 reply →
Human comments tend to be short and sweet, like "nit: rename creatorOfWidgets to widgetFactory". Whereas AI code review comments are long-winded and not as precise. So even if there are 20 human comments, I can easily see which are important and which aren't.
We are using BitBucket at work and decided to turn on RovoDev as a reviewer. It absolutely doesn't do that. Few but relevant comments are the norm, and when we don't like something it says, we tell it in its instructions file to stop doing that. It has been working great!
My coworker is so far on this spectrum it's a problem. He writes sentences with half the words missing making it actually difficult to understand what he is trying to suggest.
All the non-critical words in English aren't useless bloat; they remove ambiguity and act as a kind of error correction if something is wrong.
it "nit" short for nitpick? I think prefixing PR comments with prefixes like that is very helpful for dealing with this problem.
6 replies →
Depends on what you're targeting:
- If it's a rough PR, you're looking for feedback on direction rather than nitpicks.
- If it's in a polished state, it's good to nitpick assuming you have a style guide you're intending to adhere to.
Perhaps this can be provided in the system prompt?
You can do both.
The noise is often what hides the bug in the first place. Aim for more straightforward code and the bug will often surface.
For a while when Node was switching to async from promise chains, people would bring me code or tests that were misfiring and they couldn’t tell why. Often it was because of either a bug in the promise chaining or someone tried to graft async code into the chain and it was an impedance mismatch.
I would start them off by asking them to make the function fully async and then come back. About half the time the code just fixed itself, because the intention of the code was correct but there was some subtle bookkeeping or concurrency issue that was obfuscated. About a quarter of the time the bug popped out and the dev fixed it. And about a quarter of the time there was a legitimate bug that had been sleeping: the function was fine on its own, but something it called was broken.
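A minimal TypeScript sketch of the kind of thing I mean (fetchUser/saveUser/log are made-up stand-ins, not the real code from those reviews):

    type User = { id: string; lastSeen: Date };
    declare function fetchUser(id: string): Promise<User>;
    declare function saveUser(user: User): Promise<void>;
    declare function log(...args: unknown[]): void;

    // Promise-chain version: the grafted-in saveUser() isn't returned,
    // so the chain resolves before the save finishes. Easy to miss.
    function updateUserChained(id: string): Promise<void> {
      return fetchUser(id)
        .then((user) => {
          user.lastSeen = new Date();
          saveUser(user); // missing `return` -- accidental fire-and-forget
        })
        .then(() => log('updated', id));
    }

    // Fully async version: same intent, but the ordering is explicit
    // and a dropped `await` tends to jump out (or get flagged by a linter).
    async function updateUserAsync(id: string): Promise<void> {
      const user = await fetchUser(id);
      user.lastSeen = new Date();
      await saveUser(user);
      log('updated', id);
    }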
There are a lot of situations like that out there. The code distracts from the real problem, but just fixing the “real problem” is a disservice because another real problem will happen later. Make the change easy. That’s always the middle of the solution.
Let me throw something out there: poor naming obscures and distracts from functional issues. You are right about a good reviewer, but a good author strives for clarity in addition to correctness.
As an aside, naming is highly subjective. Like in writing, you tailor naming to the problem domain and the audience.
This is why you should set guidelines for reviews (like e.g. https://go.dev/wiki/CodeReviewComments), and ideally automate as much as possible. I'm guilty of this as well, leaving loads of nitpicky code style comments - but granted, this was before Prettier was a thing. In hindsight, I could've spent all that time building a code formatter myself lol.
If you are nitpicking style or conventions that do not have rules in your linting tools, then those should automatically be non-issues, IMO.
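For example, a naming convention can live in the linter config instead of in review comments. A minimal ESLint flat-config sketch (the specific rules are just an illustration):

    // eslint.config.mjs -- encode the style rules once,
    // then stop litigating them in PRs
    export default [
      {
        rules: {
          camelcase: 'error',                // variable naming enforced here
          'id-length': ['warn', { min: 2 }], // discourage one-letter names
        },
      },
    ];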
I follow a consistent comment pattern[0] that makes blocking vs non-blocking easy to identify.
[0]: https://conventionalcomments.org/
Don't get me started on not CamelCasing acronyms, acronyms aren't more important than regular words! :)
Yeah, or worse, like my boss. We don't have a style guide, but he always wants style changes in every PR, and those style changes are sometimes contradictory across different PRs.
Eventually I've told him "if your comment does not affect performance or business logic, I'm ignoring it". He finally got the message. The fact that he accepted this tells me that deep down he knew his comments were just bike shedding.
I've been in teams like this - people lower on the chain of power get run in circles as they change to appease one, then change to appease another, then change back to appease the first again.
Then, going through their code, they make excuses about their code not meeting the same standards they demand.
As the other responder recommends, a style guide is ideal; you can even create an unofficial one and point to it when conflicting style requests are made.
1 reply →
You should have a style guide, or adopt one. Having uniform code is incredibly valuable as it greatly reduces the cognitive load of reading it. Same reason that Go's verbose "err != nil" works so well.
2 replies →
At Augment Code we specifically built our code review tool to tackle the signal-to-noise problem. In benchmarks, our comments are 2 to 3x more likely to get fixed compared to Bugbot, CodeRabbit, etc.
You should check it out at Augmentcode.com
That's not even mentioning that a not-insignificant part of the point of code review is to propagate understanding of the evolution of the code base among other team members. The reviewer benefits from the act of reviewing as well.
How is that different from today's static analysis (SA) tools, like CodeQL and SonarQube? Most of the feedback is just sh*t and drives programmers toward senseless "perfections" that just double the amount of work that has to be done later to toggle or tune behaviour, because the configurable variables are gone thanks to bad static analysis. Practices that clearly present intent and convenience - making a method virtual, adding a public method, not making a method static when it is likely to use instance fields in the future - are shunned by every SA tool just because the rules are opportunistic, not real.
My experience at work: Claude regularly says to use one method over another, because it's "safer"... But the method doesn't actually exist in that language. Seems to get rather confused between C# and C++, despite also getting told the language, before and after getting handed the code.
It very much depends on the product. In my experience, Copilot has a terrible signal-to-noise ratio, but Bugbot is incredible. Very little noise, and it consistently finds things the very experienced humans on my team didn't.
One thing I've found to be successful is to:
1) give it a fixed number of findings to list, in order of severity, and
2) tell it to grade how serious a problem each one may be.
The human reviewer can then look at the top-ten list and at what the LLM thinks about its own list, for a very low overhead of thinking (i.e. if the LLM thinks its own ideas are dumb, a human probably doesn't need to look into them too hard).
It also helps to explicitly call out types of issue (naming, security, performance, correctness, etc)
The human doesn't owe the LLM any amount of time considering its output; it's just an idea-generating tool. A top-ten list formatted as a table can be scanned in 10 seconds on a first pass.
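Roughly the prompt shape I use, as a TypeScript sketch (callModel is a stand-in for whatever LLM client you happen to have):

    declare function callModel(prompt: string): Promise<string>;

    const REVIEW_PROMPT = [
      'Review the following diff. List at most 10 findings as a table with columns:',
      'rank | category (naming/security/performance/correctness) | severity (1-5) |',
      'confidence it is a real problem (low/med/high) | one-line description.',
      'Order by severity, most serious first. Do not pad the list to reach 10.',
    ].join('\n');

    async function reviewDiff(diff: string): Promise<string> {
      // Low-severity or low-confidence rows can be skipped without guilt;
      // the table is just an idea generator for the human reviewer.
      return callModel(`${REVIEW_PROMPT}\n\n${diff}`);
    }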
I've only managed to use it as a linter-but-on-steroids: where I'd normally page through the Ruby docs about enumerators to find the exact method that does what someone has implemented in a PR (because there's almost always something in there that can help out), I can instead prompt it to look up a more idiomatic version of the implementation for the Ruby version being used. It's easy to cross-check and it saves me some time.
It's not very good with the rest, because there's an intuition that needs to be developed over time that takes all the weirdness into account. The dead code, the tech debt, the stuff that looks fundamentally broken but is depended on because of unintended side effects, etc. The code itself is not enough to explain that, it's not a holistic documentation of the system.
The AI is no different to a person here: something doesn't 'feel' right, you go and fix it, it breaks, so you have to put it back again because it's actually harder than you think to change it.
I've been using it a bit lately, and at first I was enjoying it, but it quickly devolved into finding different minor issues with each minor iteration, including a lovely loop of "check against null rather than undefined", then "check against undefined rather than null", and so on.
> signal to noise ratio is poor
I think this is the problem with just about every tool that examines code.
I've had the same problem with runtime checkers, with static analysis tools, and now ai code reviews.
Might be the nature of the beast.
probably happens with human code reviews too. Lots of style false positives :)
The signal-to-noise ratio problem is unexpectedly difficult.
We wrote about our approach to it some time ago here - https://www.greptile.com/blog/make-llms-shut-up
Much has changed in our approach since then, so we'll probably write a new blog post.
The tl;dr of what makes it hard:
- different people have different ideas of what a nitpick is
- it's not a spectrum, the differences are qualitative
- LLMs are reluctant to risk downplaying the severity of an issue and therefore are unable to usefully filter out nits
- theory: they are paid by the token and so they say more stuff
Very interesting! Yes, everything you say aligns with my experience and instincts.
I agree but find it's fairly easy noise to ignore.
I wouldn't replace human review with LLM-review but it is a good complement that can be run less frequently than human review.
Maybe that's why I find it easy to ignore the noise: I have it do a huge review task after a lot of changes have happened. It'll find 10 or so things, and the top 3 or 4 are likely good ones to look deeper into.
I suspect the noise is largely an artifact of cost optimization. Most tools restrict the context to just the diff to save on input tokens, rather than traversing the full dependency graph. Without seeing the actual definitions or call sites, the model is forced to speculate on side effects.
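A sketch of what "more than the diff" could look like; parseDiff, findDefinition and findCallSites are hypothetical helpers, not a real library:

    declare function parseDiff(diff: string): { changedSymbols: string[] };
    declare function findDefinition(symbol: string): string;
    declare function findCallSites(symbol: string): string[];

    // Pull in the definitions and call sites of whatever the diff touches,
    // so the model can check side effects instead of speculating about them.
    function buildReviewContext(diff: string): string {
      const { changedSymbols } = parseDiff(diff);
      const extra = changedSymbols.flatMap((sym) => [
        findDefinition(sym),    // what the changed code actually calls into
        ...findCallSites(sym),  // who depends on the changed behaviour
      ]);
      return [diff, ...extra].join('\n\n'); // more input tokens, less guessing
    }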
I'd like to see someone try AI-guided property-based testing. The random walk could take a long time to notice problems with the code. I'd trust AI more if it brought proof instead of speculation.
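What I mean by "proof" is something like a property-based check the AI would have to hand over along with its claim. A toy sketch using fast-check (the property itself is just an example):

    import fc from 'fast-check';

    // Toy property: sorting is idempotent -- sorting an already-sorted
    // array changes nothing. On failure, fast-check shrinks the input
    // to a minimal counterexample, i.e. actual evidence.
    fc.assert(
      fc.property(fc.array(fc.integer()), (xs) => {
        const once = [...xs].sort((a, b) => a - b);
        const twice = [...once].sort((a, b) => a - b);
        return JSON.stringify(once) === JSON.stringify(twice);
      }),
    );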
My experience is similar. AI's context is limited to the codebase. It has limited or no understanding of the broader architecture or business constraints, which adds to the noise and makes it harder to surface the issues that actually matter.
It also acts mainly as an advanced linter. The other day it pointed out some overall changes in a piece of code, but didn't catch that the whole thing was useless and could've been replaced with an "on conflict do update" in Postgres.
Now, that could happen with a human reviewer as well. But it didn't catch the context of the change.
For the signal-to-noise reason, I start with Claude Code reviewing a PR, then selectively choose what I want to bubble up to the actual review. Often there's additional context not available to the model, or it's just nitpicky.
Wait, so you have the LLM review, then you review (selectively choose) the proposed review, then you (or the LLM?) review the reviewed review? But often times (so, the majority?) the initial LLM review is useless, so you're reviewing reviews that won't pass review...
Sounds incredibly pointless. But at least you're spending those tokens your boss was forced to buy so the board can tell the investors that they've jumped on the bandwagon, hooray!
Agreed.
I have to constantly push back against it proposing C++ library code, like std::variant, when C-style basics are working great.
You should try Codex. There's a pretty wide gap between the quality of code review tools out there.
I absolutely hate the verbosity of AI. I know that you can give it context; I have done it, and it helps a little. It will still give me 10 "ideas", many of which are closely related to each other.