Comment by kuri-sun

7 days ago

Curious what kinds of bugs the multi-agent setup catches that single-pass review misses in practice. Is it more about coverage (different agents looking at different aspects) or about getting a second opinion on the same aspect? The README has examples, but the mechanism by which the parallelism actually helps isn't obvious to me from them.

7 comments

adamthegoalie 7 days ago

I was thinking about building a GitHub repo made for evaluating code review tools. Something like a complex app (or perhaps a few branches with different options), and then PRs on each branch with varying types and degrees of bugs for a code review tool to find.

I suppose this would not be a 'real' benchmark because it would be public and so you couldn't necessarily trust scores people share about how their own tool did, but it would at least allow anyone to try out code review tools on their own and report relative effectiveness and characteristics.
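A rough sketch of how such a repo could score a tool, assuming each PR ships with a list of its seeded bugs and the tool's findings can be reduced to a file and line. The names `SeededBug`, `Finding`, and `score` are hypothetical, and matching by line range is only one possible rule:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeededBug:
    """A bug deliberately planted in a benchmark PR (hypothetical schema)."""
    file: str
    start_line: int
    end_line: int
    kind: str  # e.g. "logic", "security"


@dataclass(frozen=True)
class Finding:
    """One issue reported by the review tool, reduced to a location."""
    file: str
    line: int


def _matches(finding: Finding, bug: SeededBug) -> bool:
    # Assumption: a finding counts as a hit if it lands inside the bug's line range.
    return finding.file == bug.file and bug.start_line <= finding.line <= bug.end_line


def score(seeded: list[SeededBug], findings: list[Finding]) -> dict[str, float]:
    """Return recall over seeded bugs and precision over reported findings."""
    bugs_hit = [b for b in seeded if any(_matches(f, b) for f in findings)]
    findings_on_target = [f for f in findings if any(_matches(f, b) for b in seeded)]
    return {
        "recall": len(bugs_hit) / len(seeded) if seeded else 0.0,
        "precision": len(findings_on_target) / len(findings) if findings else 0.0,
    }
```

Reporting recall per `kind` would also speak to the original question, since it shows which categories of bug a tool finds rather than only how many.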

I'll post again if I end up finding or building something like that. I couldn't find anything when I looked previously.

I'll also keep in mind your question as I continue testing this, because you are right that it would be useful to be able to describe what is different, not just the magnitude of bugs found.

esafak 7 days ago

https://codereview.withmartian.com/
https://www.greptile.com/benchmarks

lmeyerov 7 days ago

Yes, it's about being comprehensive, so that early or blatant, cheap findings do not distract from the other ones. That's important for base results. Splitting by both file and task is (currently) important.

Additionally, we run in a loop until it stops finding things, and as part of that, do test amplification when it does find any. We regularly see 3-8 rounds yielding valid results.
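As a minimal sketch of the loop described above, assuming a reviewer can be called once per (file, task) pair and told what is already known: `review_until_stable` and `fake_review` are hypothetical names, the reviewer here is a stand-in rather than a real agent call, and test amplification is not modeled.

```python
from typing import Callable, Iterable

Finding = tuple[str, str, str]  # (file, task, message)
Reviewer = Callable[[str, str, frozenset[Finding]], set[Finding]]


def review_until_stable(
    files: Iterable[str],
    tasks: Iterable[str],
    review: Reviewer,
    max_rounds: int = 8,
) -> tuple[set[Finding], int]:
    """Review every (file, task) pair, repeating until a round adds nothing new.

    Returns all findings and the number of rounds run, including the final
    round that found nothing.
    """
    files, tasks = list(files), list(tasks)
    seen: set[Finding] = set()
    for round_no in range(1, max_rounds + 1):
        known = frozenset(seen)  # snapshot shared by every reviewer in this round
        new: set[Finding] = set()
        for f in files:
            for t in tasks:
                new |= review(f, t, known) - seen
        if not new:
            return seen, round_no  # converged
        seen |= new
    return seen, max_rounds  # hit the cap without converging


def fake_review(file: str, task: str, known: frozenset[Finding]) -> set[Finding]:
    """Stand-in reviewer: reports at most one not-yet-known issue per call."""
    pool = {
        ("a.py", "security"): ["sql injection", "hardcoded secret"],
        ("a.py", "tests"): ["missing edge case"],
    }
    for msg in pool.get((file, task), []):
        if (file, task, msg) not in known:
            return {(file, task, msg)}
    return set()


if __name__ == "__main__":
    findings, rounds = review_until_stable(["a.py"], ["security", "tests"], fake_review)
    print(rounds, sorted(findings))
```

Because each call sees only one file and one task, a loud finding in one pair cannot crowd out a quieter one in another pair.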

IMO half the value is customization to your repo, so copying these and specializing them to your repo is super quick and pays off almost immediately: how to find style guides, how to run tests, what dimensions of correctness to look for, etc.

This kind of thing makes me question how important Mythos is for security bug finding: running a high-effort loop with a frontier model in code reviews until convergence has already outperformed human review for us. (It doesn't replace human review, but it does find things we miss, and it catches many of the ones we do see earlier.)

esperent 7 days ago
How do you prevent it from increasing scope?
That's the main issue I've found from running loops like this. Each loop has ~7 agents, say, looking through different lenses (security, UX, performance, etc.). Each one notes a few issues, each issue gets fixed, and you do 5 to 8 loops, as you say. Each individual item that gets fixed looks minor, but when you add it all up at the end, you've increased PR size and scope significantly.
- adamthegoalie 7 days ago
  
  That is such a good point.
  I recently opened a PR against this AI personal finance tool Ray https://github.com/cdinnison/ray-finance/pull/8 to add an Apple Card import feature, since Apple Card is not supported by Plaid.
  I built the manual import feature, opened the PR, and then ran a code review.
  What I hadn't thought about when I built the feature was how many implications of importing data from Apple would have to be considered and integrated into the rest of the app for the manual import to be a first-class feature, not "just a manual import" of data.
  I ended up running adamsreview against it something like 5-10 times before considering it complete, as I learned that there was much more to the integration than I had realized.
  Now, is that necessarily a problem? Maybe not. I should have realized from the start that the import feature was going to be much more than just a small feature. But at least, thanks to the review loop, I got it completely right before the PR was merged.
  
- azurewraith 6 days ago
  
  I've had similar experiences when I throw a bunch of agents at a problem... some things get flagged, but a lot gets truncated in the summarization step. Per-phase constraints solve this naturally, and I think the problem is better suited to being solved serially. Have each specialized 'review' phase scoped to only read and annotate (even better with code-owners style read scoping), with max iterations enforced in deterministic code. The scope can't creep past the constraints you've set for it. Scope explosion comes from agents having unbounded tool access and no transition gates between phases... it will overreach if given the opportunity to.
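One way the serial, gated setup described above might look, as a sketch under stated assumptions: `Phase`, `run_phases`, the `annotate` callback, and the `gate` callback are all hypothetical names, and `annotate` stands in for whatever read-only agent call is actually used. The point is that scope, iteration caps, and phase transitions live in ordinary code rather than in the agent's prompt.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable

Note = tuple[str, str]  # (file, annotation text)


@dataclass(frozen=True)
class Phase:
    name: str
    read_globs: tuple[str, ...]  # code-owners style read scope
    max_iterations: int          # enforced by the runner, not by the agent


def run_phases(
    phases: list[Phase],
    changed_files: list[str],
    annotate: Callable[[str, str], list[str]],  # (phase name, file) -> notes; read-only
    gate: Callable[[str, list[Note]], bool],    # transition gate between phases
) -> dict[str, list[Note]]:
    """Run review phases one after another; each may only annotate files in its scope."""
    results: dict[str, list[Note]] = {}
    for phase in phases:
        in_scope = [
            f for f in changed_files
            if any(fnmatchcase(f, g) for g in phase.read_globs)
        ]
        notes: list[Note] = []
        for _ in range(phase.max_iterations):
            added = False
            for f in in_scope:
                for text in annotate(phase.name, f):
                    if (f, text) not in notes:
                        notes.append((f, text))
                        added = True
            if not added:
                break  # nothing new, stop before the cap
        results[phase.name] = notes
        if not gate(phase.name, notes):
            break  # gate closed: later phases never run
    return results
```

Nothing in the runner applies fixes, so a phase can grow the list of annotations but never the diff itself.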