Comment by eranation

14 hours ago

Ran it on a subset of 10 of the 50 PRs in this benchmark: https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the low precision causes the F1 to tank (~20%; if this stays the same on the full 50-PR sample, it would put it almost last, even lower than Kilo+Grok)
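
For anyone wanting to check the arithmetic, here is a rough sketch of how the F1 follows from the two numbers above. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is just an illustrative helper, not something from the benchmark's code, and the inputs are the approximate figures quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate figures from the 10-PR run described above.
precision = 0.12
recall = 0.74

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # F1 = 20.7%
```

Because the harmonic mean is dominated by the smaller of the two values, the high recall barely helps: the F1 stays close to the precision.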

12 comments

tirpen 8 hours ago

Which LLM did you use? I assume that will make a pretty big difference.

eranation 7 hours ago

gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)
Surprisingly, not as big of a difference as one would hope. It turns out that smarter models are more conservative. Smarter model / more thinking = slightly worse recall sometimes.
I think it perhaps says more about the benchmark itself. Reviews are highly opinionated, and it could be that the smarter models are actually better and it's just that the “golden” state is very opinionated.

bobkb 9 hours ago

False positives from the deterministic audits are a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to be the only way.
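
One way to picture the comparing and deduplicating idea is the sketch below. Everything in it is hypothetical: the `(source, path, line, rule)` tuple shape, the `cross_confirmed` helper and the sample findings are invented for illustration, and real tools would need fuzzier matching than exact equality on file, line and rule.

```python
from collections import defaultdict


def cross_confirmed(findings, min_sources=2):
    """Keep only findings reported by at least `min_sources` distinct sources.

    `findings` is an iterable of (source, path, line, rule) tuples
    (a hypothetical shape chosen for this sketch).
    """
    sources_by_key = defaultdict(set)
    for source, path, line, rule in findings:
        # Identical (path, line, rule) entries collapse into one key,
        # which deduplicates repeats across and within sources.
        sources_by_key[(path, line, rule)].add(source)
    return sorted(
        key for key, sources in sources_by_key.items()
        if len(sources) >= min_sources
    )


# Made-up sample data: one finding confirmed by two sources, one not.
findings = [
    ("linter", "a.py", 10, "unused-var"),
    ("llm", "a.py", 10, "unused-var"),
    ("llm", "b.py", 3, "null-deref"),
]
print(cross_confirmed(findings))  # [('a.py', 10, 'unused-var')]
```

Requiring agreement between sources trades some recall for precision, since an issue found by only one method gets dropped.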

akie 14 hours ago

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

witx 13 hours ago
What? No, they're not. You still need to analyze them to understand that they are false positives. It's time wasted.
- chaoz_ 10 hours ago
  
  Agreed, it's something that will eventually teach your developers to ignore the points raised, since they're mostly garbage.
- onion2k 12 hours ago
  
  Finding problems is optimizing for the customer. Avoiding false positives is optimizing for the developer. Which is right depends on your org's culture.
  
  4 replies →
