Comment by eranation

14 hours ago

Ran it on a subset of 10 of the 50 PRs in this benchmark: https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the precision causes the F1 to tank (~20%; if this stays the same on the full 50-PR sample, it would put it almost last, even lower than Kilo+Grok)
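
For anyone checking the arithmetic, here is a minimal sketch of how the F1 figure follows from the other two numbers. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is just an illustrative helper, not something from the benchmark's code, and the inputs are the rounded figures from the 10-PR subset.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when nothing was found
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the 10-PR subset above.
f1 = f1_score(precision=0.12, recall=0.74)
print(f"F1 = {f1:.1%}")  # roughly 20.7%
```

Because the harmonic mean is dominated by the smaller value, the ~12% precision keeps F1 near 20% no matter how high recall goes.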

Which LLM did you use? I assume that will make a pretty big difference.

  • gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)

    Surprisingly not as big of a difference as one would hope. It turns out that smarter models are more conservative. Smarter model / More thinking = slightly worse recall sometimes.

    I think it says more about the benchmark itself perhaps. Reviews are highly opinionated. And it could be that the smarter models are actually better, just the “golden” state is very opinionated.

False positives from the deterministic audits are a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to be the only way.
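
One possible reading of "comparing and deduplicating across methods" is to keep only findings that more than one independent audit reports at the same location. The sketch below is an assumption about how that could look, not how the tool actually works: the `Finding` type, the source labels, and the `min_sources` threshold are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    source: str   # which audit produced it (hypothetical labels)
    file: str
    line: int
    message: str


def cross_confirmed(findings, min_sources=2):
    """Return (file, line) locations flagged by at least `min_sources` distinct audits."""
    sources_by_location = defaultdict(set)
    for f in findings:
        sources_by_location[(f.file, f.line)].add(f.source)
    return sorted(
        loc for loc, sources in sources_by_location.items()
        if len(sources) >= min_sources
    )


# Hypothetical example data.
findings = [
    Finding("deterministic", "auth.py", 42, "unchecked return value"),
    Finding("llm-a", "auth.py", 42, "error from verify() is ignored"),
    Finding("llm-a", "util.py", 7, "possible off-by-one"),
    Finding("llm-a", "util.py", 7, "possible off-by-one"),  # same source twice
]
confirmed = cross_confirmed(findings)
print(confirmed)  # [('auth.py', 42)]
```

Exact line matching is the crudest possible key; it trades recall for precision, since a real issue that only one audit catches gets dropped.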

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

  • What? No, they're not. You still need to analyze them to understand that they are false positives. It's time wasted.

    • Agreed. It will eventually teach your developers to ignore the points raised, since most of them are garbage.

    • Finding problems is optimizing for the customer. Avoiding false positives is optimizing for the developer. Which is right depends on your org's culture.
