Comment by TuxSH
17 days ago
Bad (though I haven't tested autocompletion). It's underperforming other models on livebench.ai.
With Copilot Pro and DeepSeek's website, I ran "find logic bugs" on a 1200-LOC file I actually needed a code review for:
- DeepSeek R1 found roughly 7 real bugs out of 10 suggestions, with the remaining 3 being acceptable false positives due to missing context
- Claude was about the same, with fewer leftover false positives and no hallucinations either
- Meanwhile, Gemini had a 100% false-positive rate, with many hallucinations and answers that didn't address the prompt
I understand Gemini 2.0 is not a reasoning model, but DeepClaude remains the most effective LLM combo so far.
I have seen Gemini hallucinate ridiculous bugs in a file of less than 1000 LOC while I was scratching my head over what was wrong. The actual issue turned out to be that the cuBLAS matrix multiplication functions expect column-major indexing while the code assumed row-major indexing.
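For anyone unfamiliar with that class of bug, here's a minimal C sketch (not from my file, just an illustration) of why it's so easy to miss: the same flat buffer represents different matrices depending on whether it is read with row-major or column-major indexing, and column-major is what Fortran-style BLAS interfaces such as cuBLAS assume.

```c
#include <stdio.h>

/* Illustration only: a 2x3 matrix filled assuming row-major layout,
 *   1 2 3
 *   4 5 6
 * is silently reinterpreted when read back with column-major indexing
 * (the convention cuBLAS-style routines assume) as
 *   1 3 5
 *   2 4 6
 * which is effectively a different matrix, with no error raised anywhere. */
int main(void) {
    const int rows = 2, cols = 3;
    float a[] = {1, 2, 3, 4, 5, 6};   /* written assuming row-major layout */

    /* Row-major view: element (i, j) lives at a[i * cols + j] */
    printf("row-major view:\n");
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++)
            printf("%4.0f", a[i * cols + j]);
        printf("\n");
    }

    /* Column-major view of the same buffer: element (i, j) at a[j * rows + i] */
    printf("column-major view of the same buffer:\n");
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++)
            printf("%4.0f", a[j * rows + i]);
        printf("\n");
    }
    return 0;
}
```

Because both views index the buffer in bounds, nothing crashes; the results of a matrix multiply are just quietly wrong, which is exactly the kind of bug I wanted the models to spot.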