
Comment by sho_hn

17 days ago

Anyone have a take on how the coding performance (quality and speed) of the 2.0 Pro Experimental compares to o3-mini-high?

The 2 million token window sure feels exciting.

I don't know what those "needle in haystack" benchmarks are testing for, because in my experience dumping a large amount of code into the context doesn't work as you'd expect. It works better if you keep the context small.

  • I think the sweet spot is to include context limited to the scope of the problem and use the longer context window to keep longer conversations going. I often go back to an earlier message in the thread and rewrite it with what I've learned from the longer conversation, so I can keep managing the context window.

  • Claude works well for me when I load code up to around 80% of its 200K context and then ask for changes. If the whole project can't fit, I try to at least get in headers and then the most relevant files. It doesn't seem to degrade. If you're using something like an AI IDE, a lot of the time you don't actually get the full 200K context.

Bad (though I haven't tested autocompletion). It's underperforming other models on livebench.ai.

With Copilot Pro and DeepSeek's website, I ran "find logic bugs" on a 1200 LOC file I actually needed code review for:

- DeepSeek R1 found about 7 real bugs out of 10 suggested, with the remaining 3 being acceptable false positives due to missing context

- Claude was about the same, with fewer remaining bugs; no hallucinations either

- Meanwhile, Gemini had a 100% false positive rate, with many hallucinations and unhelpful answers to the prompt

I understand Gemini 2.0 is not a reasoning model, but DeepClaude remains the most effective LLM combo so far.

  • I have seen Gemini hallucinate ridiculous bugs in a file with fewer than 1000 LOC while I was scratching my head over what was wrong. The issue turned out to be that the cbBLAS matrix multiplication functions expected column-major indexing while the code used row-major indexing.
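  That layout mismatch is easy to reproduce outside any BLAS library. A minimal NumPy sketch (my own illustration, not the commenter's code) showing how interpreting the same row-major buffer as column-major scrambles the matrix:

    ```python
    import numpy as np

    # A 2x3 matrix stored row-major (C order, NumPy's default)
    a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

    # Reinterpret the same flat buffer as column-major (Fortran order),
    # as a column-major BLAS routine would: the elements get scrambled
    b = np.reshape(a.ravel(), (2, 3), order="F")  # [[0, 2, 4], [1, 3, 5]]

    # Every off-diagonal element now disagrees -- the classic layout bug
    assert not np.array_equal(a, b)
    ```

  Passing `a` to a routine that assumes `b`'s layout produces numerically plausible but wrong results, which is exactly the kind of bug that is painful to spot by eye.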