Comment by virgildotcodes

10 days ago

These benchmarks are super impressive. That said, Gemini 3 Pro benchmarked well on coding tasks, and yet I found it abysmal. A distant third behind Codex and Claude.

Tool calling failures, hallucinations, bad code output. It felt like using a coding model from a year ago.

Even just as a general-use model, ChatGPT somehow has smoother web search integration (than Google!!) — it knows when to search without me having to prompt it multiple times.

Not sure what happened there. They have all the ingredients in theory, but they've really fallen behind on actual usability.

Their image models are kicking ass though.