Comment by phillipcarter
5 months ago
Neither a statement for nor against Grok or Anthropic:
I've now just taken to seeing benchmarks as pretty lines or bars on a chart that are in no way reflective of actual ability for my use cases. Claude has consistently scored lower on some benchmarks, but when I use it in a real-world codebase, it's been the only one that doesn't veer off course or "feel wrong". The others do. I can't quantify it, but that's how it goes.
o1 pro is excellent at figuring out complex stuff that Claude misses. It's my go-to mid-level debugging assistant when Claude spins.
I've found the same, but I find o3-mini just as good for that. Sonnet is far better as a general model, but when it's an open-ended technical question that isn't just about code, o3-mini figures it out where Sonnet sometimes doesn't. In those cases o3-mini is less inclined to go with the most "obvious" answer when that answer is wrong.
I have never, in frontend, backend, or Android work, had o1 pro solve a problem Claude 3.5 could not, and I've probably tried it close to 20 times now.
What's the real value of a bunch of random anecdotes on HN? In any case, I've absolutely had the experience of 3.5 falling flat on its face on a very complex coding task while o1 pro nailed it perfectly.
Excited to try 3.7 with reasoning more, but so far it seems like a modest, welcome upgrade rather than any sort of leapfrog past o1 pro.
I've never had o1 figure something out that Claude Sonnet 3.5 couldn't. I can only imagine the gap has widened with 3.7.