Comment by azinman2
5 months ago
To me the biggest surprise was seeing Grok dominate in all of their published benchmarks (which I take with a giant heap of salt). I haven't seen any independent benchmarks of it yet, but it's still interesting nevertheless.
I’m rooting for Anthropic.
Neither a statement for nor against Grok or Anthropic:
I've now just taken to seeing benchmarks as pretty lines or bars on a chart that are in no way reflective of actual ability for my use cases. Claude has consistently scored lower on some benchmarks for me, but when I use it in a real-world codebase, it's consistently been the only one that doesn't veer off course or "feel wrong". The others do. I can't quantify it, but that's how it goes.
o1 pro is excellent at figuring out complex stuff that Claude misses. It's my go-to mid-level debugging assistant when Claude spins.
I've found the same, but o3-mini is just as good for that. Sonnet is far better as a general model, but when it's an open-ended technical question that isn't just about code, o3-mini figures it out while Sonnet sometimes doesn't. In those cases o3-mini is less inclined to go with the most "obvious" answer when that answer is wrong.
I have never, in frontend, backend, or Android, had o1 pro solve a problem Claude 3.5 could not. I've probably tried it close to 20 times by now.
I've never had o1 figure something out that Claude Sonnet 3.5 couldn't. I can only imagine the gap has widened with 3.7.
Grok does the most thinking of all the models I've tried (it can think for 2+ minutes), and that's why it is so good, though I haven't tried Claude 3.7 yet.
Yeah, putting it on the opposite side of that comparison chart was a sleazy but likely effective move.
Indeed. I wonder what the architectures for Claude and Grok 3 are. If they're still dense models, the MoE excitement around R1 was a tad premature...