Comment by manmal

6 days ago

The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?

I've been using o3 extensively since release (and a lot of Deep Research). I also use a lot of Claude and Gemini 2.5 Pro (most of the time, for code, I'll let all of them go at it and iterate on my fav results).

So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner/better-organized code and answers.

I feel like the benchmarks aren't really doing a good job of capturing/reflecting capabilities atm. E.g., while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.

Aside from especially complex/gnarly problems, I feel like a lot of the different models are all good enough, and it comes down to reliability. For example, I've basically stopped using Claude for work because multiple times now it's completely eaten my prompts and even artifacts it's generated. Also, it hits usage limits ridiculously fast (and burns through them even on network/resource failures).

I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others, and so far I haven't caught 4.1 transposing or otherwise mangling numbers (which I have noticed w/ 4o and Sonnet).
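
For concreteness, the kind of code-interpreter task I mean is along these lines; a minimal sketch with made-up sample data (the column names and numbers are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data standing in for whatever table the model was given.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "requests": [1200, 1350, 1100, 1600],
})

# Basic chart w/ matplotlib.
fig, ax = plt.subplots()
ax.bar(df["month"], df["requests"])
ax.set_xlabel("Month")
ax.set_ylabel("Requests")
fig.savefig("requests.png")

# Table -> markdown (df.to_markdown needs the tabulate package installed).
print(df.to_markdown(index=False))
```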

Having tested most of the leading-edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls with (particularly with translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (e.g., actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.

(I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbh, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)

  • I wonder if we'll start to see artisanal benchmarks. You -- and I -- have preferred models for certain tasks. There's a world in which we start to see how things score on the "simonw chattiness index" and come to rely on smaller, more specific benchmarks, I think.

    • Yeah, I think personalized evals will definitely be a thing. Between reviewing way too much Arena and WildChat, and having now seen lots of live traces firsthand, I can say there's a wide range of LLM usage (and preferences) that really doesn't match my own tastes or requirements, lol.

      For the past year or two, I've had my own personal 25-question vibe check I've used to kick the tires on new models, but I think the future is something both a little more rigorous and a little more automated (something like an LLM Jury w/ UltraFeedback-style criteria based on your own real-world exchanges, then BTL-ranked - see the sketch after this thread)? A future project...

    • I think it's more likely that we move away from benchmarks and towards more of a traditional reviewer model. People will find LLM influencers whose takes they agree with and follow them to keep up with new models.

  • Thanks for your input, much appreciated. Just in case you didn't mean Claude Code: it's really good in my experience and mostly stable. If something fails, it just retries and I don't notice it much. Its autonomous discovery and tool use is really good and I'm relying more and more on it.

    • For the Claude issues, I'm referring to the claude.ai frontend. While I use some Codex, Aider, and other agentic tools, I found Claude Code to be not to my taste - for my uses it tended to burn a lot of tokens and gave relatively mediocre results, but I know it works well for others, so YMMV.
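
To make the BTL idea above concrete: given pairwise preferences between models (e.g. from an LLM jury), Bradley-Terry-Luce strengths can be fit with a simple MM iteration. A minimal sketch with made-up win counts; the model names and numbers are placeholders, not real data:

```python
# Bradley-Terry-Luce (BTL) fit via Hunter's MM algorithm.
# wins[i][j] = number of times model i beat model j in pairwise judgments.
def bradley_terry(wins, iters=200):
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(w_i / denom if denom else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # renormalize each round
    return p

# Made-up win counts between three hypothetical models A, B, C.
wins = [
    [0, 7, 9],  # A beat B 7 times and C 9 times
    [3, 0, 6],  # B beat A 3 times and C 6 times
    [1, 4, 0],  # C beat A 1 time and B 4 times
]
for name, strength in zip("ABC", bradley_terry(wins)):
    print(f"{name}: {strength:.3f}")
```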

I am starting to feel like hallucination is a fundamentally unsolvable problem with the current architecture, and is going to keep squeezing the benchmarks until something changes.

At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.

  • If you could actually teach these models things, not just in the current context but as persistent learning over time, then that would alleviate a lot of the hallucination issues. I imagine being able to say "that method doesn't exist, don't recommend it again", give it the documentation, and have it absorb that information permanently; that would fundamentally change how we interact with these models. But can that work for models hosted for everyone to use at once?

    • There are an almost infinite number of things that can be hallucinated, though. You can't maintain a list of scientific papers or legal cases that don't exist! Hallucinations (almost certainly) aren't specific falsehoods that need to be erased...

  • Hallucination rates from o3 onward appear to be very low, to the point that I rarely have to check.

Don’t they have a full-fledged version of o4 somewhere internally at this point?

  • It seems they do. o1 and o3 were based on the same base model; o4 is going to be based on a newer (and perhaps smarter) base model.

It's the same model as o3, just with thinking tokens turned up to the max.

  • That's simply not true, it's not just "max thinking budget o3", just like o1-pro wasn't "max thinking budget o1". The specifics are unknown, but they might be doing multiple model generations and then somehow picking the best answer each time (a rough sketch of that pattern is at the end of this thread). Of course that's a gross simplification, but some assume that they do it this way.

    • > "We also introduced OpenAI o3-pro in the API—a version of o3 that uses more compute to think harder and provide reliable answers to challenging problems"

      Sounds like it's just o3 with a higher thinking budget to me

    • > That's simply not true, it's not just "max thinking budget o3"

      > The specifics are unknown, but they might...

      Hold up.

      > but some assume that they do it this way.

      Come on now.
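
For reference, the "multiple generations, pick the best" speculation above is essentially best-of-n sampling. A minimal sketch of that pattern; generate() and score() are hypothetical stand-ins, and nothing here is known to reflect how o3-pro actually works:

```python
import random

# Hypothetical stand-ins: in a real system these would call a model
# and a judge/reward model. Purely illustrative.
def generate(prompt: str) -> str:
    return f"candidate {random.randint(0, 999)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    return random.random()  # placeholder for a learned scorer or LLM judge

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n candidates, keep the one the scorer likes best.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("Explain the difference between o3 and o3-pro."))
```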