Comment by manmal

6 days ago

The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?

I've been using o3 extensively since release (and a lot of Deep Research). I also use a lot of Claude and Gemini 2.5 Pro (most of the time, for code, I'll let all of them go at it and iterate on my fav results).

So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner/better-organized code and answers.

I feel like the benchmarks aren't really doing a good job of capturing/reflecting capabilities atm. E.g., while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.

Aside from especially complex/gnarly problems, I feel like a lot of the different models are all good enough, and it comes down to reliability. For example, I've basically stopped using Claude for work because multiple times now it's completely eaten my prompts and even artifacts it's generated. Also, it hits usage limits ridiculously fast (and burns through them even on network/resource failures).

I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others, and so far I haven't caught 4.1 transposing or otherwise mangling numbers (which I have noticed w/ 4o and Sonnet).
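
For concreteness, the kind of code-interpreter task I mean is along these lines; a minimal sketch with made-up sample data (the column names and numbers are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data standing in for whatever table the model was given.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "requests": [1200, 1350, 1100, 1600],
})

# Basic chart w/ matplotlib.
fig, ax = plt.subplots()
ax.bar(df["month"], df["requests"])
ax.set_xlabel("Month")
ax.set_ylabel("Requests")
fig.savefig("requests.png")

# Table -> markdown (df.to_markdown needs the tabulate package installed).
print(df.to_markdown(index=False))
```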

Having tested most of the leading-edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls with (particularly with translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (e.g., actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.

(I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbh, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)

  • I wonder if we'll start to see artisanal benchmarks. You -- and I -- have preferred models for certain tasks. There's a world in which we start to see how things score on the "simonw chattiness index" and come to rely on smaller, more specific benchmarks, I think.

    • Yeah, I think personalized evals will definitely be a thing. Between reviewing way too much Arena and WildChat, and having now seen lots of live traces firsthand, I can say there's a wide range of LLM usage (and preferences) that really doesn't match my own tastes or requirements, lol.

      For the past year or two, I've had my own personal 25-question vibe check I've used to kick the tires on new models, but I think the future is something both a little more rigorous and a little more automated (something like an LLM Jury w/ UltraFeedback-style criteria based on your own real-world exchanges, then BTL-ranked - see the sketch after this thread)? A future project...

    • I think it's more likely that we move away from benchmarks and towards more of a traditional reviewer model. People will find LLM influencers whose takes they agree with and follow them to keep up with new models.

  • Thanks for your input, much appreciated. Just in case you didn't mean Claude Code: it's really good in my experience and mostly stable. If something fails, it just retries and I don't notice it much. Its autonomous discovery and tool use is really good and I'm relying more and more on it.

    • For the Claude issues, I'm referring to the claude.ai frontend. While I use some Codex, Aider, and other agentic tools, I found Claude Code to be not to my taste - for my uses it tended to burn a lot of tokens and gave relatively mediocre results, but I know it works well for others, so YMMV.
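
To make the BTL idea above concrete: given pairwise preferences between models (e.g. from an LLM jury), Bradley-Terry-Luce strengths can be fit with a simple MM iteration. A minimal sketch with made-up win counts; the model names and numbers are placeholders, not real data:

```python
# Bradley-Terry-Luce (BTL) fit via Hunter's MM algorithm.
# wins[i][j] = number of times model i beat model j in pairwise judgments.
def bradley_terry(wins, iters=200):
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(w_i / denom if denom else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # renormalize each round
    return p

# Made-up win counts between three hypothetical models A, B, C.
wins = [
    [0, 7, 9],  # A beat B 7 times and C 9 times
    [3, 0, 6],  # B beat A 3 times and C 6 times
    [1, 4, 0],  # C beat A 1 time and B 4 times
]
for name, strength in zip("ABC", bradley_terry(wins)):
    print(f"{name}: {strength:.3f}")
```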

I am starting to feel like hallucination is a fundamentally unsolvable problem with the current architecture, and is going to keep squeezing the benchmarks until something changes.

At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.

  • If you could actually teach these models things, not just in the current context but as persistent learning over time, then that would alleviate a lot of the hallucination issues. I imagine being able to say "that method doesn't exist, don't recommend it again", give it the documentation, and have it absorb that information permanently; that would fundamentally change how we interact with these models. But can that work for models hosted for everyone to use at once?

    • There are an almost infinite number of things that can be hallucinated, though. You can't maintain a list of scientific papers or legal cases that don't exist! Hallucinations (almost certainly) aren't specific falsehoods that need to be erased...

  • Hallucination rates from o3 onward appear to be very low, to the point that I rarely have to check.

Don’t they have a full-fledged version of o4 somewhere internally at this point?

  • It seems they do. o1 and o3 were based on the same base model; o4 is going to be based on a newer (and perhaps smarter) base model.

It's the same model as o3, just with thinking tokens turned up to the max.

  • That's simply not true, it's not just "max thinking budget o3", just like o1-pro wasn't "max thinking budget o1". The specifics are unknown, but they might be doing multiple model generations and then somehow picking the best answer each time (a rough sketch of that pattern is at the end of this thread). Of course that's a gross simplification, but some assume that they do it this way.

    • > "We also introduced OpenAI o3-pro in the API—a version of o3 that uses more compute to think harder and provide reliable answers to challenging problems"

      Sounds like it's just o3 with a higher thinking budget to me

    • > That's simply not true, it's not just "max thinking budget o3"

      > The specifics are unknown, but they might...

      Hold up.

      > but some assume that they do it this way.

      Come on now.
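
For reference, the "multiple generations, pick the best" speculation above is essentially best-of-n sampling. A minimal sketch of that pattern; generate() and score() are hypothetical stand-ins, and nothing here is known to reflect how o3-pro actually works:

```python
import random

# Hypothetical stand-ins: in a real system these would call a model
# and a judge/reward model. Purely illustrative.
def generate(prompt: str) -> str:
    return f"candidate {random.randint(0, 999)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    return random.random()  # placeholder for a learned scorer or LLM judge

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n candidates, keep the one the scorer likes best.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("Explain the difference between o3 and o3-pro."))
```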