OpenAI o3-pro

5 days ago (help.openai.com)

207 comments

mfiguiere

I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing slight improvement, I have seen that each new generation feels actively better at exactly the same tasks I gave the previous generation.

It would be interesting if there was a model that was specifically trained on task-oriented data. It's my understanding they're trained on all data available, but I wonder if it can be fine-tuned or given some kind of reinforcement learning on breaking down general tasks to specific implementations. Essentially an agent-specific model.

codingwagie 5 days ago
I'm seeing big advances that aren't shown in the benchmarks. I can simply build software now that I couldn't build before. The level of complexity that I can manage and deliver is higher.
- IanCal 4 days ago
  
  A really important thing is the distinction between performance and utility.
  Performance can improve linearly and utility can be massively jumpy. For some people/tasks performance can have improved but it'll have been "interesting but pointless" until it hits some threshold and then suddenly you can do things with it.
- shmoogy 5 days ago
  
  Yeah I kind of feel like I'm not moving as fast as I did, because the complexity and features grow - constant scope creep due to moving faster.
- protocolture 4 days ago
  
  I am finding that my ability to use it to code aligns almost perfectly with increasing token memory.
- kevinqi 4 days ago
  
  yeah, the benchmarks are just a proxy. o3 was a step change where I started to really be able to build stuff I couldn't before
- alightsoul 5 days ago
  
  Mind sharing some examples?
  
- iLoveOncall 4 days ago
  
  Okay but this has all to do with the tooling and nothing to do with the models.
  
energy123 5 days ago
That would require AIME 2024 going above 100%.
There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.
Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.
If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.
- croddin 5 days ago
  
  There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:
  "ARC-AGI-1:
  * Low: 44%, $1.64/task
  * Medium: 57%, $3.18/task
  * High: 59%, $4.16/task
  ARC-AGI-2:
  * All reasoning efforts: <5%, $4-7/task
  Takeaways:
  * o3-pro in line with o3 performance
  * o3's new price sets the ARC-AGI-1 Frontier"
  - https://x.com/arcprize/status/1932535378080395332
  
littlestymaar 4 days ago
> I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro.
This kind of expectation explains why there hasn't been a GPT-5 so far, and why we get a dumb numbering scheme instead for no reason.
At least Claude eventually decided not to care anymore and release Claude 4 even if the jump from 3.7 isn't particularly spectacular. We're well into the diminishing returns at this point, so it doesn't really make sense to postpone the major version bump, it's not like they're going to make a big leap again anytime soon.
- indigo945 4 days ago
  
  I have tried Claude 4.0 for agentic programming tasks, and it really outperforms Claude 3.7 by quite a bit. I don't follow the benchmarks - I find them a bit pointless - but anecdotally, Claude 4.0 can help me in a lot of situations where 3.7 would just flounder, completely misunderstand the problem and eventually waste more of my time than it saves.
  Besides, I do think that Google Gemini 2.0 and its massively increased token memory was another "big leap". And that was released earlier this year, so I see no sign of development slowing down yet.
- Voloskaya 4 days ago
  
  > We're well into the diminishing returns at this point
  Scaling laws, by definition, have always had diminishing returns because it's a power-law relationship with compute/params/data, but I am assuming you mean diminishing beyond what the scaling laws predict.
  Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.
  Because of that power law relationship, it requires adding a lot of compute/params/data to see a big jump, rule of thumb is you have to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the trend of using major numbers to denote when they more than 10x the training scale of the previous model.
  * GPT-1 was 117M parameters.
  * GPT-2 was 1.5B params (~10x).
  * GPT-3 was 175B params (~100x GPT-2 and roughly 10x Turing-NLG, the biggest previous model).
  After that it becomes blurrier, as we switched to MoEs (and stopped publishing); scaling laws for parameters apply to monolithic models, not really to MoEs.
  But looking at compute we know GPT-3 was trained on ~10k V100, while GPT-4 was trained on a ~25k A100 cluster, I don't know about training time, but we are looking at close to 10x compute.
  So to train a GPT-5-like model, we would expect ~250k A100, or ~150k B200 chips, assuming same training time. No one has a cluster of that size yet, but all the big players are currently building it.
  So OpenAI might just be reserving GPT-5 name for this 10x-GPT-4 model.
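
For anyone who wants to check the ratios in the comment above, here is a minimal sketch. The GPT parameter counts and the 25k A100 figure are the ones quoted in the comment; the 17B figure for Turing-NLG is an added assumption taken from Microsoft's announcement, and the flat "10x per major version" rule is just the commenter's rule of thumb, not a confirmed OpenAI policy.

```python
# Parameter counts as quoted in the comment; Turing-NLG's 17B is an
# added figure (assumption: Microsoft's announced size), not from the thread.
PARAMS = {
    "GPT-1": 117e6,
    "GPT-2": 1.5e9,
    "Turing-NLG": 17e9,
    "GPT-3": 175e9,
}


def ratio(newer: str, older: str) -> float:
    """How many times larger `newer` is than `older`, by parameter count."""
    return PARAMS[newer] / PARAMS[older]


def next_cluster(gpus: int, factor: float = 10.0) -> int:
    """GPU count after scaling by the rule-of-thumb factor (same chip, same training time)."""
    return int(gpus * factor)


if __name__ == "__main__":
    for newer, older in [("GPT-2", "GPT-1"), ("GPT-3", "GPT-2"), ("GPT-3", "Turing-NLG")]:
        print(f"{newer} / {older}: {ratio(newer, older):.1f}x")
    print("10x a 25k A100 cluster:", next_cluster(25_000), "A100s")
```

The A100-to-B200 conversion is left out on purpose, since it depends on a per-chip speedup the thread does not state.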
  
XCSme 5 days ago
I remember the saying that from 90% to 99% is a 10x increase in accuracy, but 99% to 99.999% is a 1000x increase in accuracy.
Even though it's a large 10% increase first, then only a 0.999% increase.
- zmgsabst 4 days ago
  
  Sometimes it’s nice to frame it the other way, eg:
  90% -> 1 error per 10
  99% -> 1 error per 100
  99.99% -> 1 error per 10,000
  That can help to see the growth in accuracy, when the numbers start getting small (and why clocks are framed as 1 second lost per…).
  
- bobbylarrybobby 4 days ago
  
  I think the proper way to compare probabilities/proportions is by odds ratios. 99:1 vs 99999:1. (So a little more than 1000x.) This also lets you talk about “doubling likelihood”, where twice as likely as 1/2=1:1 is 2:1=2/3, and twice as likely again is 4:1=4/5.
- jsjohnst 4 days ago
  
  The saying goes:
  From 90% to 99% is a 10x reduction in error rate, but 99% to 99.999% is a 1000x decrease in error rates.
- AtlasBarfed 3 days ago
  
  What's the required computation power for those extra 9s? Is it linear, poly, or exponential?
  Imo we got to the current state by harnessing GPUs for a 10-20x boost over CPUs. Well, and cloud parallelization, which is ?100x?
  ASIC is probably another 10x.
  But the training data may need to vastly expand, and that data isn't going to 10x. It's probably going to degrade.
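
The accuracy arithmetic in this subthread (error-rate reduction, odds ratios, "doubling likelihood") can be checked with a short sketch. The function names are made up for illustration; exact fractions are used so the ratios are not blurred by floating point.

```python
from fractions import Fraction


def error_reduction(acc_before, acc_after):
    """Factor by which the error rate shrinks between two accuracies."""
    return (1 - acc_before) / (1 - acc_after)


def odds(p):
    """Odds in favour for a probability p (e.g. 0.99 gives 99, i.e. 99:1)."""
    return p / (1 - p)


def prob_from_odds(o):
    """Inverse of odds(): 2 (i.e. 2:1) maps back to 2/3."""
    return o / (1 + o)


if __name__ == "__main__":
    a, b, c = Fraction("0.90"), Fraction("0.99"), Fraction("0.99999")
    print("90% -> 99% error reduction:", error_reduction(a, b))
    print("99% -> 99.999% error reduction:", error_reduction(b, c))
    print("odds ratio 99.999% vs 99%:", float(odds(c) / odds(b)))
    half = Fraction(1, 2)
    print("twice as likely as 1/2:", prob_from_odds(2 * odds(half)))
    print("twice as likely again:", prob_from_odds(4 * odds(half)))
```

The error-rate framing and the odds framing agree closely at high accuracy, which is why the odds ratio comes out only slightly above the error-rate factor.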
avereveard 4 days ago

There's a new set of metrics that capture advances better than MMLU or its Pro version, but nothing yet as standardized, and very few have a hidden set of tests to keep advances from coming from targeted fine-tuning.
jstummbillig 5 days ago
It's hard to be 100% certain, but I am 90% certain that the benchmarks leveling off, at this point, should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).
- motorest 5 days ago
  
  > (...) at this point, should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).
  I don't know about that. I think it's mainly because nowadays LLMs can output very inconsistent results. In some applications they can generate surprisingly good code, but during the same session they can also make missteps and shit the bed while following a prompt to make small changes. For example, sometimes I still get prompt responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new method" and in response the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?
- alightsoul 5 days ago
  
  either that or the improvements aren't as large as before.

chad1n 5 days ago

The guys in the other thread who said that OpenAI might have quantized o3 and that's how they reduced the price might be right. This o3-pro might be the actual o3-preview from the beginning and the o3 might be just a quantized version. I wish someone would benchmark all of these models to check for drops in quality.

simonw 5 days ago
That's definitely not the case here. The new o3-pro is slow - it took two minutes just to draw me an SVG of a pelican riding a bicycle. o3-preview was much faster than that.
https://simonwillison.net/2025/Jun/10/o3-pro/
- teruakohatu 4 days ago
  
  Do you think a cycling pelican is still a valid cursory benchmark? By now surely discussions about it are in the training set.
  There are quite a few on Google Image search.
  On the other hand they still seem to struggle!
- FergusArgyll 5 days ago
  
  Wow! pelican benchmark is now saturated
  
- CamperBob2 5 days ago
  
  Would you say this is the best cycling pelican to date? I don't remember any of the others looking better than this.
  Of course by now it'll be in-distribution. Time for a new benchmark...
  
- AstroBen 5 days ago
  
  That's one good looking pelican
- torginus 3 days ago
  
  This made me think of the 'draw a bike experiment', where people were asked to draw a bike from memory, and were surprisingly bad at recreating how the parts fit together in a sensible manner:
  https://road.cc/content/blog/90885-science-cycology-can-you-...
  ChatGPT seems to perform better than most, but with notable missing elements (where's the chain or the handlebars?). I'm not sure if those are due to a lack of understanding, or artistic liberties taken by the model?
- k2xl 5 days ago
  
  Not distilled, same model. https://x.com/therealadamg/status/1932534244774957121?s=46&t...
  
- eru 4 days ago
  
  Well, that might be more of a function of how long they let it 'reason' than anything intrinsic to the model?
- Terretta 5 days ago
  
  > It's only available via the newer Responses API
  And in ChatGPT Pro.
torginus 4 days ago

I've wondered if some kind of smart pruning is possible during evaluation.
What I mean by that is: if a neuron implements a sigmoid function and its input weights are 10, 1, 2, 3, then if the first input is active, evaluating the other ones is mathematically pointless, since it doesn't change the result, which recursively means the inputs of those neurons that contribute to the precursors are pointless as well.
I have no idea how feasible or practical it is to implement such an optimization at full network scale, but I think it's interesting to think about.
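
To put a number on the sigmoid example above, here is a rough sketch. It assumes inputs in [0, 1], non-negative weights and no bias term, none of which the comment specifies; `max_output_shift` is a hypothetical helper, not an existing pruning technique from any library. Under those assumptions the skipped inputs can only move the output by a small amount once the neuron is saturated, so the result is approximately (not exactly) unchanged.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def max_output_shift(known_sum: float, remaining_weights: list[float]) -> float:
    """Upper bound on how much the sigmoid output can change if the remaining
    inputs are skipped, assuming inputs in [0, 1] and non-negative weights."""
    lowest = sigmoid(known_sum)                            # all skipped inputs are 0
    highest = sigmoid(known_sum + sum(remaining_weights))  # all skipped inputs are 1
    return highest - lowest


if __name__ == "__main__":
    rest = [1.0, 2.0, 3.0]
    # First input (weight 10) active: the neuron is already saturated.
    print("first input active:  ", max_output_shift(10.0, rest))
    # First input inactive: the other inputs still matter a lot.
    print("first input inactive:", max_output_shift(0.0, rest))
```

Whether a bound like this is worth computing at full network scale is exactly the open question in the comment; the sketch only shows the single-neuron arithmetic.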
gkamradt 5 days ago

o3-pro is not the same as the o3-preview that was shown in Dec '24. OpenAI confirmed this for us. More on that here: https://x.com/arcprize/status/1932535380865347585
weinzierl 5 days ago

Is there a way to figure out likely quantization from the output? I mean, does quantization degrade output quality in certain ways that are different from modifications of other model properties (e.g. size or distillation)?
hapticmonkey 4 days ago
What a great future we are building. If AI is supposed to run everything, everywhere....then there will be 2, maybe 3, AI companies. And nobody outside those companies knows how they work.
- eru 3 days ago
  
  What makes you think so? So far, many new AI companies are sprouting and many of them seem to be able to roughly match the state-of-the-art very quickly. (But pushing the frontier seems to be harder.)
  From the evidence we have so far, it does not look like there's any natural monopoly (or even natural oligopoly) in AI companies. Just the opposite. Especially with open weight models, or even more so, completely open source models.
- jsjohnst 3 days ago
  
  > And nobody outside those companies knows how they work.
  I think you meant to say:
  And nobody knows how they work.

manmal 5 days ago

The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?

lhl 4 days ago
I've been using o3 extensively since release (and a lot of Deep Research). I also use a lot of Claude and Gemini 2.5 Pro (most of the time, for code I'll let all of them go at it and iterate on my fav results).
So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner/more well organized code and answers.
I feel like the benchmarks aren't really doing a good job at capturing/reflecting capabilities atm. eg, while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.
Besides especially complex/gnarly problems, I feel like a lot of the different models are all good enough and it comes down to reliability. For example, I've stopped using Claude for work basically because multiple times now it's completely eaten my prompts and even artifacts it's generated. Also, it hits limits ridiculously fast (and does so even when on network/resource failures).
I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others and so far I haven't caught 4.1 transposing/having errors with numbers (which I've noticed w/ 4o and Sonnet).
Having tested most of the leading edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls (particularly with translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (eg, of actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.
(I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbt, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)
- petesergeant 4 days ago
  
  I wonder if we'll start to see artisanal benchmarks. You -- and I -- have preferred models for certain tasks. There's a world in which we start to see how things score on the "simonw chattiness index", and come to rely on smaller more specific benchmarks I think
  
- manmal 4 days ago
  
  Thanks for your input, much appreciated. Just in case you didn’t mean Claude Code, it’s really good in my experience and mostly stable. If something fails, it just retries and I don’t notice it much. Its autonomous discovery and tool use are really good and I’m relying more and more on it.
  
petesergeant 4 days ago
I am starting to feel like hallucination is a fundamentally unsolvable problem with the current architecture, and is going to keep squeezing the benchmarks until something changes.
At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.
- OccamsMirror 4 days ago
  
  If you could actually teach these models things, not just in the current context, but as temporal learning, then that would alleviate a lot of the issues of hallucination. I imagine being able to say "that method doesn't exist, don't recommend it again" and then give it the documentation and it would absorb that information permanently, that would fundamentally change how we interact with these models. But can that work for models hosted for everyone to use at once?
  
- varjag 3 days ago
  
  Hallucination rates from o3 onward appear to be very low, to the point I rarely have to check.
  
dyauspitr 5 days ago
Don’t they have a full fledged version of o4 somewhere internally at this point?
- ankit219 5 days ago
  
  They do, it seems. o1 and o3 were based on the same base model. o4 is going to be based on a newer (and perhaps smarter) base model.
bachittle 5 days ago
it's the same model as o3, just with thinking tokens turned up to the max.
- Tiberium 5 days ago
  
  That's simply not true, it's not just "max thinking budget o3" just like o1-pro wasn't "max thinking budget o1". The specifics are unknown, but they might be doing multiple model generations and then somehow picking the best answer each time? Of course that's a gross simplification, but some assume that they do it this way.
  

mark_l_watson 5 days ago

I am still not willing to upgrade to a Pro account. I pay $20 a month for both Gemini and ChatGPT, and for what I need this is currently enough.

I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on AI research and sometimes practical applications for my life since then.

I can easily afford $200 for a Pro account but I get this nagging feeling that LLMs are not the final path to the powerful AI I have always dreamed of and I don't want to support this level of hype.

I have lived through a few AI winters and I worry that accountants will tally up the costs, environmental and money, versus the benefits and that we collectively have an 'oh shit' moment.

baq 4 days ago
LLMs would be transformative technology if all progress stopped today if only for their NLP capabilities, but the recent models obviously do so much more than that. Winter isn’t coming in that regard; what might happen if models won’t get smarter from here is a race to the bottom in token prices, which would still be not bad at all for token buyers.
- buu700 4 days ago
  
  Agreed. I've said exactly the same thing before. If GPT-4 from two years ago had turned out to be the endgame of LLM technology, and we collectively spent the following 20 years integrating those capabilities throughout the economy, even that would be a profound change to the world as we know it.
  If we froze LLM technology at present-day capabilities and spent the next 20 years on that, I'd expect it to ultimately look transformative in a similar way to the Internet. I mean if you told me in fall 2022 that 2.5 years later I'd be building software by meta-prompting and meta-meta-prompting AI agents to write code overnight while I slept, I'd assume that we were fictional characters in a Black Mirror episode.
daxfohl 1 day ago

Can you expand any more on the nagging feeling?
jwrallie 4 days ago

I have trouble justifying the $20 tier when compared to other offers for similar service from other providers. I think OpenAI should, every once in a while, offer a new feature with no delay to their Plus tier, with lots of limits of course.
linkage 3 days ago
You don't need a Pro account. I'm on the free tier and I'm paying for o3-pro via the API. I spent just $3.70 in credits yesterday to compare it against Claude 4 Opus.
- johncoatesdev 3 days ago
  
  what do you use as a client? Most open source clients don't seem to support the new endpoint that o3-pro requires.

swyx 5 days ago

here's a nice user review we published: https://www.latent.space/p/o3-pro

sama's highlight[0]:

> "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future."

I kept nudging the team to go the whole way to just let o3 be their CEO but they didn't bite yet haha

0: https://x.com/sama/status/1932533208366608568

tomComb 5 days ago
Big fan swyx, but both here and in the article there is some bragging about being quoted by sama, and while I acknowledge that that’s not out of the ordinary, I’m concerned about where it leads: what it takes to get quoted by sama (or similar interested party) is saying something good about his product, and having a decent follower count.
Dangerous incentives IMO.
- swyx 5 days ago
  
  acked. in my defense i didnt write the article + ben already had a good track record from the o1 article. while our relationship with oai is v v v impt to us, we've also covered negative openai stories: https://www.latent.space/p/clippy-v-anton and will continue to give balanced coverage with the other labs when they do well.
  we are definitely not seeking to be openai sycophants, nor would they want us to be.
alightsoul 5 days ago
if o3 is so good why aren't they using it to replace management?
- martin_corredor 4 days ago
  
  It's been 1 day
  The technology needs to diffuse through and find its equilibrium within the market
  You could say 3.5/3.7 Sonnet was good enough to replace some juniors but the juniors didn't get replaced immediately - it has a lag in time for it to ripple through

WhitneyLand 5 days ago

So, we currently have o4-mini and o4-mini-high, which represent medium and high usage of “thinking” or use of reasoning tokens.

This announcement adds o3-pro, which pairs with o3 in the same way the o4 models go together.

It should be called o3-high, but to align with the $200 pro membership it’s called pro instead.

That said, o3 is already an incredibly powerful model. I prefer it over the new Anthropic 4 models and Gemini 2.5. Its raw power seems similar to those others, but it’s so good at inline tool use it usually comes out ahead overall.

Any non-trivial code generation/editing should be using an advanced reasoning model, or else you’re losing time fixing more glitches or missing out on better quality solutions.

Of course the caveat is cost, but there’s value on the frontier.

boole1854 5 days ago
No, this doesn't seem to be correct, although confusion regarding model names is understandable.
o4-mini-high is the label on chatgpt.com for what in the API is called o4-mini with reasoning={"effort": "high"}. Whereas o4-mini on chatgpt.com is the same thing as reasoning={"effort": "medium"} in the API.
o3 can also be run via the API with reasoning={"effort": "high"}.
o3-pro is different than o3 with high reasoning. It has a separate endpoint, and it runs for much longer.
See https://platform.openai.com/docs/guides/reasoning?api-mode=r...
- johnecheck 4 days ago
  
  OpenAI started strong in the naming department (ChatGPT, DALL-E) then fell off so hard since.
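
The label-to-API mapping boole1854 describes above can be sketched as request payloads for the Responses API. The mapping itself comes from that comment; `label_to_request` and `send` are hypothetical helper names, and the lack of an explicit effort setting for o3-pro is an assumption (the comment only says it has a separate endpoint and runs longer). `send` needs the `openai` package and an `OPENAI_API_KEY`, so it is not called here.

```python
# ChatGPT model-picker label -> (API model name, reasoning effort or None),
# following the mapping described in the comment above.
LABELS = {
    "o4-mini": ("o4-mini", "medium"),
    "o4-mini-high": ("o4-mini", "high"),
    "o3 (high effort, API only)": ("o3", "high"),
    "o3-pro": ("o3-pro", None),  # assumption: no explicit effort setting
}


def label_to_request(label: str, prompt: str) -> dict:
    """Build keyword arguments for a Responses API call."""
    model, effort = LABELS[label]
    request = {"model": model, "input": prompt}
    if effort is not None:
        request["reasoning"] = {"effort": effort}
    return request


def send(request: dict) -> str:
    """Send the request; requires `pip install openai` and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the sketch runs without it

    client = OpenAI()
    response = client.responses.create(**request)
    return response.output_text


if __name__ == "__main__":
    for label in LABELS:
        print(label, "->", label_to_request(label, "Hello"))
```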
  

eru 4 days ago

I'm trying out o3-pro now with some algorithmic questions. It seems to be doing alright, but it's taking an awfully long time (as expected) and the UIs seem to time out a lot, especially the Android app and the MacOS desktop app. The web interface seems the least flaky, but that's not saying much.

varjag 3 days ago
I had my own programming question for a while, which all models from all vendors have been reliably failing so far. It is a known problem with surprisingly few published implementations, as it has never been a part of leetcode, Euler or typical homework assignments. Yesterday o3-pro cleared it, using a more obscure algorithm I had never even heard of.
- reliabilityguy 3 days ago
  
  Can you provide the details? Sounds intriguing
  

ChrisArchitect 5 days ago

OpenAI dropped the price of o3 by 80%

https://news.ycombinator.com/item?id=44239359

tiahura 5 days ago

So, upgrade to Teams and pay the $50? Plus more usage of o3. Seems like it might be a shot at the $100 claude max?

dog436zkj3p7 5 days ago
What do you mean by "pay the $50"?
Also, does anybody know what limits o3-pro has under the team plan? I don't see it available in the model picker at all (on team).
- sanex 5 days ago
  
  I believe teams is $25/user with a 2 user minimum.
  

nickandbro 5 days ago

"create a svg of a pelican riding on a bicycle"

https://www.svgviewer.dev/s/c3j6TEAP

in case anyone is interested

ikerino 5 days ago

Am I right to say: doesn't look better than anything we've seen before?

vintagedave 4 days ago

> Update to o4-mini (June 6, 2025)
> We are rolling back an o4-mini snapshot, that we deployed less than a week ago and intended to improve the length of model responses, because our automated monitoring tools detected an increase in content flags.

Does anyone know what it did or returned? I had not seen anything, nor have I read anything, about issues here.

ikerino 5 days ago

https://www.latent.space/p/o3-pro

Have completed around a dozen chats with o3-pro so far. Can't say I'm impressed, output feels qualitatively very similar to regular o3.

Tried feeding in loads of context as suggested in the article but generally feels like a miss.

honeybadger1 4 days ago

Gemini still, for me, feels like the king for speed and accuracy.

GardenLetter27 3 days ago

Gemini 2.5 Pro is incredible - both in coding and text review.
DeepSeek isn't bad either (especially given its age now), and Claude is great for coding and tool use but too damn expensive.

conradfr 3 days ago

Nitpicking, but this page is not practical to share as there's no individual URL per post (AFAIK) (the # part is not picked up by Slack etc. to generate a preview).

paul7986 4 days ago

GPT needs way better image creation! Today I asked it to create a full image of a 2025 calendar highlighting all weekday workdays excluding federal holidays. At the bottom of legend tell me how many weekday work hours are available within criteria noted.

It created the image showing each month but when you looked at each month it was so janky ... February 31st and other huge errors!

I'm not using image creation to create 3D art for fun or art's sake; I'm trying to use it to create utility images to share for discussion with friends & co-workers. The above is just one of many ways it fails when creating utility images!

mkl 4 days ago
Wrong tool for the job. Try asking it to generate an SVG calendar with those features, or to generate Python code that produces an SVG calendar with those features.
- paul7986 3 days ago
  
  Well, I just want to create utility images by typing the request in a text box. I'm betting a few to maybe a lot of users are trying to create utility images too, just by typing in the prompt.
  I shouldn't need to know how to do that as a GPT and/or AI user ... the AI should just do it for the user via their request in the text box. That's the magic of AI to me.
- catlifeonmars 4 days ago
  
  That makes sense. Naively, one would expect this to be the type of reasoning that it should “figure out” on its own.
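
For reference, mkl's suggestion above (have Python produce the SVG) might look roughly like the sketch below. The holiday list is a hard-coded assumption (the 2025 US federal holidays as commonly published; verify against OPM before relying on it), and the 8-hour workday is also assumed since the original prompt does not say. Layout numbers are arbitrary.

```python
import calendar
from datetime import date

YEAR = 2025
HOURS_PER_DAY = 8  # assumption: the prompt did not specify hours per workday
# Assumed list of 2025 US federal holidays; check an official source.
HOLIDAYS = {
    date(2025, 1, 1), date(2025, 1, 20), date(2025, 2, 17), date(2025, 5, 26),
    date(2025, 6, 19), date(2025, 7, 4), date(2025, 9, 1), date(2025, 10, 13),
    date(2025, 11, 11), date(2025, 11, 27), date(2025, 12, 25),
}


def is_workday(d: date) -> bool:
    """Monday to Friday and not in the holiday set."""
    return d.weekday() < 5 and d not in HOLIDAYS


def workdays(year: int = YEAR) -> list[date]:
    cal = calendar.Calendar()
    return [d for m in range(1, 13) for d in cal.itermonthdates(year, m)
            if d.month == m and is_workday(d)]  # drop padding days from other months


def render_svg(year: int = YEAR) -> str:
    cal = calendar.Calendar()  # weeks start on Monday
    cell, mw, mh = 24, 7 * 24 + 20, 8 * 24 + 20
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{4 * mw}" '
             f'height="{3 * mh + 40}" font-family="sans-serif" font-size="11">']
    for m in range(1, 13):
        ox, oy = ((m - 1) % 4) * mw + 10, ((m - 1) // 4) * mh + 10
        parts.append(f'<text x="{ox}" y="{oy + 14}" font-weight="bold">'
                     f'{calendar.month_name[m]}</text>')
        for row, week in enumerate(cal.monthdayscalendar(year, m)):
            for col, day in enumerate(week):
                if day == 0:  # padding cell outside this month
                    continue
                x, y = ox + col * cell, oy + (row + 1) * cell
                fill = "#b7e4c7" if is_workday(date(year, m, day)) else "#eeeeee"
                parts.append(f'<rect x="{x}" y="{y}" width="{cell - 2}" '
                             f'height="{cell - 2}" fill="{fill}"/>')
                parts.append(f'<text x="{x + 4}" y="{y + 15}">{day}</text>')
    n = len(workdays(year))
    parts.append(f'<text x="10" y="{3 * mh + 30}">Legend: green = workday. '
                 f'{n} workdays x {HOURS_PER_DAY} h = {n * HOURS_PER_DAY} hours</text>')
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    with open("calendar_2025.svg", "w", encoding="utf-8") as fh:
        fh.write(render_svg())
```

Because the dates come from the `calendar` module rather than from image generation, a "February 31st" cannot appear.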

mmsc 5 days ago

I understand that things are moving fast and all, but surely the.. 8? models which are currently available are a bit .. overwhelming for users that just want to get answers to their questions of life? What's the end goal with having so many models available?

nickysielicki 5 days ago
I just can’t believe nobody at the company has enough courage to tell their leadership that their naming scheme is completely stupid and insane. Four is greater than three, and so four should be better than three. The point of a name is to describe something so that you don’t confuse your users, not to be cute.
- MallocVoidstar 5 days ago
  
  The reason their naming scheme is so bad is because their initial attempts at GPT-5 failed in training. It was supposed to be done ~1 year ago. Because they'd promised that GPT-5 would be vastly more intelligent than GPT-4, they couldn't just name any random model "GPT-5", so they suddenly had to start naming things differently. So now there's GPT-4.5, GPT-4.1, the o-series, ...
  
- transcriptase 5 days ago
  
  What’s worse is that the app doesn’t even have descriptions. As if I’m supposed to memorize the use case for each based on:
  GPT-4o
  o3
  o4-mini
  o4-mini-high
  GPT-4.5
  GPT-4.1
  GPT-4.1-mini
  
- dmos62 4 days ago
  
  If you obfuscate the naming, you obfuscate the value proposition, and people become easier to mislead into choosing an overly expensive model. Same as with Intel CPUs, or many many other hardware products.
- browningstreet 5 days ago
  
  At Techcrunch AI last week, the OpenAI guy started his presentation by acknowledging that OpenAI knows their naming is a problem and they're working on it, but it won't be fixed immediately.
  
- aetherspawn 5 days ago
  
  Came here to say this, the naming scheme is ridiculous and is getting more impossible to follow each day.
  For example the other day they released a supposedly better model with a lower number..
  
levocardia 5 days ago

There's a humorous version of Poe's law that says "any sufficiently genuine attempt to explain the differences between OpenAI's models is indistinguishable from parody"
Osyris 5 days ago
This is a much more expensive model to run and is only available to users who pay the most. I don't see an issue.
However, the "plus" plan absolutely could use some trimming.
- djrj477dhsnv 4 days ago
  
  If it's better (and newer) than gpt4, it shouldn't have a lower version number.
bachittle 5 days ago
free users don't have this model selector, and probably don't care which model they get so 4o is good enough. paid users at 20$/month get more models which are better, like o3. paid users at 200$/month get the best models that are also costing OpenAI the most money, like o3-pro. I think they plan to unify them with GPT-5.
- stavros 5 days ago
  
  That doesn't help much when we're asymptotically approaching GPT-5. We're probably going to be at GPT-4.9999 soon.
  
- nikcub 5 days ago
  
  I'd be curious what proportion of paid users ever switch models. I'd guess < 10%
  
AtlasBarfed 5 days ago
I'd like one to do my test use case:
Port unix-sed from c to java with a full test suite and all options supported.
Somewhere between "it answers questions of life" and "it beats PhDs at math questions", I'd like to see one LLM take this, IMO, rather "pure" language task and succeed.
It is complicated, but it isn't complex. It's string operations with a deep but not that deep expression system and flag set.
It is well-described and documented on the internet, and presumably training sets. It is succinctly described as a problem that virtually all computer coders would understand what it entailed if it were assigned to them. It is drudgerous, showing the opportunity for LLMs to show how they would improve true productivity.
GPT fails to do anything other than the most basic substitute operations. Claude was only slightly better, but to its detriment hallucinated massive amounts and made fake passing test cases that didn't even test the code.
The reaction I get to this test is ambivalence, but IMO if LLMs could help port entire software packages between languages with similar feature sets (aside from Turing Completeness), then software cross-use would explode, and maybe we could port "vulnerable" code to "safe" Rust en masse.
I get it, it's not what they are chasing customer-wise. They want to write (in n-gate terms) webcrap.
- nipah 4 days ago
  
  I have a very simple question with like, 5 lines at best, that basically no model, neither reasoning nor simpler, could grasp. For obvious reasons I'm not disclosing it here (because I fear data contamination in the long run), but it basically breaks the "reasoning" of those things. Unfortunately, I still can't try o3-pro because the API version is not easily available, and I'm certainly not willing to pay for it in pro mode, but when it comes to the plus version (if it comes) I'll try. To this date, because of this question (and similar ones) I stand very unimpressed with those models, the marketing is a thousand times larger than reality, and I suspect people in general are surprisingly less capable of detecting intelligence than they think.
  The normal o3 also managed to break 3 isolated installations of Linux I was trying it with, a few days ago. The task was very simple: set up Ubuntu with btrfs, timeshift and grub-btrfs. It managed to fail every single time (even when searching the web), so it was not impressive either.
- jiggawatts 3 days ago
  
  The massive real market here is enterprises that need to rewrite legacy code to modern platforms, retaining the business logic as-is but modernising the style.
  .NET Framework 4.x to .NET 10, Python 2 to 3, Java 8 to <current version>, etc...
  The advantage the LLMs have here is that staying within the same programming language and its paradigm is dramatically simpler than converting a "procedural" language like C to an object-oriented language like Java that has a wildly different standard library.
- CamperBob2 5 days ago
  
  How does the latest Gemini 2.5 Pro Ultra Flash Max Hemi XLT release do on that task? It obviously demands a massive context window.
  
resters 5 days ago

Models are used for actual tasks where predictable behavior is a benefit. Models are also used on cutting-edge tasks where smarter/better outputs are highly valued. Some applications value speed and so a new, smaller/cheaper model can be just right.
I think the naming scheme is just fine and is very straightforward to anyone who pays the slightest bit of attention.
paxys 5 days ago

> users that just want to get answers to their questions of life
Those users go to chat.openai.com (or download the app), type text in the box and click send.
macawfish 5 days ago

Overwhelming yet pretty underwhelming


Workaccount2 5 days ago
With Gemini 2.5 in AI studio you can now increase the amount of thinking tokens, and it definitely makes a difference. O3 pro is most likely O3 with an expanded thinking token budget.
- energy123 5 days ago
  
  Isn't that just increasing the upper bound on thinking tokens, which is rarely hit even on much lower levels?
- dbbk 5 days ago
  
  Or my favourite, tell Claude to "ultrathink"
- cluckindan 5 days ago
  
  It is not thinking. It is trying to deceive you. The ”reasoning” it outputs does not have a causal relationship with the end result.
  
