Comment by CSMastermind

6 days ago

This is almost certainly what they're doing: rebranding the original o3 model as "o3-pro".

Nope, not what we’re doing.

o3 is still o3 (no nerfing) and o3-pro is new and better than o3.

If we were lying about this, it would be really easy to catch us - just run evals.

(I work at OpenAI.)
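For concreteness, a minimal sketch of the kind of check being suggested: keep a fixed prompt set, re-run it against the API on a schedule, and watch the score over time. The prompts and the substring grader below are placeholders you would replace with your own task.

```python
# Sketch: re-run a fixed eval set over time; a silent model change should show
# up as a shift in the average score. Prompts/grading here are toy examples.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    ("What is 17 * 24?", "408"),
    ("Name the capital of Australia.", "Canberra"),
]

def run_eval(model: str) -> float:
    hits = 0
    for prompt, expected in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content or ""
        hits += expected.lower() in answer.lower()
    return hits / len(PROMPTS)

print(run_eval("o3"))
```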

  • Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I compare Gemini 2.5 Pro with o3, running the same prompt in both. For my personal use, o3 and G2.5P have been neck and neck over the last few months, with responses I've been very happy with.

    However, starting about a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).

    This, alongside the news that you've cut the price of o3 by 80%, really makes it feel like you've quantized the model or knee-capped its thinking. If you say it's wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.

  • Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.

    o4-mini-high, o4-mini, o3, o3-pro, gpt-4o

    Oy.

  • Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.

    • o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.

      o3-pro is a different thing - it's not just o3 with maximum reasoning effort.

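For concreteness, a minimal sketch of that distinction using the OpenAI Python SDK, where reasoning effort is a per-request parameter on the same model. Whether o3-pro is served through this same endpoint is an assumption here, and the prompt is just an example.

```python
from openai import OpenAI

client = OpenAI()
question = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Same model, different reasoning effort: a request parameter, not a new model.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3",
        reasoning_effort=effort,
        messages=question,
    )
    print(effort, resp.choices[0].message.content[:80])

# o3-pro is a distinct model name, not an effort setting layered on o3.
pro = client.chat.completions.create(model="o3-pro", messages=question)
```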

  • Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.

  • Not quantized?

    • Not quantized. Weights are the same.

      If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].

      [1] `chatgpt-4o-latest` being an explicit exception

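A small sketch of what that convention means on the caller's side; the dated snapshot string below is illustrative rather than a confirmed release name.

```python
from openai import OpenAI

client = OpenAI()
msg = [{"role": "user", "content": "Hello"}]

# Pinned, dated snapshot: the model behind this name is not supposed to change.
pinned = client.chat.completions.create(model="o3-2025-06-10", messages=msg)

# Rolling alias: `chatgpt-4o-latest` is the explicit exception noted above and
# may point at whatever the current ChatGPT model is.
rolling = client.chat.completions.create(model="chatgpt-4o-latest", messages=msg)
```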

  • I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).

    Like everyone else, I didn't read the ToS, but my guess is that degrading model performance at peak times is one of the things that could slip through. We're not suggesting you're running a different model, but that you're quantizing it so you can support more people.

    This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing yourself. With OpenAI/Claude, we don't know which model is running, how large it is, or what it's running on. None of that is disclosed, and the only reason I can think of is to be able to reduce resources unnoticed.

    • An (arbitrarily) quantized model is a totally different model from the original.

    • I'm not sure how, at this point in your time online, you read someone stating their job as a brag rather than what it really is: providing transparency/disclosure before stating their thoughts.

      This is HN and not reddit.

      "I didn't read the ToS, like everyone else, but my guess..."

      Ah, there it is.
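On the open-weights point above: with a local model, quantization is something you choose explicitly at load time, so it can't be changed underneath you. A minimal sketch with Hugging Face transformers and bitsandbytes; the model name and 4-bit settings are just examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example open-weight model

# You pick the precision yourself; nobody can silently swap in a cheaper one.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```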

Where are you getting this information? What basis do you have for making this claim? OpenAI, despite its public drama, is still a massive brand, and if this were exposed, it would tank the company's reputation. I think making baseless claims like this is dangerous for HN.

  • I think Gell-Mann amnesia happens here too, where you can see how wrong HN comments are on a topic you know deeply, but then forget about that when reading the comments on another topic.

> rebranding the original o3 model as "o3-pro"

Interesting take; I wouldn't be surprised if they did that.

The -pro models appear to be a best-of-10 sampling of the original full-size model.

  • How do you sample it behind the scenes? Usually best-of-X means you generate X outputs and choose the best result.

    If you could do this automatically, it would be a game changer, as you could run the top 5 best models in parallel and select the best answer every time.

    But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them.

    • > If you could do this automatically, it would be a game changer, as you could run the top 5 best models in parallel and select the best answer every time.

      Remember, they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent back.

    • I believe it's a majority-vote kind of thing, rather than picking the best single result.
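None of this is documented, but the two ideas in this subthread are straightforward to sketch: sample N candidates, then either rerank them with a scoring/reward model or take a majority vote over the final answers. The `reward_model` and `extract_answer` callables below are stand-ins, not real APIs.

```python
from collections import Counter
from typing import Callable
from openai import OpenAI

client = OpenAI()

def sample_candidates(prompt: str, n: int = 10) -> list[str]:
    """Draw n independent samples from the same underlying model."""
    out = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="o3",
            messages=[{"role": "user", "content": prompt}],
        )
        out.append(resp.choices[0].message.content or "")
    return out

def best_of_n(prompt: str, reward_model: Callable[[str, str], float], n: int = 10) -> str:
    """Best-of-N: rerank candidates with a scorer (e.g. an internal reward model)."""
    candidates = sample_candidates(prompt, n)
    return max(candidates, key=lambda c: reward_model(prompt, c))

def majority_vote(prompt: str, extract_answer: Callable[[str], str], n: int = 10) -> str:
    """Self-consistency: pick the most common final answer across samples."""
    answers = [extract_answer(c) for c in sample_candidates(prompt, n)]
    return Counter(answers).most_common(1)[0][0]
```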