
Comment by tedsanders

6 days ago

Nope, not what we’re doing.

o3 is still o3 (no nerfing) and o3-pro is new and better than o3.

If we were lying about this, it would be really easy to catch us - just run evals.

(I work at OpenAI.)
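For reference, a minimal sketch of what "just run evals" could look like in practice: a fixed prompt set scored against the same model name on different days. Everything here (the prompts, the grading rule, the use of the OpenAI Python SDK's chat completions endpoint) is an illustrative assumption, not an official eval harness.

```python
# Illustrative only: a bare-bones regression eval to detect a silent model change.
# Prompts and the grading rule are placeholders; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Hypothetical fixed test set: (prompt, expected substring) pairs.
EVAL_SET = [
    ("What is 17 * 24? Answer with just the number.", "408"),
    ("Name the capital of Australia in one word.", "Canberra"),
]

def run_eval(model: str) -> float:
    """Return the fraction of prompts whose answer contains the expected string."""
    correct = 0
    for prompt, expected in EVAL_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if expected.lower() in resp.choices[0].message.content.lower():
            correct += 1
    return correct / len(EVAL_SET)

# Run this on a schedule and log the score; a sustained drop would suggest
# the model behind the name has changed.
print("o3 accuracy:", run_eval("o3"))
```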

Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally, for my personal use, o3 and G2.5P have been neck-and-neck over the last few months, with responses I have been very happy with.

However, starting about a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).

This, alongside the news that you guys have decreased the price of o3 by 80%, really does make it feel like you've quantized the model or kneecapped its thinking or something. If you say it is wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.

  • Are you sure you're using the same models? G2.5P updated almost exactly a week ago.

    • G2.5P might've updated, but that's not the model I noticed a difference in. o3 seemed noticeably dumber in isolation, not just compared to G2.5P.

      But yes, perhaps the answer is that about a week ago I started asking subconsciously harder questions, and G2.5P handled them better because it had just been improved, while o3 had not, so it seemed worse. Or perhaps G2.5P has always had more capacity than o3, and I wasn't asking hard enough questions to notice a difference before.

Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.

o4-mini-high, o4-mini, o3, o3-pro, gpt-4o

Oy.

Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.

  • o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.

    o3 pro is a different thing - it's not just o3 with maximum reasoning effort.
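A minimal sketch of that distinction, assuming the OpenAI Python SDK's chat completions endpoint (an illustration, not part of the comment): the model name stays "o3" and the effort level is just a per-request parameter.

```python
# Sketch: same model name, different reasoning effort per request.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="o3",                # the model is still o3
        reasoning_effort=effort,   # effort is a parameter, not a new model
        messages=[{"role": "user", "content": "Briefly explain the CAP theorem."}],
    )
    print(effort, response.choices[0].message.content[:80])
```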

    • Why's it called o3 then if it's a different thing? There's already a rather extreme amount of confusion with the model names and it's not clear _at all_ which model would be "the best" in terms of response quality.

      Here's the current state of the version numbers, as far as I can piece it together (using my best guess at the name of each component of the version identifier; might be totally wrong tho; a rough regex sketch of the whole thing follows after the list):

      1) prefix (optional): "gpt-", "chatgpt-"

      2) family (required): o1, o3, o4, 4o, 3.5, 4, 4.1, 4.5,

      3) quality? (optional): "nano", "mini", "pro", "turbo"

      4) type (optional): "audio", "search"

      5) lifecycle (optional): "preview", "latest"

      6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc. (I assume the short ones are month-day dates with the year omitted?)

      7) size (optional): "16k"

      Some final combinations use as few as one of these components ("o3") or as many as six ("gpt-4o-mini-search-preview-2024-12-17").

      Given this mess, I can't blame people for assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
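To make the guessed taxonomy above concrete, here is a rough regex sketch of it. The component names and ordering mirror the list above; this is not an official grammar, and real model IDs may not all fit it.

```python
# Rough sketch of the guessed naming components above; not an official grammar.
import re

PATTERN = re.compile(
    r"^(?P<prefix>gpt-|chatgpt-)?"                       # 1) prefix
    r"(?P<family>o1|o3|o4|4o|3\.5|4\.5|4\.1|4)"          # 2) family
    r"(?:-(?P<quality>nano|mini|pro|turbo))?"            # 3) quality?
    r"(?:-(?P<type>audio|search))?"                      # 4) type
    r"(?:-(?P<lifecycle>preview|latest))?"               # 5) lifecycle
    r"(?:-(?P<date>\d{4}-\d{2}-\d{2}|\d{4}))?"           # 6) date
    r"(?:-(?P<size>16k))?$"                              # 7) size
)

for name in ("o3", "gpt-4o-mini-search-preview-2024-12-17", "gpt-3.5-turbo-16k"):
    m = PATTERN.match(name)
    print(name, "->", {k: v for k, v in m.groupdict().items() if v} if m else "no match")
```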


    • Could someone there maybe possibly use, oh I dunno, ChatGPT and come up with some better product names?

Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.

Not quantized?

  • Not quantized. Weights are the same.

    If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].

    [1] `chatgpt-4o-latest` being an explicit exception
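A small sketch of the practical difference being described (an illustration; the model names are ones mentioned in this thread, and the alias behavior is as described above): a dated snapshot name stays pinned, while `chatgpt-4o-latest` is the documented exception that can move.

```python
# Sketch: pinned dated snapshot vs. the floating `chatgpt-4o-latest` alias.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with one word: ping"}],
    )
    # resp.model reports which model actually served the request.
    return resp.model

# Dated snapshot: the name refers to one fixed release.
print(ask("gpt-4o-2024-05-13"))

# Floating alias: documented to track the latest ChatGPT model,
# so what serves it can change over time.
print(ask("chatgpt-4o-latest"))
```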

    • >we'd release it as a new model with a new name

      Speaking of a new name: I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" prompt for OpenAI.

    • It was definitely annoying when o1 disappeared overnight; my impression is that it was better at some tasks than o3.

I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).

I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times is one of the things that can slip through. We're not suggesting you're running a different model, but that you're quantizing it so that you can support more people.

This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know which model is running, how large it is, what it's running on, etc. None of that is provided, and there's only one reason I can think of: to be able to reduce resources unnoticed.

  • An (arbitrarily) quantized model is a totally different model, compared to the original.
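A toy illustration of that point (nothing OpenAI-specific): naively quantizing a weight tensor to int8 and dequantizing it back gives you different numbers, i.e. a genuinely different model.

```python
# Toy example: symmetric per-tensor int8 quantization changes the weights.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # "original" weights

scale = np.abs(w).max() / 127.0                  # map the largest weight to 127
w_q = np.round(w / scale).astype(np.int8)        # quantized weights
w_dq = w_q.astype(np.float32) * scale            # what actually gets used

print("max abs weight error:", float(np.abs(w - w_dq).max()))  # nonzero
```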

  • I'm not totally sure how, at this point in your online presence, you read someone stating their job as a "brag" rather than what it really is: providing transparency/disclosure before stating their thoughts.

    This is HN and not reddit.

    "I didn't read the ToS, like everyone else, but my guess..."

    Ah, there it is.