
Comment by tedsanders

6 days ago

Nope, not what we’re doing.

o3 is still o3 (no nerfing) and o3-pro is new and better than o3.

If we were lying about this, it would be really easy to catch us - just run evals.

(I work at OpenAI.)
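For reference, a minimal sketch of what "just run evals" could look like in practice: a fixed prompt set scored against the same model name on different days. Everything here (the prompts, the grading rule, the use of the OpenAI Python SDK's chat completions endpoint) is an illustrative assumption, not an official eval harness.

```python
# Illustrative only: a bare-bones regression eval to detect a silent model change.
# Prompts and the grading rule are placeholders; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Hypothetical fixed test set: (prompt, expected substring) pairs.
EVAL_SET = [
    ("What is 17 * 24? Answer with just the number.", "408"),
    ("Name the capital of Australia in one word.", "Canberra"),
]

def run_eval(model: str) -> float:
    """Return the fraction of prompts whose answer contains the expected string."""
    correct = 0
    for prompt, expected in EVAL_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if expected.lower() in resp.choices[0].message.content.lower():
            correct += 1
    return correct / len(EVAL_SET)

# Run this on a schedule and log the score; a sustained drop would suggest
# the model behind the name has changed.
print("o3 accuracy:", run_eval("o3"))
```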

Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally, for my personal use, o3 and G2.5P have been neck-and-neck over the last few months, with responses I have been very happy with.

However, starting about a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).

This, alongside the news that you guys have decreased the price of o3 by 80%, really does make it feel like you've quantized the model or kneecapped its thinking or something. If you say it is wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.

  • Are you sure you're using the same models? G2.5P updated almost exactly a week ago.

    • G2.5P might've updated, but that's not the model I noticed a difference in. o3 seemed noticeably dumber in isolation, not just compared to G2.5P.

      But yes, perhaps the answer is that about a week ago I started asking subconsciously harder questions, and G2.5P handled them better because it had just been improved, while o3 had not, so it seemed worse. Or perhaps G2.5P has always had more capacity than o3, and I wasn't asking hard enough questions to notice a difference before.

Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.

o4-mini-high, o4-mini, o3, o3-pro, gpt-4o

Oy.

Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.

  • o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.

    o3 pro is a different thing - it's not just o3 with maximum reasoning effort.
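A minimal sketch of that distinction, assuming the OpenAI Python SDK's chat completions endpoint (an illustration, not part of the comment): the model name stays "o3" and the effort level is just a per-request parameter.

```python
# Sketch: same model name, different reasoning effort per request.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="o3",                # the model is still o3
        reasoning_effort=effort,   # effort is a parameter, not a new model
        messages=[{"role": "user", "content": "Briefly explain the CAP theorem."}],
    )
    print(effort, response.choices[0].message.content[:80])
```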

    • Why's it called o3 then if it's a different thing? There's already a rather extreme amount of confusion with the model names and it's not clear _at all_ which model would be "the best" in terms of response quality.

      Here's the current state of the version numbers, as far as I can piece it together (using my best guess at the name of each component of the version identifier; might be totally wrong tho; a rough regex sketch of the whole thing follows after the list):

      1) prefix (optional): "gpt-", "chatgpt-"

      2) family (required): o1, o3, o4, 4o, 3.5, 4, 4.1, 4.5,

      3) quality? (optional): "nano", "mini", "pro", "turbo"

      4) type (optional): "audio", "search"

      5) lifecycle (optional): "preview", "latest"

      6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc. (I assume the short ones are month-day dates with the year omitted?)

      7) size (optional): "16k"

      Some final combinations use as few as one of these components ("o3") or as many as six ("gpt-4o-mini-search-preview-2024-12-17").

      Given this mess, I can't blame people for assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
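To make the guessed taxonomy above concrete, here is a rough regex sketch of it. The component names and ordering mirror the list above; this is not an official grammar, and real model IDs may not all fit it.

```python
# Rough sketch of the guessed naming components above; not an official grammar.
import re

PATTERN = re.compile(
    r"^(?P<prefix>gpt-|chatgpt-)?"                       # 1) prefix
    r"(?P<family>o1|o3|o4|4o|3\.5|4\.5|4\.1|4)"          # 2) family
    r"(?:-(?P<quality>nano|mini|pro|turbo))?"            # 3) quality?
    r"(?:-(?P<type>audio|search))?"                      # 4) type
    r"(?:-(?P<lifecycle>preview|latest))?"               # 5) lifecycle
    r"(?:-(?P<date>\d{4}-\d{2}-\d{2}|\d{4}))?"           # 6) date
    r"(?:-(?P<size>16k))?$"                              # 7) size
)

for name in ("o3", "gpt-4o-mini-search-preview-2024-12-17", "gpt-3.5-turbo-16k"):
    m = PATTERN.match(name)
    print(name, "->", {k: v for k, v in m.groupdict().items() if v} if m else "no match")
```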


    • Could someone there maybe possibly use, oh I dunno, ChatGPT and come up with some better product names?

Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.

Not quantized?

  • Not quantized. Weights are the same.

    If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].

    [1] `chatgpt-4o-latest` being an explicit exception
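A small sketch of the practical difference being described (an illustration; the model names are ones mentioned in this thread, and the alias behavior is as described above): a dated snapshot name stays pinned, while `chatgpt-4o-latest` is the documented exception that can move.

```python
# Sketch: pinned dated snapshot vs. the floating `chatgpt-4o-latest` alias.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with one word: ping"}],
    )
    # resp.model reports which model actually served the request.
    return resp.model

# Dated snapshot: the name refers to one fixed release.
print(ask("gpt-4o-2024-05-13"))

# Floating alias: documented to track the latest ChatGPT model,
# so what serves it can change over time.
print(ask("chatgpt-4o-latest"))
```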

    • >we'd release it as a new model with a new name

      Speaking of a new name: I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" prompt for OpenAI.

    • It was definitely annoying when o1 disappeared overnight; my impression is that it was better at some tasks than o3.

I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).

I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times is one of the things that can slip through. We're not suggesting you're running a different model, but that you're quantizing it so that you can support more people.

This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know which model is running, how large it is, what it's running on, etc. None of that is provided, and there's only one reason I can think of: to be able to reduce resources unnoticed.

  • An (arbitrarily) quantized model is a totally different model, compared to the original.
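A toy illustration of that point (nothing OpenAI-specific): naively quantizing a weight tensor to int8 and dequantizing it back gives you different numbers, i.e. a genuinely different model.

```python
# Toy example: symmetric per-tensor int8 quantization changes the weights.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # "original" weights

scale = np.abs(w).max() / 127.0                  # map the largest weight to 127
w_q = np.round(w / scale).astype(np.int8)        # quantized weights
w_dq = w_q.astype(np.float32) * scale            # what actually gets used

print("max abs weight error:", float(np.abs(w - w_dq).max()))  # nonzero
```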

  • I'm not totally sure how, at this point in your online presence, you read someone stating their job as a "brag" rather than what it really is: providing transparency/disclosure before stating their thoughts.

    This is HN and not reddit.

    "I didn't read the ToS, like everyone else, but my guess..."

    Ah, there it is.