Comment by CSMastermind

6 days ago

This is almost certainly what they're doing: rebranding the original o3 model as "o3-pro".

Nope, not what we’re doing.

o3 is still o3 (no nerfing) and o3-pro is new and better than o3.

If we were lying about this, it would be really easy to catch us - just run evals.

(I work at OpenAI.)
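For concreteness, a minimal sketch of the kind of check being suggested: keep a fixed prompt set, re-run it against the API on a schedule, and watch the score over time. The prompts and the substring grader below are placeholders you would replace with your own task.

```python
# Sketch: re-run a fixed eval set over time; a silent model change should show
# up as a shift in the average score. Prompts/grading here are toy examples.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    ("What is 17 * 24?", "408"),
    ("Name the capital of Australia.", "Canberra"),
]

def run_eval(model: str) -> float:
    hits = 0
    for prompt, expected in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content or ""
        hits += expected.lower() in answer.lower()
    return hits / len(PROMPTS)

print(run_eval("o3"))
```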

  • Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I compare Gemini 2.5 Pro with o3, running the same prompt in both. For my personal use, o3 and G2.5P have been neck and neck over the last few months, with responses I've been very happy with.

    However, starting about a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).

    This, alongside the news that you've cut the price of o3 by 80%, really makes it feel like you've quantized the model or knee-capped its thinking. If you say it's wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.

  • Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.

    o4-mini-high, o4-mini, o3, o3-pro, gpt-4o

    Oy.

  • Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.

    • o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.

      o3-pro is a different thing - it's not just o3 with maximum reasoning effort.

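For concreteness, a minimal sketch of that distinction using the OpenAI Python SDK, where reasoning effort is a per-request parameter on the same model. Whether o3-pro is served through this same endpoint is an assumption here, and the prompt is just an example.

```python
from openai import OpenAI

client = OpenAI()
question = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Same model, different reasoning effort: a request parameter, not a new model.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3",
        reasoning_effort=effort,
        messages=question,
    )
    print(effort, resp.choices[0].message.content[:80])

# o3-pro is a distinct model name, not an effort setting layered on o3.
pro = client.chat.completions.create(model="o3-pro", messages=question)
```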

  • Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.

  • Not quantized?

    • Not quantized. Weights are the same.

      If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].

      [1] `chatgpt-4o-latest` being an explicit exception

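A small sketch of what that convention means on the caller's side; the dated snapshot string below is illustrative rather than a confirmed release name.

```python
from openai import OpenAI

client = OpenAI()
msg = [{"role": "user", "content": "Hello"}]

# Pinned, dated snapshot: the model behind this name is not supposed to change.
pinned = client.chat.completions.create(model="o3-2025-06-10", messages=msg)

# Rolling alias: `chatgpt-4o-latest` is the explicit exception noted above and
# may point at whatever the current ChatGPT model is.
rolling = client.chat.completions.create(model="chatgpt-4o-latest", messages=msg)
```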

  • I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).

    Like everyone else, I didn't read the ToS, but my guess is that degrading model performance at peak times is one of the things that could slip through. We're not suggesting you're running a different model, but that you're quantizing it so you can support more people.

    This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing yourself. With OpenAI/Claude, we don't know which model is running, how large it is, or what it's running on. None of that is disclosed, and the only reason I can think of is to be able to reduce resources unnoticed.

    • An (arbitrarily) quantized model is a totally different model from the original.

    • I'm not sure how, at this point in your time online, you read someone stating their job as a brag rather than what it really is: providing transparency/disclosure before stating their thoughts.

      This is HN and not reddit.

      "I didn't read the ToS, like everyone else, but my guess..."

      Ah, there it is.
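On the open-weights point above: with a local model, quantization is something you choose explicitly at load time, so it can't be changed underneath you. A minimal sketch with Hugging Face transformers and bitsandbytes; the model name and 4-bit settings are just examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example open-weight model

# You pick the precision yourself; nobody can silently swap in a cheaper one.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```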

Where are you getting this information? What basis do you have for making this claim? OpenAI, despite its public drama, is still a massive brand, and if this were exposed, it would tank the company's reputation. I think making baseless claims like this is dangerous for HN.

  • I think Gell-Mann amnesia happens here too, where you can see how wrong HN comments are on a topic you know deeply, but then forget about that when reading the comments on another topic.

> rebranding the original o3 model as "o3-pro"

Interesting take; I wouldn't be surprised if they did that.

The -pro models appear to be a best-of-10 sampling of the original full-size model.

  • How do you sample it behind the scenes? Usually best-of-X means you generate X outputs and choose the best result.

    If you could do this automatically, it would be a game changer, as you could run the top 5 best models in parallel and select the best answer every time.

    But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them.

    • > If you could do this automatically, it would be a game changer, as you could run the top 5 best models in parallel and select the best answer every time.

      Remember, they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent back.

    • I believe it's a majority-vote kind of thing, rather than picking the best single result.
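None of this is documented, but the two ideas in this subthread are straightforward to sketch: sample N candidates, then either rerank them with a scoring/reward model or take a majority vote over the final answers. The `reward_model` and `extract_answer` callables below are stand-ins, not real APIs.

```python
from collections import Counter
from typing import Callable
from openai import OpenAI

client = OpenAI()

def sample_candidates(prompt: str, n: int = 10) -> list[str]:
    """Draw n independent samples from the same underlying model."""
    out = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="o3",
            messages=[{"role": "user", "content": prompt}],
        )
        out.append(resp.choices[0].message.content or "")
    return out

def best_of_n(prompt: str, reward_model: Callable[[str, str], float], n: int = 10) -> str:
    """Best-of-N: rerank candidates with a scorer (e.g. an internal reward model)."""
    candidates = sample_candidates(prompt, n)
    return max(candidates, key=lambda c: reward_model(prompt, c))

def majority_vote(prompt: str, extract_answer: Callable[[str], str], n: int = 10) -> str:
    """Self-consistency: pick the most common final answer across samples."""
    answers = [extract_answer(c) for c in sample_candidates(prompt, n)]
    return Counter(answers).most_common(1)[0][0]
```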