Comment by behnamoh

6 days ago

How do we know it's not a quantized version of o3? What's stopping these firms from announcing the full model so it performs well on the benchmarks, and then gradually quantizing it (first at Q8 so no one notices, then Q6, then Q4, ...)?

I have a suspicion that's how they were able to get gpt-4-turbo out so fast. In practice, I found it inferior to the original GPT-4, but the company probably benchmaxxed the hell out of the turbo and 4o versions, so even though they were worse models, users found them more pleasing.

This is almost certainly what they're doing, rebranding the original o3 model as "o3-pro".

  • Nope, not what we’re doing.

    o3 is still o3 (no nerfing) and o3-pro is new and better than o3.

    If we were lying about this, it would be really easy to catch us - just run evals.

    (I work at OpenAI.)

    • Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally, for my personal use, o3 and G2.5P have been neck and neck over the last few months, with responses I have been very happy with.

      However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).

      This, alongside the news that you've decreased the price of o3 by 80%, really makes it feel like you've quantized the model or knee-capped its thinking or something. If you say it is wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.

      2 replies →

    • Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.

      o4-mini-high, o4-mini, o3, o3-pro, gpt-4o

      Oy.

    • Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.

    • I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).

      Like everyone else, I didn't read the ToS, but my guess is that degrading model performance at peak times is one of the things that could slip through. We are not suggesting you are running a different model, but that you are quantizing it so that you can support more people.

      This can't happen with open-weight models, where you take the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know which model is running, how large it is, what it is running on, etc. None of that is provided, and there is only one reason I can think of: to be able to reduce resources unnoticed.

      2 replies →

  • Where are you getting this information? What basis do you have for making this claim? OpenAI, despite its public drama, is still a massive brand, and if this were exposed, it would tank the company's reputation. I think making baseless claims like this is dangerous for HN.

    • I think Gell-Mann amnesia happens here too, where you can see how wrong HN comments are on a topic you know deeply, but then forget about that when reading the comments on another topic.

  • > rebranding the original o3 model as "o3-pro"

    Interesting take; I wouldn't be surprised if they did that.

  • The -pro models appear to be a best-of-10 sampling of the original full-size model.

    • How do you sample it behind the scenes? Usually best-of-X means you generate X outputs and choose the best result.

      If you could do this automatically, it would be a game changer: you could run the top 5 models in parallel and select the best answer every time.

      But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them yourself (a sketch of automating that selection step follows this sub-thread).

      4 replies →
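To make the question above concrete: the usual way to automate the "read all N answers and pick the best" step is to hand the comparison to a judge model. The sketch below is a generic best-of-N loop with an LLM judge, written against the OpenAI Python SDK; the model names, prompts, and naive answer parsing are illustrative assumptions, not a description of how the -pro variants actually work (the thread is only speculating about that).

```python
# Minimal best-of-N sketch: sample N candidate answers, then ask a judge
# model to pick the best one. Model names are placeholders.
from openai import OpenAI

client = OpenAI()

N = 5
QUESTION = "Prove that the sum of two even integers is even."

def ask(model: str, prompt: str, temperature: float = 1.0) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content or ""

# 1. Generate N independent candidates (higher temperature for diversity).
candidates = [ask("gpt-4o", QUESTION) for _ in range(N)]

# 2. Let a judge model do the comparison a human would otherwise have to do.
judge_prompt = (
    f"Question:\n{QUESTION}\n\n"
    + "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    + "\n\nReply with only the number of the best candidate."
)
verdict = ask("gpt-4o", judge_prompt, temperature=0)

# Naive parsing; a real harness would validate the judge's reply.
best = candidates[int(verdict.strip()) - 1]
print(best)
```

A single "pick the best" verdict is itself imperfect, so real setups tend to lean on rubrics, pairwise comparisons, or verifiable checks (unit tests, exact answers) rather than one free-form judgment.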

I swear every time a new model is released it's great at first, but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output, which also nerfed the really good output. Now I'm wondering if they were quantizing it.

  • I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?

    • I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.

      When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted, and then just expect it to be able to handle ever more complex queries and get disappointed when I hit a new limit.

      4 replies →

    • There are definitely measurements (e.g. https://hdsr.mitpress.mit.edu/pub/y95zitmz/release/2 ), but I imagine they're rare because those benchmarks are expensive, so nobody keeps running them all the time?

      Anecdotally, it's quite clear that some models are throttled during the day (e.g. Claude sometimes falls back to "concise mode", with and without a warning in the app).

      You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).

      Finally, there are cases where it was confirmed by the company, like GPT-4o's sycophantic tirade that very clearly impacted its output (https://openai.com/index/sycophancy-in-gpt-4o/).

      6 replies →

    • I assumed it was because the first week revealed a ton of safety issues, which they then "patched" by adjusting the system prompt, thus using up more inference tokens on things other than the user's request.

    • My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.

      Which is why the base model wouldn't necessarily show differences when you benchmarked them.

    • It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.

    • I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.

  • I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.

    Like, I suspect that if there were a "new" model that was just best-of-256 sampling of gpt-3.5-turbo, it too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little while to notice).
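The selection effect described in the comment above is easy to see in a toy simulation: keep only the questions a model happened to answer correctly once, re-run the very same unchanged model on them later, and the pass rate on that hand-picked set drops sharply. The numbers here are made up purely for illustration.

```python
# Toy demonstration of regression to the mean in "private evals".
import random

random.seed(0)

N_QUESTIONS = 10_000
TRUE_SOLVE_PROB = 0.30   # the new model's real per-question success rate on hard items

# Phase 1: each question is tried once right after release.
first_try = [random.random() < TRUE_SOLVE_PROB for _ in range(N_QUESTIONS)]

# People keep the questions the new model "newly" solved, i.e. the lucky successes.
kept = [i for i, ok in enumerate(first_try) if ok]

# Phase 2: months later, the *same* questions are retried on the *unchanged* model.
second_try = [random.random() < TRUE_SOLVE_PROB for _ in kept]

print("accuracy at release on the kept questions: 100% (by construction)")
print(f"accuracy on the same questions later:      {sum(second_try) / len(kept):.0%}")
```

Nothing about the model changes between the two phases; the apparent regression is entirely an artifact of how the questions were selected.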

  • It seems that at least Google is overselling its compute capacity.

    You pay a monthly fee, but Gemini is completely jammed for 5-6 hours while North America is working.

  • I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities it has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.

    There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.

  • That was my suspicion when I first deleted my account, when it felt like the ChatGPT output got worse and I found it highly suspicious to see an errant davinci model keyword in the ChatGPT URL.

    Now I'm feeling similarly about their image generation (which is the only reason I created a paid account two months ago); the output looks more generic by default.

  • It’s easy to measure the models getting worse, so you should be suspicious that nobody who claims this has scientific evidence to back it up.
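A minimal version of the measurement that comment is asking for: pin a dated snapshot, re-run a fixed prompt suite on a schedule, and log the pass rate. The prompts, expected answers, and file name below are placeholders, and the snapshot name is the one cited elsewhere in the thread; a real harness would use a much larger suite and several samples per prompt.

```python
# Drift-check sketch: fixed prompts against a pinned snapshot, logged over time.
import datetime
import json

from openai import OpenAI

client = OpenAI()

MODEL = "o3-2025-04-16"   # dated snapshot mentioned in the thread
SUITE = [
    {"prompt": "What is 17 * 24? Answer with the number only.", "expect": "408"},
    {"prompt": "Name the capital of Australia in one word.", "expect": "Canberra"},
]

def run_suite() -> float:
    passed = 0
    for case in SUITE:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        answer = resp.choices[0].message.content or ""
        passed += case["expect"].lower() in answer.lower()
    return passed / len(SUITE)

# Append a dated record; rerun weekly and compare pass rates across runs.
record = {
    "date": datetime.date.today().isoformat(),
    "model": MODEL,
    "pass_rate": run_suite(),
}
with open("drift_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
print(record)
```

A log like this, kept over weeks, is the kind of evidence that would separate genuine model drift from shifting user expectations.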

It's the same model, no quantization, no gimmicks.

In the API, we never make silent changes to models, as that would be super annoying to API developers [1]. In ChatGPT, it's a little less clear when we update models because we don't want to bombard regular users with version numbers in the UI, but it's still not totally silent/opaque - we document all model updates in the ChatGPT release notes [2].

[1] chatgpt-4o-latest is an exception; we explicitly update this model pointer without warning.

[2] ChatGPT Release Notes document our updates to gpt-4o and other models: https://help.openai.com/en/articles/6825453-chatgpt-release-...

(I work at OpenAI.)

From the announcement email:

> Today, we dropped the price of OpenAI o3 by 80%, bringing the cost down to $2 / 1M input tokens and $8 / 1M output tokens.

> We optimized our inference stack that serves o3—this is the same exact model, just cheaper.

Is this what happened to Gemini 2.5 Pro? It used to be very good, but it's started struggling on basic tasks.

The thing that gets me is it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page and it sometimes takes multiple screenshots of the page to convince it that it's wrong.

  • The Aider discord community has proposed and disproven the theory that 2.5 Pro became worse, several times, through many benchmark runs.

    It had a few bugs here or there when they pushed updates, but it didn't get worse.

    • Gemini is objectively exhibiting new behavior with the same prompts and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.

      My question is not whether this is true (it is) but why it's happening.

      I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use a/b testing on benchmarks to tell them whether training or architectural changes need to be reverted.

      But all versions of aider I've tested, including the most recent one, fail to handle Gemini correctly, so I'm skeptical that they're the state of the art with respect to benchmarking Gemini.

      1 reply →

  • My use case is mostly creative writing.

    IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems to be uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.

    Entirely anecdotally, 06-05 seems to ride exactly the line of "good enough to be the best, but no better than that", presumably to save costs versus the OG 03-25.

    In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.

Are there any benchmarks that track historical performance?

> users found them more pleasing.

Some users. For me the drop was so huge it became almost unusable for the things I had used it for.

  • Same here. One of my apps outright stopped working because the gpt-4o outputs were noticeably worse than those of the gpt-4 model I had built the app around.

Quantization is a massive efficiency gain for a near-negligible drop in quality. If the tradeoff is quantization for an 80 percent price drop, I would take that any day of the week.

  • You may be right that the tradeoff is worth it, but it should be advertised as such. You shouldn't think you're paying for full o3, even if they're heavily discounting it.

  • I would like the option to pay for the unquantized version. For creative or story writing (D&D campaign materials and such), quantization seems to result in much weaker word selection and phrasing. There are small semantic missteps that break the illusion that the LLM understands what it's writing. I find it jarring and deeply immersion-breaking. I'd happily prototype prompts on a cheaper quantized version, but I want to be able to spend 50 cents an API call to get golden output.
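For readers unfamiliar with the term, the sketch below shows what weight quantization does at different bit widths and why lower widths lose more information. It is a toy per-tensor symmetric scheme in NumPy, not a description of any provider's serving stack; real deployments use per-channel scales, outlier handling, and calibration, so quality loss at int8 is normally much smaller than this toy suggests.

```python
# Toy illustration of symmetric weight quantization at different bit widths.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # fake layer weights

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize to signed integers with `bits` bits, then dequantize."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for int8
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale

for bits in (8, 6, 4):
    w_hat = quantize_roundtrip(weights, bits)
    err = np.abs(w_hat - weights).mean() / np.abs(weights).mean()
    print(f"int{bits}: mean relative weight error ~ {err:.1%}")
```

The printed error grows as the bit width shrinks, which is the intuition behind the Q8 → Q6 → Q4 worry earlier in the thread.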

The API lists o3 and o3-2025-04-16 as the same thing with the same price. The date-based models are set in stone.
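A small sketch of the distinction that comment is drawing, assuming the OpenAI Python SDK: the bare alias can be repointed to a newer snapshot over time, while the dated snapshot is meant to stay fixed, so pinning the snapshot is how an API developer rules out silent swaps on their side.

```python
# Alias vs. pinned snapshot (model names taken from the thread).
from openai import OpenAI

client = OpenAI()

ALIAS = "o3"                 # may be repointed to a newer snapshot in the future
SNAPSHOT = "o3-2025-04-16"   # dated snapshot; intended to stay fixed

for model in (ALIAS, SNAPSHOT):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with the word 'ok'."}],
    )
    # The response reports which underlying model actually served the call.
    print(model, "->", resp.model)
```

Code that cares about reproducibility would pin SNAPSHOT and treat an unexpected change in `resp.model` as a red flag.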

I don't work for OAI, so obviously I can't speak for them. But we don't do this.

We don't make the hobbyist mistake of randomly YOLO-trying various "quantization" methods that are applied only after all training is done and calling it a day. Quantization was done before the model went live.

Related: when o3 finally came out, ARC-AGI updated their graph because it didn't perform nearly as well as the version of o3 that "beat" the benchmark.

https://arcprize.org/blog/analyzing-o3-with-arc-agi

  • The o3-preview test was run with a very expensive amount of compute, right? I remember it was north of $10k, so it makes sense that it did better.

    • The point remains, though: they crushed the benchmark using a specialized model that you’ll probably never have access to, whether personally or through a company.

      They inflated expectations and then released to the public a model that underperforms.

      1 reply →

It's probably optimized in some way, but if the optimizations degrade performance, let's hope it is reflected in various benchmarks. One alternative hypothesis is that it's the same model, but in the early days they make it think "harder" and run a meta-process to collect training data for reinforcement learning for use on future models.

You can just give it a go for very little money (in Windsurf it's 1x right now) and see what it does. There is no room for conspiracy here, because you can simply look at what it does. If you don't like it, neither will others, and then people will not use it. People are obviously very capable of (collectively) forming opinions on models and voting with their wallets.