OpenAI o3 and o4-mini

4 days ago (openai.com)

Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549...

With the right knowledge and web searches, one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information, but then it started to hallucinate some details and used them in its further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.

What’s even worse, in the thinking trace it looks like it is aware that it does not have an answer and that the 399 is just an estimate. But in the answer itself it confidently states that it found the correct value.

Essentially, it hid from me the fact that it doesn’t really know and provided me with an estimate without telling me.

Now, I’m perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn’t do it. Not to lie to my face.

Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46

  • Compare to Gemini Pro 2.5:

    https://g.co/gemini/share/c8fb1c9795e4

    Of note, the final step in the CoT is:

    > Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.

    and then the response is in line with that.

    • I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file) and so even though it gave up it pushes you in the right direction, whereas o3/o4 just make up stuff.

  • I've used AI with "niche" programming questions and it's always a total let down. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

    • There's a bit of a skill to it.

      Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic.

      I'll often end up with a task that looks something like this:

      * Implement Foo with a relation to FooBar.

      * Foo should have X, Y, Z features

      * We have an existing pattern for Fidget in BigFidget. Look at that for implementation

      * Make sure you account for A, B, C. Check Widget for something similar.

      It works surprisingly well.

      4 replies →

    • It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on first prompt, but it's coherent).

      I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.

      6 replies →

    • I'm trialing co-pilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever.

      (This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)

    • I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs).

      That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.

      For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.

      3 replies →

    • People who embrace vibe coding are probably the same people who were already pseudo-vibe coding to begin with, using found fragments of code they could piece together to make things sort of work for simple tasks.

    • > I've used AI with "niche" programming questions and it's always a total let down.

      That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.

      I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively.

      > I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

      Yeah, I also don't understand the NBA. Every single one of those players show themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.

  • I imagine that after GPT-4 / o1, improvements on benchmarks have increasingly been a result of overfitting: those breakthrough models already used most of the high-quality training data available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high-quality data being generated (orders of magnitude more than what was already used on older models) to enable breakthrough improvements.

    What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying I don't know every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B type models where accuracy is king.

  • Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common.

  • AIs in general are definitely hallucinating a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make up things to answer your questions.

    • LLMs made me a lot more aware of leading questions.

      Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations.

  • How would it ever know the answer it found is true and correct though? It could as well just repeat some existing false answer that you didn't yet find on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway.

    • I would be ok with having an answer and an explanation of how it got the answer with a list of sources. And it does just that - the only problem is that both the answer and the explanation are fabrications after you double check the sources.

  • Underwhelmed compared with Gemini 2.5 Pro--however it would've been impressive a month ago I think.

  • Same thing happened when asking it a fairly simple question about dracut on Linux.

    If I went through with the changes it suggested, I wouldn't have a bootable machine.

  • > Not to lie to my face.

    Are you saying that it deliberately lied to you?

    > With the right knowledge and web searches, one can answer this question in a matter of minutes at most.

    Reminded me of the Dunning-Kruger curve, with the AI model at the first peak and you at the latter.

    • > Are you saying that it deliberately lied to you?

      Pretty much, yeah. Now, “deliberately” does imply some kind of agency or even consciousness, which I don’t believe these models have; it’s probably the result of overfitting, reward hacking, or some other issue from training, but the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact that it doesn’t know the answer, but it provides one anyway).

  • Oh boy, here comes the “it didn’t work for this one specific thing I tried” posts

    • But then how can you rely on it for things you don't know the answer to? The exercise just goes to show it still can't admit it doesn't know and lies instead.

Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.
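
If the hash really is correct, that part is hard to fake: the fixed-output sha256 that Nix wants for a fetch can only come from the actual bytes of the download. A rough sketch of just that step in Python (placeholder URL, and assuming the SRI-style hash format that flakes commonly use with fixed-output fetches like pkgs.fetchurl):

  import base64
  import hashlib
  import urllib.request

  # Placeholder URL, not the real WebStorm download link.
  url = "https://example.com/WebStorm-XXXX.Y.tar.gz"
  data = urllib.request.urlopen(url).read()
  digest = hashlib.sha256(data).digest()
  # SRI format: "sha256-" followed by base64 of the raw digest.
  print("sha256-" + base64.b64encode(digest).decode())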

  • Are you sure about all of this? You acknowledged it might be a hallucination, but you seem to mostly believe it? o3 doesn't have the ability to spin up a VM.

    https://news.ycombinator.com/item?id=43713502 is a discussion of these hallucinations.

    As for the hash, could it have simply found a listing for the package with hashes provided and used that hash?

  • That's so different from my experience. I tried to have it switch a working flake for a yarn package over to npm, and after 3 tries, with all the hints I could give it, it couldn't do it.

  • I find that so incredibly unlikely. Granted I haven't been keeping up to date with the latest LLM developments - but has there even been any actual confirmation from OpenAI that these models have the ability to do such things in the background?

  • If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.

    • I mean, a smart programmer still has to learn what NixOs and Flakes are, and based on your description and some cursory searching, a smart programmer would just go do literally anything else. Perfect thing to delegate to a machine that doesn't have to worry about motivation.

      Just jokes, idk anything about either.

      \s

Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient the Claude models have been as the best-in-class for coding.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

  • Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.

    Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

    • I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.

    • 2.5 Pro is very buggy with cursor. It often stops before generating any code. It's likely a cursor problem, but I use 3.7 because of that.

    • Eh, I wouldn't say that's accurate, I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude-Code at this point.

  • The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

    There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.

    • wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.

      Are you sure that's not 4o?

      3 replies →

    • Also, another addition: I previously tried to upload an image for ChatGPT to edit and it was incapable under the previous model I tried. Now it's able to change uploaded images using o4-mini.

  • Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)).[0] OpenAI said they got 69.1% in their blog post.

    [0] swebench.com/#verified

    • Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:

      > For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

      Arguably this shouldn't be counted though?

      [1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

      3 replies →

    • OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt

  • I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.

    • The benchmark is something you can optimize for; it doesn't mean it generalizes well. Yesterday I tried for 2 hours to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:

        switch(testFile) {
          case "test1.ase": // run this because it's a particular case 
          case "test2.ase": // run this because it's a particular case
          default:  // run something that's not working but that's ok because the previous case should
                    // give the right output for all the test files ...
        }

    • That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.

  • Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.

  • I often wonder if we could expect that to reach 80% - 90% within the next 5 years.

I have a very basic / stupid "Turing test", which is just to write a base 62 converter in C#. I would think this exact thing would be on GitHub somewhere (thus in the weights), but it has always failed for me in the past (non-scientific / didn't try every single model).

Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.

  • Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914

    • As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and that has always (shockingly, in my opinion) failed but today worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.

      However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode/decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful.

      In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
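
      For reference, a minimal sketch of the byte-array-in, string-out interface described above, in Python rather than C# (treating the input as one big integer and keeping leading zero bytes explicit):

        ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

        def base62_encode(data: bytes) -> str:
            num = int.from_bytes(data, "big")
            chars = []
            while num > 0:
                num, rem = divmod(num, 62)
                chars.append(ALPHABET[rem])
            # The integer view drops leading zero bytes, so restore them explicitly.
            zeros = len(data) - len(data.lstrip(b"\x00"))
            return ALPHABET[0] * zeros + "".join(reversed(chars))

        def base62_decode(text: str) -> bytes:
            num = 0
            for ch in text:
                num = num * 62 + ALPHABET.index(ch)
            zeros = len(text) - len(text.lstrip(ALPHABET[0]))
            body = num.to_bytes((num.bit_length() + 7) // 8, "big") if num else b""
            return b"\x00" * zeros + body

        assert base62_decode(base62_encode(b"hello world")) == b"hello world"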

      3 replies →

    • I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t.

      But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.

      But they are all quite similar and so far these new models are similar but faster IMO.

  • I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. Still not clear if anything is happening, I have barely seen any code since I asked to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.

    The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor); the code becomes fairly cryptic as a result, not nice to read. Maybe it's optimized too much for speed.

    Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.

    And the CoT summary keeps mentioning my name which is irritating.

    • It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes.

      1 reply →

  • I could be misinterpreting your claim here, but I'll point out that LLM weights don't literally encode the entirety of the training data set.

To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.

GPT-4o mini: The new moon in August 2025 will occur on August 12.

Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.

Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.

o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]

Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]

I got different answers, mostly wrong. My calendars (both paper and app versions) show me August 23 as the date.

And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
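
For what it's worth, the mechanical part of that question is straightforward: crawlers are blocked by their user-agent token. Baidu's crawler, for example, identifies itself as Baiduspider (other engines have their own tokens), so one such robots.txt entry looks like:

  User-agent: Baiduspider
  Disallow: /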

  • So I asked GPT-o4-mini-high

    "On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"

    It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.

    "The new moon in August 2025 falls on Friday, August 22, 2025"

    Now, I did not specify the timezone I was in, so our difference between the 22nd and the 23rd appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.

    • Response from Gemini 2.5 Pro for comparison -

      ``` Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.

      In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM). ```

    • "Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.

      My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.

      1 reply →

  • I would never ask any of these questions of an LLM (and I use and rely on LLMs multiple times a day), this is a job for a computer.

    I would also never ask a coworker for this precise number either.
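
    For what it's worth, a sketch of the "job for a computer" route, assuming the third-party PyEphem package (pip install ephem) is available:

      import ephem

      # next_new_moon() returns the UTC time of the next new moon after the given date.
      print(ephem.next_new_moon("2025/08/01"))
      # roughly 2025/8/23 06:06 UTC, i.e. the evening of Aug 22 in US time zones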

    • My reasoning for the plain question was: as people start to replace search engines with AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know whether this weekend would be fine for a trip, or whether the previous or next weekend would be better to book a "dark sky" tour.

      And, BTW, I thought that LLMs are computers too ;-0

      1 reply →

    • First we wanted to be able to do calculations really quickly, so we built computers.

      Then we wanted the computers to reason like humans, so we built LLMs.

      Now we want the LLMs to do calculations really quickly.

      It doesn't seem like we'll ever be satisfied.

      1 reply →

    • These models are proclaimed to be near-AGI, so they should be smart enough not to hallucinate an answer.

  • > one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."

    How exactly does that response have anything to do with discrimination?

Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.

Let's see what the pricing looks like.

> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.

Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?

  • This isn't exactly the case. The trend is a log scale, so a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per se; rather, they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.
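
    One way to formalize that claim (my reading of the comment, not a published law): treat benchmark performance as roughly linear in log-compute, so every 10x of pretraining compute buys a fixed increment.

      P(C) \approx a + b \log_{10} C
      \quad\Longrightarrow\quad
      P(10C) - P(C) = b \quad \text{(a constant gain per 10x of compute)}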

    • I am aware of that, like I said:

      > (Or at least because O(log) increases in model performance became unreasonably costly?)

      But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.

      OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.

      2 replies →

  • It doesn't need to hold forever, or even 'much longer' depending on your definition of that duration. It just needs to hold long enough to realize certain capabilities.

    Will it? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try, insofar as the whole thing is still feasible.

As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.

  • I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good, brief explanations and points out newly available options even if you don't visit the dropdown. Anthropic does similarly.

  • Gemini 2.5 Pro for every single task was the meta until this release. Will have to reassess now.

    • Mad tangent, but as an old timey MtG player it’s always jarring when someone uses “the meta” not to refer to the particular dynamics of their competitive ecosystem but to a single strategy within it. Impoverishes the concept, I feel, even in this case where I don’t actually think a single model is best at everything.

      2 replies →

  • It's becoming a bit like iPhone 3, 4... 13, 25...

    Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.

    The big step function happened when they could search the web but not much else has changed in my limited experience.

    • This is a very apt analogy.

      It gives the speaker confirmation that they're absolutely right - names are arbitrary.

      While also politely, implicitly, pointing out that the core issue is that it doesn't matter to you --- which is fine! --- but that it may just be contributing to dull conversation to be the 10th person to say as much.

  • This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.

  • It feels like all the AI companies are pulling the versions out of their arse at the moment, I think they should work backwards and work to AGI 1.0

    So my guess currently is that most are lingering at about 0.3

  • I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help

    • Lol. But that's nothing. Wait until you shimmer and wink in and out of existence, like llms do during each completion

  • [flagged]

    • I’m assuming when you say “read once”, that implies reading once every single release?

      It’s confusing. If I’m confused, it’s confusing. This is UX 101.

    • Aside from anything else, having one model called o4 and one model called 4o is confusing. And I know they haven't released o4 yet but still.

      1 reply →

    • "good at advanced reasoning", "fast at advanced reasoning", "slower at advanced reasoning but more advanced than the good one but not as fast but cant search the internet", "great at code and logic", "good for everyday tasks but awful at everything else", "faster for most questions but answers them incorrectly", "can draw but cant search", "can search but cant draw", "good for writing and doing creative things"

      1 reply →

`ETOOMANYMODELS`

Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?

I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.

I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:

* agent & tool based dev (cloud) - [top 3 models] * agent & tool based dev (local) - m1, m2, m,3 * code review / high level analysis - ... * general tech questions - ... * technical writing (ADRs, needs assessments, etc) - ...

Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).

  • LMArena might have some of the information you are looking for. It offers rankings of LLM models across main cloud offerings, and I feel that its evaluation method, human prompting and voting, is closer to real-world use case and less prone to data contamination than benchmarks.

    https://lmarena.ai/

    In the "Leaderboard">"Language" tab, it lists the top models in various categories such as overall, coding, math, and creative writing.

    In the "Leaderboard">"Price Analysis" tab, it shows a chart comparing models by cost per million tokens.

    In the "Prompt-to-Leaderboard" tab, there is even an LLM to help you find LLMs -- you enter a prompt, and it will find the top models for your particular prompt.

  • I have been using this site: https://artificialanalysis.ai/ . It's still about benchmarks, and it doesn't do deep dives into specific use cases, but it's helpful to compare models for intelligence vs cost vs latency and other characteristics.

It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.

  • They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:

    "ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."

    with rate limits unchanged

  • They are all now available on the Pro plan. Y'all really ought to have a little bit more grace to wait 30 minutes after the announcement for the rollout.

Where's the comparison with Gemini 2.5 Pro?

  • For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.

    Gemini 2.5 Pro got 72.9%

    o3 high gets 81.3%, o4-mini high gets 68.9%

  • Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

    On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.

    We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).

Maybe OpenAI needs an easy mode for all these people saying 5 choices of models (and that's only if you pay) is simply too confusing for them.

They even provide a description in the UI of each before you select it, and it defaults to a model for you.

If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.

  • I personally like being able to choose because I understand the tradeoffs and want to choose the best one for what I’m asking. So I hope this doesn’t go away.

    But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.

The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000's. And it feels like it's accelerating.

  • How is this a notable release? It's strictly worse than Gemini 2.5 on coding &c, and only an iterative improvement over their own models. The only thing that struck me as particularly interesting was the native visual reasoning.

    • It's not worse on coding. SWE Bench, Aider, live bench coding all show noticeably better results.

  • Not really. We’re definitely in the incremental improvement stage at this point. Certainly no indication that progress is “accelerating”.

If you download GIMP, Blender, etc., every user would report essentially the same experience, assuming reasonably recent hardware.

In this thread, however, there are varying experiences, from amazing to awful. I'm not saying anyone is wrong; all I'm saying is that this wide range of operational accuracy is what will eventually pop the AI bubble, in that these models can't be reliably deployed almost anywhere with any certainty or guarantees of any sort.

In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize they need some external data, and either complete a web search, or write and execute python to solve intermediate steps.

To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.
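
As a minimal sketch of that "tools inside the reasoning loop" pattern (the call_model() stub stands in for a real LLM API call; the message and tool formats here are illustrative assumptions, not OpenAI's actual interface):

  import json

  def web_search(query: str) -> str:
      return f"(stub) top results for {query!r}"      # would call a search API

  def run_python(code: str) -> str:
      return "(stub) execution output"                # would exec in a sandbox

  TOOLS = {"web_search": web_search, "run_python": run_python}

  def call_model(messages: list) -> dict:
      # Stand-in for the model: decide whether to call a tool or answer.
      # A real implementation would send `messages` to an LLM endpoint.
      return {"tool": None, "args": "", "answer": "final answer goes here"}

  def agent_loop(question: str, max_steps: int = 5) -> str:
      messages = [{"role": "user", "content": question}]
      for _ in range(max_steps):
          step = call_model(messages)
          if step["tool"] is None:
              return step["answer"]                   # model decided it is done
          result = TOOLS[step["tool"]](step["args"])  # run the requested tool
          # Tool output re-enters the context here, injecting truth back into the loop.
          messages.append({"role": "tool", "content": json.dumps(result)})
      return "gave up after max_steps"

  print(agent_loop("What is the current reading from the live sensor feed?"))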

Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7, generating tokens to reason about live sensor data and calling tools to act on it.

o3 is cheaper than o1. (per 1M tokens)

• o3 Pricing:

  - Input: $10.00  

  - Cached Input: $2.50  

  - Output: $40.00

• o1 Pricing:

  - Input: $15.00  

  - Cached Input: $7.50  

  - Output: $60.00

o4-mini pricing remains the same as o3-mini.

So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost optimized models. So that's 17 models in total and that's not even counting their older models and more specialized ones. Compare this with Anthropic that has 7 models in total and 2 main ones that they promote.

This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones. In fact based on some of the other comments here it sounds like these are just updates to their existing model, but they release them as new models to create more media buzz.

  • I'm old enough to remember the mystery and hype before o*/o1/strawberry that was supposed to be essentially AGI. We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet

    Now we're up to o4, AGI is still not even in near sight (depending on your definition, I know). And OpenAI is up to about 5000 employees. I'd think even before AGI a new model would be able to cover for at least 4500 of those employees being fired, is that not the case?

    • True.

      Deep learning models will continue to improve as we feed them more data and use more compute, but they will still fail at even very simple tasks as long as the input data are outside their training distribution. The numerous examples of ChatGPT (even the latest, most powerful versions) failing at basic questions or tasks illustrate this well. Learning from data is not enough; there is a need for the kind of system-two thinking we humans develop as we grow. It is difficult to see how deep learning and backpropagation alone will help us model that. https://medium.com/thoughts-on-machine-learning/why-sam-altm...

    • I’m not an AI researcher but I’m not convinced these contemporary artificial neural networks will get us to AGI, even assuming an acceleration to current scaling pace. Maybe my definition of AGI is off but I’m thinking what that means is a machine that can think, learn and behave in the world in ways very close to human. I think we need a fundamentally different paradigm for that. Not something that is just trained and deployed like current models, but something that is constantly observing, constantly learning and constantly interacting with the real world like we do. AHI, not AGI. True AGI may not exist because there are always compromises of some kind.

      But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.

      2 replies →

    • > Now we're up to o4, AGI is still not even in near sight (depending on your definition, I know)

      It's not only about the definition. Some Googler was sure their model was conscious.

    • Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.

      They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.

      I tried it last night with Gemini 2.5 Pro and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.

      I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.

      16 replies →

    • > We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet

      I wonder if any of the people that quit regret doing so.

      Seems a lot like Chicken Little behavior - "Oh no, the sky is falling!"

      How anyone with technical acumen thinks current AI models are conscious, let alone capable of writing new features and expanding their abilities is beyond me. Might as well be afraid of calculators revolting and taking over the world.

  • "haven't actually done much" being popularizing the chat llm and absolutely dwarfing the competition in paid usage

    • Relative to the hype they've been spinning to attract investment, casting the launch and commercialization of ChatGPT as their greatest achievement really is a quite significant downgrade, especially given that they really only got there first because they were the first entity reckless enough to deploy such a tool to the public.

      It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.

    • ChatGPT was released two and a half years ago though. Pretty sure that at some point Sam Altman had promised us AGI by now.

      The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even one year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, who built a competitive model in a year that runs for much cheaper).

      7 replies →

  • Research by METR suggests that frontier LLMs can perform software tasks over exponentially longer time horizons (measured by how long the same tasks take human engineers), with a doubling roughly every 7 months. o3 is above the trend line.

    https://x.com/METR_Evals/status/1912594122176958939

    —-

    The AlexNet paper which kickstarted the deep learning era in 2012 was ahead of the 2nd-best entry by 11%. Many published AI papers then advanced SOTA by just a couple percentage points.

    o3 high is about 9% ahead of o1 high on livebench.ai and there are also quite a few testimonials of their differences.

    Yes, AlexNet made major strides in other aspects as well but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.

    It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.

    Ref:

    - https://proceedings.neurips.cc/paper_files/paper/2012/file/c...

    - https://livebench.ai/#/

    • AlexNet improved the ImageNet error rate by 100*11/25 = 44% in relative terms.

      The o1 to o3 error rate went from 28 to 19, so 100*9/28 = 32%.

      But these are meaningless comparisons because it’s typically harder to improve already good results.

  • OpenAI isn't selling GPT-4 or o1 or o4-mini or turbo or whatever else to the general public. These announcements may as well be them releasing GPT v12.582.599385. No one outside of a small group of nerds cares. The end consumer is going to chatgpt.com and typing things in the box.

  • > This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much

    Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.

    Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.

    As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.

    As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!

    • > Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about

      Or make important investors happy, they need to justify the latest $40 billion round

  • The old Chinese strategy of having 7343 different phone models with almost the same specs to confuse the customer better

  • To use that criticism for this release ain't really fair, as these will replace the old models (o3 will replace o1, o4-mini will replace o3-mini).

    On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.

    • As someone who doesn't use anything OpenAI (for all the reasons), I have to agree with the GP. It's all baffling. Why is there an o3-mini and an o4-mini? Why on earth are there so many models?

      Once you get to this point you're putting the paradox of choice on the user - I used to use a particular brand toothpaste for years until it got to the point where I'd be in the supermarket looking at a wall of toothpaste all by the same brand with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it become impossible to know which was the right product within the brand.

      If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.

      16 replies →

  • Well, in fairness, Anthropic has less because 1) they started later, 2) could learn from competitors' mistakes, 3) focused on enterprise and not consumer, 4) have fewer resources.

    The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.

  • Model fatigue is a real thing - Particularly with their billing model that is wildly different from model to model and gives you more headroom as you spend more. We spend a lot of time and effort running tests across many models to balance for that cost/performance ratio. When you can run 300k tokens per min on a shittier model, or 10k tokens per min on a better model - you want to use the cheaper model but if the performance isn't there then you gotta pivot. Can I use tools here? Can I use function calling here? Do I use the chat API, the chat completions API, or the responses API? Do either of those work with the model I want to use, or only with other models?

    I almost wonder if this is intentional ... because when you create a quagmire of insane inter-dependent billing scenarios you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers in order to wade through the muck on your behalf.

    Dealing with OpenAI's API's is a straight up nightmare.

  • Most industries, or categories go through cycles of fragmentation and consolidation.

    AI is currently in a high growth expansion phase. The leads to rapid iteration and fragmentation because getting things released is the most important thing.

    When the models start to plateau or the demands on the industry are for profit you will see consolidation start.

  • They do this because people like to have predictability. A new model may behave quite differently on something that’s important for a use case.

    Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.

    • I have *no idea* why you're being downvoted on this.

      If I want to take advantage of a new model, I must validate that the structured queries I've made to the older models still work on the new models.

      The last time I did a validation and update. Their Responses. Had. Changed.

      API users need dependability, which means they need older models to keep being usable.

      1 reply →

  • I can not believe that we feel that this is what's most worth talking about here (by visibility). At this point I truly wonder if AI is what will make HN side with the luddites.

  • This seems like a perfect use case for "agentic" AI. OpenAI can enrich the context window with the strengths and weakness of each model, and when a user prompts for something the model can say "Hey, I'm gonna switch to another model that is better at answering this sort of question." and the user can accept or reject.

  • > This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model

    OpenAI's progress lately:

      2024 December - first reasoning model (official release)
    
      2025 February - deep research
    
      2025 March - true multi-modal image generation
    
      2025 April - reasoning model with tools
    

    I'm not sure why people say they haven't done much. We couldn't even dream of stuff like this five years ago, and now releasing groundbreaking/novel features every month is considered "meh"... I think we're spoiled and can't appreciate anything anymore :)

  • If there are incremental gains in each release, why would they hold them back? The amount of exhaust coming off of each release is gold for the internal teams. The naming convention is bad, and the CPO just admitted as much on Lenny's podcast, but I am not sure why incremental releases is a bad thing.

  • There are 9 models in the ChatGPT model picker and they have stated that it's their goal to get rid of the model picker because everyone finds it annoying.

  • Think for 30 seconds about why they might in good faith do what they do.

    Do you use any of them? Are you a developer? Just because a model is non-deterministic it doesn't mean developers don't want some level of consistency, whether it be about capabilities, cost, latency, call structure etc.

  • you'd think they could use AI to interpret the best model for your use case so you don't even have to think about it. Run the first few API calls in parallel, grade the result, and then send the rest to whatever works best

  • > All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones.

    That's not a problem in and of itself. It's only a problem if the models aren't good enough.

    Judging by ChatGPT's adoption, people seem to think they're doing just fine.

Here's a summary of this conversation so far, generated using o3 after 306 comments. This time I ran it like so:

  llm install llm-openai-plugin
  llm install llm-hacker-news
  llm -m openai/o3 -f hn:43707719 -s 'Summarize the themes of the opinions expressed here.
  For each theme, output a markdown header.
  Include direct "quotations" (with author attribution) where appropriate.
  You MUST quote directly from users when crediting them, with double quotes.
  Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'

https://gist.github.com/simonw/a35f39b070978e703d9eb8b1aa7c0... - cost 2,684 input, 2,452 output (of which 896 were reasoning tokens) which is 12.492 cents.

Then again with o4-mini using the exact same content (hence the hash ID for -f):

  llm -m openai/o4-mini \
    -f f16158f09f76ab5cb80febad60a6e9d5b96050bfcf97e972a8898c4006cbd544 \
  -s 'Summarize the themes of the opinions expressed here.
  For each theme, output a markdown header.
  Include direct "quotations" (with author attribution) where appropriate.
  You MUST quote directly from users when crediting them, with double quotes.
  Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'

Output: https://gist.github.com/simonw/b11ba0b11e71eea0292fb6adaf9cd...

Cost 2,684 input, 2,681 output (of which 1,088 reasoning tokens) = 1.4749 cents

The above uses these two plugins: https://github.com/simonw/llm-openai-plugin and https://github.com/simonw/llm-hacker-news - taking advantage of new -f "fragments" feature I released last week: https://simonwillison.net/2025/Apr/7/long-context-llm/

Tyler Cowen seems convinced: https://marginalrevolution.com/marginalrevolution/2025/04/o3...

  • It can't solve this puzzle: https://i.imgur.com/AJqbqHJ.png

        Thought for 3m 51s
        Short answer → you can’t.
    

    The breathtaking thing is not the model itself, but that someone as smart as Cowen (and he's not the only one) is uttering "AGI" in the same sentence as any of these models. Now, I'm not a hater, and for many tasks they are amazing, but they are, as of now, not even close to AGI, by any reasonable definition.

    • I work for openai.

      o4-mini gets much closer (but I'm pretty sure it fumbles at the last moment): https://chatgpt.com/share/680031fb-2bd0-8013-87ac-941fa91cea...

      We're pretty bad at model naming and communicating capabilities (in our defense, it's hard!), but o4-mini is actually a _considerably_ better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.

      4 replies →

    •   I think it is AGI, seriously.  Try asking it lots of questions, and then ask yourself: just how much smarter was I expecting AGI to be?
      

      That's his whole argument!!!! This is so frustrating coming from a public intellectual. "You don't need rigorous reasoning to answer these questions, baybeee, just go with your vibes." Complete and total disregard for scientific thinking, in favor of confirmation bias and ideology.

  • Tyler Cowen is someone I take seriously. I think he is one of the most rational thought leaders.

    But I have to say, his views on LLMs seem a little premature. He definitely has a unique viewpoint of what "general intelligence" is, which might not apply broadly to most jobs. I think he "interviews" the models as if they were guests on his podcast and bases his judgement on how they compare to his other extremely smart guests.

The most striking difference to me is that o3 and o4 know when the web search tool is unavailable, and will tell you they can't answer a question that requires it. While 4o and (sadly) 4.1 will just make up a bunch of nonsense.

I'm simultaneously impressed that they can do that, and also wondering why the heck that's so impressive (isn't "is this tool in this list?" something GPT-3 was able to handle?) and why 4.1 still fails at it too—especially considering it's hyped as the agentic coder model!

That's pretty damning for the general intelligence aspect of it, that they apparently had to special-case something so trivial... and I say that as someone who's really optimistic about this stuff!

That being said, the new "enhanced" web search seems great so far, and means I can finally delete another stupid 10 line Python script from 2023 that I shouldn't have needed in the first place ;)

(...Now if they'd just put 4.1 in the Chat... why the hell do I need to use a 3rd party UI for their best model!)

A suggestion for OpenAI to create more meaningful model names:

{Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}

Where:

* Size is XS/S/M/L/XL/XXL to indicate overall capability level

* Quarter/Year like Q2-25

* Speed/Accuracy indicated as Fast/Balanced/Precise

* Optional specialty tag like Code/Vision/Science/etc

Example model names:

* L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)

* M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)

  • This is even more incomprehensible to users who don't understand what this naming scheme is supposed to mean. Right now, most power users are keeping track of all the models and know what they are like, so this naming wouldn't help them. Normal consumers don't really know the difference between the models, but this wouldn't help them either - all those letters and numbers aren't super inviting and friendly. They could try just having a linear slider for amount of intelligence and another one for speed.

  • I think they should name them after fictional characters. Bonus points if they're trademarked characters.

    "You gotta try Mickey, it beats the crap out of Gandalf in coding."

  • Thank god we don’t usually let engineers name stuff in the west.

    While this is entirely logical in theory, this is how you get LG-style naming like “THE ALL NEW LG-CFT563-X2”.

    I mean, it makes total sense, it tells you exactly the model, region, series and edition! Right??

  • What about using Marvel superhero names (with permission, of course)? The studio keeps giving us stronger and stronger examples...

This post[1] is highlighted by Techmeme:

>I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)

Genuinely intrigued by what kind of “psychological/emotional problem I've been dealing with for years” an AI could solve within hours of its release.

[1] https://x.com/carmenleelau/status/1912645771955962300

Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.

After refreshing the browser I see that the old o3-mini-high has gone now, so I continued my coding task conversation with o4-mini-high. In two separate conversations it butchered things in a way that I never saw o3-mini-high do. In one case it rewrote working code without reason, breaking it; in the other, I asked it to apply a code fix to a function and it instead refactored that function using a different, unrelated function that was part of an earlier bit of chat history.

I notice too that it employs a different style of code where it often puts assignment on a different line, which looks like it's trying to maintain an ~80 character line limit, but does so in places where the entire line of code is only about 40 characters.

  • Not saying it's for sure the case, but it might be that the model gets confused by OOD text from the other model, whereas it expects the earlier text in the conversation to have come from itself (particularly if the CoT is used as context for later conversations).

I’m having very mixed feelings about it. I’m using o3 to help me parse and understand a book about statistics and ML, it’s very dense in math.

On one hand the answers became a lot more comprehensive and deep. It’s now able to give me very advanced explanations.

On the other hand, it started overloading the answers with information. Entire concepts became single sentence summaries. Complex topics and theorems became acronyms. In a way I’m feeling overwhelmed by the information it’s now throwing at me. I can’t tell if it’s actually smarter or just too complicated for me to understand.

  • Pretty wild that we’re at the point that the human is the limitation

    • Surprise, the machine that interpolates from a database of maths books confuses a human who wants to learn about the contents of the books in that database.

The demo video is very impressive, and it shows what AI could be. Our current models are unreliable in research, but if they were reliable, then what's shown alone would be better than AGI.

There are 8 billion+ instances of general intelligence on the planet; there isn't a shortage. I'd rather see AI do data science and applied math at computer speeds. Those are the hard problems, a lot of the AGI problems (to human brains) are easy.

So what are they selling with the 200 dollar subscription? Only a model that has now caught up with their competitor who sells for 1/10 of their price?

The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user - I understand the different pricing on the API side helps people make decisions.

o4-mini is available on vs code. I've been playing with it for the last couple of hours. It's quite fast for a thinking model.

It's also super concise with code. Where claude 3.7 and gemini 2.5 will write a ton, o4-mini will write a tiny portion of it accomplishing the same task.

On the flip side, in its conciseness it's lazier about implementation than the other leading models, missing features.

For fixing very complex typescript types, I've previously found that o1 outperformed the others. o4-mini seems to understand things well here.

I still think gemini will continue to be my favorite model for code. It's more consistent and follows instructions better.

However, openAI's more advanced models have a better shot at providing a solution when gemini and claude are stuck.

Maybe there's a win here in having o4-mini or o3 do a first draft for conciseness, revise with gemini to fill in what's missed (but with a base that is not overdone), and then run fixes with o4-mini.
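
In practice that could be as simple as chaining three calls. A rough sketch of the idea, where `ask()` is a hypothetical helper that sends a prompt to whichever provider hosts the named model (model names and prompts are purely illustrative):

    def ask(model: str, prompt: str) -> str:
        """Hypothetical helper: send `prompt` to whichever API hosts `model`."""
        raise NotImplementedError

    def build_feature(spec: str) -> str:
        # 1. Concise first draft from o4-mini
        draft = ask("o4-mini", f"Write a minimal, concise implementation of:\n{spec}")
        # 2. Let Gemini fill in whatever the lean draft missed
        revised = ask("gemini-2.5-pro",
                      f"Spec:\n{spec}\n\nDraft:\n{draft}\n\nAdd anything missing, without bloating it.")
        # 3. Final bug-fix pass back on o4-mini
        return ask("o4-mini", f"Fix any bugs or type errors in this code:\n{revised}")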

Things are still changing quite quickly.

Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously anyway, but perhaps their input is still limited in resolution? Very cool, in any case, spooky progress as always.

  • There's just a certain amount of things the image encoder can process at once. It's pretty apparent when you give the models a big table in an image.

On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.

Both seem to be better at prompt following and have more up to date knowledge.

But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.

So far with my random / coding design question that I asked with o1 last week, it did substantially better with o3. It's more like a mid-level engineer and less like an intern.

FWIW, o4-mini-high does not feel better than o3-mini-high for working on fairly simple econ theory proofs. It does feel faster. And both make elementary mistakes.

So it looks like there's no increase in context window size, since it's not mentioned anywhere.

I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.

I have been using o4-mini-high today. Most of the time for a file longer than 100 lines it stops generating randomly and won't complete a file unless I re-prompt it with the end of the missing file.

As usual, it's a frustrating experience for anything more complex than the usual problems everyone else does.

The big step function here seems to be RL on tool calling.

Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" use cases well (agent in a loop, not in an agentic workflow scaffold[0]).

OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).

o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.

My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7

tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.

[0]https://www.anthropic.com/engineering/building-effective-age...

[1]https://github.com/1rgs/claude-code-proxy

[2]https://openai.com/index/openai-codex/
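
To make "agent in a loop" concrete: the model calls tools, the results go straight back into its own context, and it decides for itself when to stop; no scaffold dictates the steps. A minimal sketch using the chat-completions function-calling API (the single shell tool and the prompt are placeholders, and a real agent would sandbox the commands):

    import json
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "run_shell",  # placeholder tool; a real agent would expose more
            "description": "Run a shell command and return its combined output",
            "parameters": {
                "type": "object",
                "properties": {"cmd": {"type": "string"}},
                "required": ["cmd"],
            },
        },
    }]

    def run_shell(cmd: str) -> str:
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return out.stdout + out.stderr

    messages = [{"role": "user", "content": "Fix the failing test in this repo."}]

    while True:
        resp = client.chat.completions.create(model="o4-mini", messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:  # the model decided it's done
            print(msg.content)
            break
        for call in msg.tool_calls:  # otherwise, run every tool it asked for
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_shell(**args),
            })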

o3 failed the first test I gave it. I wanted it to create a bar chart using Python of the first 10 Fibonacci numbers (did this easily), and then use that image as input to generate an info-graphic of the chart with an animal theme. It failed in two ways. It didn't have access to the visual output from python and, when I gave it a screenshot of that output, it failed in standard GenAI fashion by having poor / incomplete text and not adhering exactly to bar heights, which were critical in this case.

So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.

Doesn't achieving AGI mean the beginning of the end of humanity's current economic model? I'm not sure I understand the presumption by many that achieving AGI is just another step in some company's offering.

  • Most days I feel the same.

    Other days I remember that humans like "handmade" furniture, and live performances, and unique styles, and human contact.

    Perhaps there's life in us still?

A very subtle mention of o3-pro, which I'd imagine is now the most capable programming model. Excited to see when I get access to that.

Good thing I stopped working a few hours ago

EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(

I find o4 very bad at coding. I tried to improve a script created by o3-mini-high using o4-mini-high, and it doesn't return nearly as good results as what I used to get from o3-mini-high.

I’m not sure I fully understand the rationale of having newer mini versions (eg o3-mini, o4-mini) when previous thinking models (eg o1) and smart non-thinking models (eg gpt-4.1) exist. Does anyone here use these for anything?

  • I use o3-mini-high in Aider, where I want a model to employ reasoning without having to put up with the latency of the non-mini o1.

  • o1 is a much larger model that is more expensive to operate on OpenAI's end. Having a smaller "newer" (roughly equating newer to more capable) model means that you can match the performance of larger, older models while reducing inference and API costs.

I noticed that OpenAI don't compare their models to third-party models in their announcement posts, unlike Google, Meta, and the others.

At this point, it's like comparing the iPhone 5s vs the iPhone 6. The upgrades are still noticeable, but it's nowhere near the huge jump between GPT 3.5 and GPT 4.

It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?

It has been getting better IMO.

o4 is doing a better job than o3 on my current project, and while this isn’t really a priority, its personality is somehow far more engaging now.

> Downloaded an untouched char.lgp from the current Steam build (1.0.9) to make sure the count reflects the shipping game rather than a modded archive.

How?

Any quick impressions of o3 vs o1? We've got one inference in our product that only o1 has seemed to handle well, wondering if o3 can replace it.

  • They are replacing o1 with o3 in the UI, at least for me, so they must be pretty confident it is a strict improvement.

o3 joins gemini-2.5-pro as the only other model that can pace long form creative writing properly when details about the story are provided.

I'm confused. I typically use o1 for all of my questions. Now it's disappeared. Is o3 a better model?

  • Yes, in almost all aspects if you do not use the o1-pro. o3-pro is not available yet.

The most annoying part of all this is they replaced o1 with o3 without any notices or warnings. This is why I hate proprietary models.

  • Meanwhile we have people elsewhere in the thread complaining about too many models.

    Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around. When they upgrade gpt-4o they don't let you use the old version, after all.

    • >Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around.

      Imagine if every time your favorite SaaS had an update, they renamed the product. Yesterday you were using Slack S7, and today you're suddenly using Slack 9S-o. That was fine in the desktop era, when new releases happened once a year - not every few weeks. You just can't keep up with all the versions.

      I think they should just stick with one brand and announce new releases as just incremental updates to that same brand/product (even if the underlying models are different): "the DeepSearch Update" or "The April 2025 Reasoning Update" etc.

      The model picker should be replaced entirely with a router that automatically detects which underlying model to use. Power users could have optional checkboxes like "Think harder" or "Code mode" as settings, if they want to guide the router toward more specialized models.
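
      A toy sketch of what that router could look like (the heuristics and model choices here are made up purely for illustration; a real router would presumably be learned rather than hard-coded):

          from openai import OpenAI

          client = OpenAI()

          def route(prompt: str, think_harder: bool = False, code_mode: bool = False) -> str:
              # Toy heuristics: pick a model from the user's toggles and the prompt itself
              if code_mode or "```" in prompt:
                  return "o4-mini"   # cheaper reasoning model for code
              if think_harder or len(prompt) > 2000:
                  return "o3"        # full reasoning model for hard or long prompts
              return "gpt-4.1"       # fast general-purpose default

          def chat(prompt: str, **toggles) -> str:
              resp = client.chat.completions.create(
                  model=route(prompt, **toggles),
                  messages=[{"role": "user", "content": prompt}],
              )
              return resp.choices[0].message.content

      The user would just see one "ChatGPT", plus at most those two checkboxes.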

Is there a non-obvious reason why using something like Python to solve queries that require calculation wasn't done with LLMs from day one?

  • Because it's not a feature of the LLM but of the product that is built around it (like ChatGPT).

    • It's true that product provides the tools, but the model still needs to be trained to use tools, or it won't use them well or at the right times.

Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.

This is a mess. I do follow AI news, and I do not know if this is “better/faster/cheaper” than 4.1.

Why are they doing this?

Oh god. I'm Brazilian and can't get the "Verification" using my passport or ID. This is a very frightening future.

The Codex CLI looks nice, but it's a shame I have to bring my own API key when I already subscribe to ChatGPT Plus

I feel like the only reason o3 is better than o1 is the tool usage. With tool use, o1 could be similar to o3.

I wish companies would adhere to a consistent naming scheme, like <name>-<params>-<cut-off-month>.

Still a knowledge cutoff of August 2023. That is a significant bottleneck to devs using it for AI stuff.

  • I've taken to pasting in the latest OpenAI API docs for their python library to each prompt (via API, I'm not pasting each time manually in ChatGPT) so that the AI can write code that uses itself! Like, I get it, the training data thing is hard, but - OpenAI changed their python library with breaking changes and their models largely still do not know about it! I haven't tried 4.1- series yet with their newer cutoff, but, the rest of the models like o3-mini (and I presume these new ones today) still write openai python library code in the old, broken style. Argh.
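
    For anyone wanting to do the same, it really is just prepending the docs to every request; a minimal sketch assuming the post-1.0 openai-python client (the docs file name and model choice are placeholders):

        from pathlib import Path
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        # Hypothetical local copy of the current openai-python docs/README
        docs = Path("openai_python_docs.md").read_text()

        def ask(question: str) -> str:
            resp = client.chat.completions.create(
                model="o4-mini",
                messages=[
                    {"role": "system",
                     "content": "Up-to-date openai-python documentation follows; "
                                "prefer it over anything you remember:\n\n" + docs},
                    {"role": "user", "content": question},
                ],
            )
            return resp.choices[0].message.content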

I wonder where o3 and o4-mini will land on the LMarena leaderboard. When might we see them there?

Anyone got Codex working? After installing it and setting up an API key I get this error:

    system
      OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.


I want to be excited about this but after chatting with 4.1 about a simple app screenshot and it continuously forgetting and hallucinating, I am increasingly sceptical of Open AI's announcements. (No coding involved, so the context window was likely < 10% full.)

If the AI is smart, why not have it choose the model for the user?

  • That's what GPT-5 was supposed to be (instead of a new base or reasoning model), the last time Sam updated his plans, I thought. Did those plans change again?

I have barely found time to gauge 4.1's capabilities, so at this stage I'd rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF has met its match would be an understatement.

Here are some notes I made to understand each of these models and when to use them.

# OpenAI Models

## Reasoning Models (o-series)

- All `oX` (o-series) models are reasoning models.
- Use these for complex, multi-step reasoning tasks.

## Flagship/Core Models

- All `x.x` and `Xo` models are the core models.
- Use these for one-shot results.
- Examples: 4o, 4.1

## Cost Optimized

- All `-mini` and `-nano` models are cheaper, faster models.
- Use these for high-volume, low-effort tasks.

## Flagship vs Reasoning (o-series) Models

- Latest flagship model = 4.1
- Latest reasoning model = o3
- The flagship models are general purpose, typically with larger context windows. They rely mostly on pattern matching.
- The reasoning models are trained with extended chain-of-thought and reinforcement learning. They work best with tools, code, and other multi-step workflows. Because tools are used, accuracy will be higher.

# List of Models

## 4o (omni)

- 128K context window
- Use: complex multimodal applications requiring the top level of reliability and nuance

## 4o-mini

- 128K context window
- Use: multimodal reasoning for math, coding, and structured outputs
- Use: cheaper than `4o`; use when you can trade off accuracy for speed/cost
- Don't use: when high accuracy is needed

## 4.1

- 1M context window
- Use: for large context ingest, such as full codebases
- Use: for reliable instruction following and comprehension
- Don't use: for high-volume/faster tasks

## 4.1-mini

- 1M context window
- Use: for large context ingest
- Use: when a tradeoff can be made between accuracy and speed

## 4.1-nano

- 1M context window
- Use: for high-volume, near-instant responses
- Don't use: when accuracy is required
- Examples: classification, autocompletion, short answers

## o3

- 200K context window
- Use: for the most challenging reasoning tasks in coding, STEM, and vision that demand deep chain-of-thought and tool use
- Use: agentic workflows leveraging web search, Python execution, and image analysis in one coherent loop
- Don't use: for simple tasks, where a lighter model will be faster and cheaper

## o4-mini

- 200K context window
- Use: high-volume needs where reasoning and cost should be balanced
- Use: for high-throughput applications
- Don't use: when accuracy is critical

## o4-mini-high

- 200K context window
- Use: when o4-mini results are not satisfactory, but before moving to o3
- Use: complex tool-driven reasoning where o4-mini results are not satisfactory
- Don't use: when accuracy is critical

## o1-pro-mode

- 200K context window
- Use: highly specialized science, coding, or reasoning jobs that benefit from extra compute for consistency
- Don't use: for simple tasks

## Models Sorted for Complex Coding Tasks (my opinion)

1. o3
2. Gemini 2.5 Pro
3. Claude 3.7
4. o1-pro-mode
5. o4-mini-high
6. 4.1
7. o4-mini

4o and o4 at the same time. Excellent work on the product naming, whoever did that.

  • It took me reading your comment to realize that they were different and this wasn’t deja vu. Maybe that says more about me than OpenAI, but my gut agrees with you.

  • Just wait until they announce oA and A0.

    They jokingly admitted that they’re bad at naming in the 4.1 reveal video, so they’re certainly aware of the problem. They’re probably hoping to make the model lineup clearer after some of the older models get retired, but the current mess was certainly entirely foreseeable.

    • Energy Intensive Exceptional Intelligence (Omni-domain), AKA E-I-E-I-O.

What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing - maybe to distract from a lack of progress? Honestly, I have no idea which model to use for simple everyday tasks anymore.

  • It really is bizarre. If you had asked me 2 days ago I would have said unequivocally that these models already existed. Surely, given the rate of change, a date-based numbering system would be more helpful?

  • I tend to look at the lmarena leaderboard to see what to use (or the aider polyglot leaderboard for coding)

  • Seems to me like they're somewhat trying to simplify now.

    GPT-N.m -> Non-reasoning

    oN -> Reasoning

    oN+1-mini -> Reasoning but speedy; cut-down version of an upcoming oN model (unclear if true or marketing)

    It would be nice if they actually stick to this pattern.

    • I suspect that "ChatGPT-4o" is the most confusing part. Absolutely baffling to go with that and then later "oN", but surely they will avoid any "No" models moving forward

    • But we have both 4o and 4.1 for non-reasoning. And it's still not clear to me which is better (the comparison on their page was from an older version of 4o).

    • Are the oN models built on top of GPT-N.m models? It would be nice to know the lineage there.

OpenAI be like:

    o1, o1-mini,
    o1-pro, o3,
    o4-mini, gpt-4,
    gpt-4o, gpt-4-turbo,
    gpt-4.5, gpt-4.1,
    gpt-4o-mini, gpt-4.1-mini,
    gpt-4.1-nano, gpt-3.5-turbo

I have doubts whether the live stream was really live.

During the live-stream the subtitles are shown line by line.

When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.

Line-by-line subtitles are shown when the uploader provides their own captions for an existing video; the only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.