OpenAI o3 and o4-mini

4 days ago (openai.com)

Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549...

With the right knowledge and web searches, one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information, but then it started to hallucinate some details and used them in its further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.

What’s even worse, in the thinking trace it looks like it is aware that it does not have an answer and that the 399 is just an estimate. But in the answer itself it confidently states that it found the correct value.

Essentially, it hid from me the fact that it doesn’t really know and provided me with an estimate without telling me.

Now, I’m perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn’t do it. Not to lie to my face.

Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46

  • Compare to Gemini Pro 2.5:

    https://g.co/gemini/share/c8fb1c9795e4

    Of note, the final step in the CoT is:

    > Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.

    and then the response is in line with that.

    • I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file) and so even though it gave up it pushes you in the right direction, whereas o3/o4 just make up stuff.

  • I've used AI with "niche" programming questions and it's always a total let down. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

    • There's a bit of a skill to it.

      Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic.

      I'll often end up with a task that looks something like this:

      * Implement Foo with a relation to FooBar.

      * Foo should have X, Y, Z features

      * We have an existing pattern for Fidget in BigFidget. Look at that for implementation

      * Make sure you account for A, B, C. Check Widget for something similar.

      It works surprisingly well.

      4 replies →

    • It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on first prompt, but it's coherent).

      I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.

      6 replies →

    • I'm trialing co-pilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever.

      (This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)

    • I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs).

      That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.

      For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.

      3 replies →

    • People who embrace vibe coding are probably the same people who were already pseudo-vibe coding to begin with, using found fragments of code they could piece together to make things sort of work for simple tasks.

    • > I've used AI with "niche" programming questions and it's always a total let down.

      That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.

      I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively.

      > I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

      Yeah, I also don't understand the NBA. Every single one of those players show themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.

  • I imagine that after GPT-4 / o1, improvements on benchmarks have increasingly been a result of overfitting: those breakthrough models already used most of the high-quality training data available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high-quality data being generated (orders of magnitude more than what was already used on older models) to enable breakthrough improvements.

    What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying I don't know every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B type models where accuracy is king.

  • Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common.

  • AIs in general are definitely hallucinating a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make up things to answer your questions.

    • LLMs made me a lot more aware of leading questions.

      Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations.

  • How would it ever know the answer it found is true and correct though? It could as well just repeat some existing false answer that you didn't yet find on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway.

    • I would be ok with having an answer and an explanation of how it got the answer with a list of sources. And it does just that - the only problem is that both the answer and the explanation are fabrications after you double check the sources.

  • Underwhelmed compared with Gemini 2.5 Pro--however it would've been impressive a month ago I think.

  • Same thing happened when asking it a fairly simple question about dracut on Linux.

    If I went through with the changes it suggested, I wouldn't have a bootable machine.

  • > Not to lie to my face.

    Are you saying that it deliberately lied to you?

    > With the right knowledge and web searches, one can answer this question in a matter of minutes at most.

    Reminded me of the Dunning-Kruger curve, with the AI model at the first peak and you at the latter.

    • > Are you saying that it deliberately lied to you?

      Pretty much, yeah. Now, “deliberately” does imply some kind of agency or even consciousness, which I don’t believe these models have; it’s probably the result of overfitting, reward hacking, or some other issue from training, but the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact that it doesn’t know the answer, but it provides one anyway).

  • Oh boy, here comes the “it didn’t work for this one specific thing I tried” posts

    • But then how can you rely on it for things you don't know the answer to? The exercise just goes to show it still can't admit it doesn't know and lies instead.

Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.
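
If the hash really is correct, that part is hard to fake: the fixed-output sha256 that Nix wants for a fetch can only come from the actual bytes of the download. A rough sketch of just that step in Python (placeholder URL, and assuming the SRI-style hash format that flakes commonly use with fixed-output fetches like pkgs.fetchurl):

  import base64
  import hashlib
  import urllib.request

  # Placeholder URL, not the real WebStorm download link.
  url = "https://example.com/WebStorm-XXXX.Y.tar.gz"
  data = urllib.request.urlopen(url).read()
  digest = hashlib.sha256(data).digest()
  # SRI format: "sha256-" followed by base64 of the raw digest.
  print("sha256-" + base64.b64encode(digest).decode())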

  • Are you sure about all of this? You acknowledged it might be a hallucination, but you seem to mostly believe it? o3 doesn't have the ability to spin up a VM.

    https://news.ycombinator.com/item?id=43713502 is a discussion of these hallucinations.

    As for the hash, could it have simply found a listing for the package with hashes provided and used that hash?

  • That's so different from my experience. I tried to have it switch a working flake for a yarn package over to npm, and after 3 tries, with all the hints I could give it, it couldn't do it.

  • I find that so incredibly unlikely. Granted I haven't been keeping up to date with the latest LLM developments - but has there even been any actual confirmation from OpenAI that these models have the ability to do such things in the background?

  • If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.

    • I mean, a smart programmer still has to learn what NixOs and Flakes are, and based on your description and some cursory searching, a smart programmer would just go do literally anything else. Perfect thing to delegate to a machine that doesn't have to worry about motivation.

      Just jokes, idk anything about either.

      \s

Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient the Claude models have been as the best-in-class for coding.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

  • Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.

    Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

    • I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.

    • 2.5 Pro is very buggy with cursor. It often stops before generating any code. It's likely a cursor problem, but I use 3.7 because of that.

    • Eh, I wouldn't say that's accurate, I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude-Code at this point.

  • The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

    There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.

    • wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.

      Are you sure that's not 4o?

      3 replies →

    • Also, another addition: I previously tried to upload an image for ChatGPT to edit and it was incapable under the previous model I tried. Now it's able to change uploaded images using o4-mini.

  • Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)).[0] OpenAI said they got 69.1% in their blog post.

    [0] swebench.com/#verified

    • Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:

      > For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

      Arguably this shouldn't be counted though?

      [1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

      3 replies →

    • OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt

  • I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.

    • The benchmark is something you can optimize for; it doesn't mean it generalizes well. Yesterday I tried for 2 hours to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:

        switch(testFile) {
          case "test1.ase": // run this because it's a particular case 
          case "test2.ase": // run this because it's a particular case
          default:  // run something that's not working but that's ok because the previous case should
                    // give the right output for all the test files ...
        }

    • That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.

  • Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.

  • I often wonder if we could expect that to reach 80% - 90% within the next 5 years.

I have a very basic / stupid "Turing test", which is just to write a base 62 converter in C#. I would think this exact thing would be on GitHub somewhere (thus in the weights), but it has always failed for me in the past (non-scientific / didn't try every single model).

Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.

  • Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914

    • As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and that has always (shockingly, in my opinion) failed but today worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.

      However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode/decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful.

      In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
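
      For reference, a minimal sketch of the byte-array-in, string-out interface described above, in Python rather than C# (treating the input as one big integer and keeping leading zero bytes explicit):

        ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

        def base62_encode(data: bytes) -> str:
            num = int.from_bytes(data, "big")
            chars = []
            while num > 0:
                num, rem = divmod(num, 62)
                chars.append(ALPHABET[rem])
            # The integer view drops leading zero bytes, so restore them explicitly.
            zeros = len(data) - len(data.lstrip(b"\x00"))
            return ALPHABET[0] * zeros + "".join(reversed(chars))

        def base62_decode(text: str) -> bytes:
            num = 0
            for ch in text:
                num = num * 62 + ALPHABET.index(ch)
            zeros = len(text) - len(text.lstrip(ALPHABET[0]))
            body = num.to_bytes((num.bit_length() + 7) // 8, "big") if num else b""
            return b"\x00" * zeros + body

        assert base62_decode(base62_encode(b"hello world")) == b"hello world"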

      3 replies →

    • I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t.

      But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.

      But they are all quite similar and so far these new models are similar but faster IMO.

  • I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. Still not clear if anything is happening, I have barely seen any code since I asked to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.

    The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor); the code becomes fairly cryptic as a result, not nice to read. Maybe it's optimized too much for speed.

    Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.

    And the CoT summary keeps mentioning my name which is irritating.

    • It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes.

      1 reply →

  • I could be misinterpreting your claim here, but I'll point out that LLM weights don't literally encode the entirety of the training data set.

To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.

GPT-4o mini: The new moon in August 2025 will occur on August 12.

Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.

Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.

o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]

Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]

I got different answers, mostly wrong. My calendars (both paper and app versions) show me August 23 as the date.

And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
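
For what it's worth, the mechanical part of that question is straightforward: crawlers are blocked by their user-agent token. Baidu's crawler, for example, identifies itself as Baiduspider (other engines have their own tokens), so one such robots.txt entry looks like:

  User-agent: Baiduspider
  Disallow: /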

  • So I asked GPT-o4-mini-high

    "On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"

    It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.

    "The new moon in August 2025 falls on Friday, August 22, 2025"

    Now, I did not specify the timezone I was in, so our difference between the 22nd and the 23rd appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.

    • Response from Gemini 2.5 Pro for comparison -

      ``` Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.

      In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM). ```

    • "Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.

      My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.

      1 reply →

  • I would never ask any of these questions of an LLM (and I use and rely on LLMs multiple times a day), this is a job for a computer.

    I would also never ask a coworker for this precise number either.
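
    For what it's worth, a sketch of the "job for a computer" route, assuming the third-party PyEphem package (pip install ephem) is available:

      import ephem

      # next_new_moon() returns the UTC time of the next new moon after the given date.
      print(ephem.next_new_moon("2025/08/01"))
      # roughly 2025/8/23 06:06 UTC, i.e. the evening of Aug 22 in US time zones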

    • My reasoning for the plain question was: as people start to replace search engines with AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know whether this weekend would be fine for a trip, or whether the previous or next weekend would be better to book a "dark sky" tour.

      And, BTW, I thought that LLMs are computers too ;-0

      1 reply →

    • First we wanted to be able to do calculations really quickly, so we built computers.

      Then we wanted the computers to reason like humans, so we built LLMs.

      Now we want the LLMs to do calculations really quickly.

      It doesn't seem like we'll ever be satisfied.

      1 reply →

    • These models are proclaimed to be near-AGI, so they should be smart enough not to hallucinate an answer.

  • > one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."

    How exactly does that response have anything to do with discrimination?

Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.

Let's see what the pricing looks like.

> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.

Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?

  • This isn't exactly the case. The trend is a log scale, so a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per se; rather, they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.
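
    One way to formalize that claim (my reading of the comment, not a published law): treat benchmark performance as roughly linear in log-compute, so every 10x of pretraining compute buys a fixed increment.

      P(C) \approx a + b \log_{10} C
      \quad\Longrightarrow\quad
      P(10C) - P(C) = b \quad \text{(a constant gain per 10x of compute)}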

    • I am aware of that, like I said:

      > (Or at least because O(log) increases in model performance became unreasonably costly?)

      But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.

      OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.

      2 replies →

  • It doesn't need to hold forever, or even 'much longer' depending on your definition of that duration. It just needs to hold long enough to realize certain capabilities.

    Will it? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try, insofar as the whole thing is still feasible.

As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.

  • I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good, brief explanations and points out newly available options even if you don't visit the dropdown. Anthropic does similarly.

  • Gemini 2.5 Pro for every single task was the meta until this release. Will have to reassess now.

    • Mad tangent, but as an old timey MtG player it’s always jarring when someone uses “the meta” not to refer to the particular dynamics of their competitive ecosystem but to a single strategy within it. Impoverishes the concept, I feel, even in this case where I don’t actually think a single model is best at everything.

      2 replies →

  • It's becoming a bit like iPhone 3, 4... 13, 25...

    Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.

    The big step function happened when they could search the web but not much else has changed in my limited experience.

    • This is a very apt analogy.

      It gives the speaker confirmation that they're absolutely right - names are arbitrary.

      While also politely, implicitly, pointing out that the core issue is that it doesn't matter to you --- which is fine! --- but that it may just be contributing to dull conversation to be the 10th person to say as much.

  • This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.

  • It feels like all the AI companies are pulling the versions out of their arse at the moment, I think they should work backwards and work to AGI 1.0

    So my guess currently is that most are lingering at about 0.3

  • I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help

    • Lol. But that's nothing. Wait until you shimmer and wink in and out of existence, like llms do during each completion

  • [flagged]

    • I’m assuming when you say “read once”, that implies reading once every single release?

      It’s confusing. If I’m confused, it’s confusing. This is UX 101.

    • Aside from anything else, having one model called o4 and one model called 4o is confusing. And I know they haven't released o4 yet but still.

      1 reply →

    • "good at advanced reasoning", "fast at advanced reasoning", "slower at advanced reasoning but more advanced than the good one but not as fast but cant search the internet", "great at code and logic", "good for everyday tasks but awful at everything else", "faster for most questions but answers them incorrectly", "can draw but cant search", "can search but cant draw", "good for writing and doing creative things"

      1 reply →

`ETOOMANYMODELS`

Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?

I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.

I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:

* agent & tool based dev (cloud) - [top 3 models] * agent & tool based dev (local) - m1, m2, m,3 * code review / high level analysis - ... * general tech questions - ... * technical writing (ADRs, needs assessments, etc) - ...

Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).

  • LMArena might have some of the information you are looking for. It offers rankings of LLM models across main cloud offerings, and I feel that its evaluation method, human prompting and voting, is closer to real-world use case and less prone to data contamination than benchmarks.

    https://lmarena.ai/

    In the "Leaderboard">"Language" tab, it lists the top models in various categories such as overall, coding, math, and creative writing.

    In the "Leaderboard">"Price Analysis" tab, it shows a chart comparing models by cost per million tokens.

    In the "Prompt-to-Leaderboard" tab, there is even an LLM to help you find LLMs -- you enter a prompt, and it will find the top models for your particular prompt.

  • I have been using this site: https://artificialanalysis.ai/ . It's still about benchmarks, and it doesn't do deep dives into specific use cases, but it's helpful to compare models for intelligence vs cost vs latency and other characteristics.

It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.

  • They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:

    "ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."

    with rate limits unchanged

  • They are all now available on the Pro plan. Y'all really ought to have a little bit more grace to wait 30 minutes after the announcement for the rollout.

Where's the comparison with Gemini 2.5 Pro?

  • For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.

    Gemini 2.5 Pro got 72.9%

    o3 high gets 81.3%, o4-mini high gets 68.9%

  • Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

    On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.

    We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).

Maybe OpenAI needs an easy mode for all these people saying 5 choices of models (and that's only if you pay) is simply too confusing for them.

They even provide a description in the UI of each before you select it, and it defaults to a model for you.

If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.

  • I personally like being able to choose because I understand the tradeoffs and want to choose the best one for what I’m asking. So I hope this doesn’t go away.

    But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.

The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000's. And it feels like it's accelerating.

  • How is this a notable release? It's strictly worse than Gemini 2.5 on coding &c, and only an iterative improvement over their own models. The only thing that struck me as particularly interesting was the native visual reasoning.

    • It's not worse on coding. SWE Bench, Aider, live bench coding all show noticeably better results.

  • Not really. We’re definitely in the incremental improvement stage at this point. Certainly no indication that progress is “accelerating”.

If you download GIMP, Blender, etc., every user would report essentially the same experience, assuming reasonably recent hardware.

In this thread, however, there are varying experiences, from amazing to awful. I'm not saying anyone is wrong; all I'm saying is that this wide range of operational accuracy is what will eventually pop the AI bubble, in that these models can't be reliably deployed almost anywhere with any certainty or guarantees of any sort.

In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize they need some external data, and either complete a web search, or write and execute python to solve intermediate steps.

To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.
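
As a minimal sketch of that "tools inside the reasoning loop" pattern (the call_model() stub stands in for a real LLM API call; the message and tool formats here are illustrative assumptions, not OpenAI's actual interface):

  import json

  def web_search(query: str) -> str:
      return f"(stub) top results for {query!r}"      # would call a search API

  def run_python(code: str) -> str:
      return "(stub) execution output"                # would exec in a sandbox

  TOOLS = {"web_search": web_search, "run_python": run_python}

  def call_model(messages: list) -> dict:
      # Stand-in for the model: decide whether to call a tool or answer.
      # A real implementation would send `messages` to an LLM endpoint.
      return {"tool": None, "args": "", "answer": "final answer goes here"}

  def agent_loop(question: str, max_steps: int = 5) -> str:
      messages = [{"role": "user", "content": question}]
      for _ in range(max_steps):
          step = call_model(messages)
          if step["tool"] is None:
              return step["answer"]                   # model decided it is done
          result = TOOLS[step["tool"]](step["args"])  # run the requested tool
          # Tool output re-enters the context here, injecting truth back into the loop.
          messages.append({"role": "tool", "content": json.dumps(result)})
      return "gave up after max_steps"

  print(agent_loop("What is the current reading from the live sensor feed?"))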

Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7, generating tokens to reason about live sensor data and calling tools to act on it.

o3 is cheaper than o1. (per 1M tokens)

• o3 Pricing:

  - Input: $10.00  

  - Cached Input: $2.50  

  - Output: $40.00

• o1 Pricing:

  - Input: $15.00  

  - Cached Input: $7.50  

  - Output: $60.00

o4-mini pricing remains the same as o3-mini.

So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost optimized models. So that's 17 models in total and that's not even counting their older models and more specialized ones. Compare this with Anthropic that has 7 models in total and 2 main ones that they promote.

This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones. In fact based on some of the other comments here it sounds like these are just updates to their existing model, but they release them as new models to create more media buzz.

  • I'm old enough to remember the mystery and hype before o*/o1/strawberry that was supposed to be essentially AGI. We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet

    Now we're up to o4, AGI is still not even in near sight (depending on your definition, I know). And OpenAI is up to about 5000 employees. I'd think even before AGI a new model would be able to cover for at least 4500 of those employees being fired, is that not the case?

    • True.

      Deep learning models will continue to improve as we feed them more data and use more compute, but they will still fail at even very simple tasks as long as the input data are outside their training distribution. The numerous examples of ChatGPT (even the latest, most powerful versions) failing at basic questions or tasks illustrate this well. Learning from data is not enough; there is a need for the kind of system-two thinking we humans develop as we grow. It is difficult to see how deep learning and backpropagation alone will help us model that. https://medium.com/thoughts-on-machine-learning/why-sam-altm...

    • I’m not an AI researcher but I’m not convinced these contemporary artificial neural networks will get us to AGI, even assuming an acceleration to current scaling pace. Maybe my definition of AGI is off but I’m thinking what that means is a machine that can think, learn and behave in the world in ways very close to human. I think we need a fundamentally different paradigm for that. Not something that is just trained and deployed like current models, but something that is constantly observing, constantly learning and constantly interacting with the real world like we do. AHI, not AGI. True AGI may not exist because there are always compromises of some kind.

      But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.

      2 replies →

    • > Now we're up to o4, AGI is still not even in near sight (depending on your definition, I know)

      It's not only about the definition. Some Googler was sure their model was conscious.

    • Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.

      They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.

      I tried it last night with Gemini 2.5 Pro and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.

      I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.

      16 replies →

    • > We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet

      I wonder if any of the people that quit regret doing so.

      Seems a lot like Chicken Little behavior - "Oh no, the sky is falling!"

      How anyone with technical acumen thinks current AI models are conscious, let alone capable of writing new features and expanding their abilities is beyond me. Might as well be afraid of calculators revolting and taking over the world.

  • "haven't actually done much" being popularizing the chat llm and absolutely dwarfing the competition in paid usage

    • Relative to the hype they've been spinning to attract investment, casting the launch and commercialization of ChatGPT as their greatest achievement really is a quite significant downgrade, especially given that they really only got there first because they were the first entity reckless enough to deploy such a tool to the public.

      It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.

    • ChatGPT was released two and a half years ago though. Pretty sure that at some point Sam Altman had promised us AGI by now.

      The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even one year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, who built a competitive model in a year that runs for much cheaper).

      7 replies →

  • Research by METR suggests that frontier LLMs can perform software tasks over exponentially longer time horizons (measured by how long the same tasks take human engineers), with a doubling roughly every 7 months. o3 is above the trend line.

    https://x.com/METR_Evals/status/1912594122176958939

    —-

    The AlexNet paper which kickstarted the deep learning era in 2012 was ahead of the 2nd-best entry by 11%. Many published AI papers then advanced SOTA by just a couple percentage points.

    o3 high is about 9% ahead of o1 high on livebench.ai and there are also quite a few testimonials of their differences.

    Yes, AlexNet made major strides in other aspects as well but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.

    It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.

    Ref:

    - https://proceedings.neurips.cc/paper_files/paper/2012/file/c...

    - https://livebench.ai/#/

    • AlexNet improved the ImageNet error rate by 100*11/25 = 44% in relative terms.

      The o1 to o3 error rate went from 28 to 19, so 100*9/28 = 32%.

      But these are meaningless comparisons because it’s typically harder to improve already good results.

  • OpenAI isn't selling GPT-4 or o1 or o4-mini or turbo or whatever else to the general public. These announcements may as well be them releasing GPT v12.582.599385. No one outside of a small group of nerds cares. The end consumer is going to chatgpt.com and typing things in the box.

  • > This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much

    Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.

    Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.

    As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.

    As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!

    • > Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about

      Or make important investors happy, they need to justify the latest $40 billion round

  • The old Chinese strategy of having 7343 different phone models with almost the same specs to confuse the customer better

  • To use that criticism for this release ain't really fair, as these will replace the old models (o3 will replace o1, o4-mini will replace o3-mini).

    On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.

    • As someone who doesn't use anything OpenAI (for all the reasons), I have to agree with the GP. It's all baffling. Why is there an o3-mini and an o4-mini? Why on earth are there so many models?

      Once you get to this point you're putting the paradox of choice on the user - I used to use a particular brand toothpaste for years until it got to the point where I'd be in the supermarket looking at a wall of toothpaste all by the same brand with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it become impossible to know which was the right product within the brand.

      If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.

      16 replies →

  • Well, in fairness, Anthropic has less because 1) they started later, 2) could learn from competitors' mistakes, 3) focused on enterprise and not consumer, 4) have fewer resources.

    The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.

  • Model fatigue is a real thing - Particularly with their billing model that is wildly different from model to model and gives you more headroom as you spend more. We spend a lot of time and effort running tests across many models to balance for that cost/performance ratio. When you can run 300k tokens per min on a shittier model, or 10k tokens per min on a better model - you want to use the cheaper model but if the performance isn't there then you gotta pivot. Can I use tools here? Can I use function calling here? Do I use the chat API, the chat completions API, or the responses API? Do either of those work with the model I want to use, or only with other models?

    I almost wonder if this is intentional ... because when you create a quagmire of insane inter-dependent billing scenarios you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers in order to wade through the muck on your behalf.

    Dealing with OpenAI's API's is a straight up nightmare.

  • Most industries, or categories go through cycles of fragmentation and consolidation.

    AI is currently in a high growth expansion phase. The leads to rapid iteration and fragmentation because getting things released is the most important thing.

    When the models start to plateau or the demands on the industry are for profit you will see consolidation start.

  • They do this because people like to have predictability. A new model may behave quite differently on something that’s important for a use case.

    Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.

    • I have *no idea* why you're being downvoted on this.

      If I want to take advantage of a new model, I must validate that the structured queries I've made to the older models still work on the new models.

      The last time I did a validation and update. Their Responses. Had. Changed.

      API users need dependability, which means they need older models to keep being usable.

      1 reply →

  • I can not believe that we feel that this is what's most worth talking about here (by visibility). At this point I truly wonder if AI is what will make HN side with the luddites.

  • This seems like a perfect use case for "agentic" AI. OpenAI can enrich the context window with the strengths and weakness of each model, and when a user prompts for something the model can say "Hey, I'm gonna switch to another model that is better at answering this sort of question." and the user can accept or reject.

  • > This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model

    OpenAI's progress lately:

      2024 December - first reasoning model (official release)
    
      2025 February - deep research
    
      2025 March - true multi-modal image generation
    
      2025 April - reasoning model with tools
    

    I'm not sure why people say they haven't done much. We couldn't even dream of stuff like this five years ago, and now releasing groundbreaking/novel features every month is considered "meh"... I think we're spoiled and can't appreciate anything anymore :)

  • If there are incremental gains in each release, why would they hold them back? The amount of exhaust coming off of each release is gold for the internal teams. The naming convention is bad, and the CPO just admitted as much on Lenny's podcast, but I am not sure why incremental releases is a bad thing.

  • There are 9 models in the ChatGPT model picker and they have stated that it's their goal to get rid of the model picker because everyone finds it annoying.

  • Think for 30 seconds about why they might in good faith do what they do.

    Do you use any of them? Are you a developer? Just because a model is non-deterministic it doesn't mean developers don't want some level of consistency, whether it be about capabilities, cost, latency, call structure etc.

  • you'd think they could use AI to interpret the best model for your use case so you don't even have to think about it. Run the first few API calls in parallel, grade the result, and then send the rest to whatever works best

  • > All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones.

    That's not a problem in and of itself. It's only a problem if the models aren't good enough.

    Judging by ChatGPT's adoption, people seem to think they're doing just fine.

Here's a summary of this conversation so far, generated using o3 after 306 comments. This time I ran it like so:

  llm install llm-openai-plugin
  llm install llm-hacker-news
  llm -m openai/o3 -f hn:43707719 -s 'Summarize the themes of the opinions expressed here.
  For each theme, output a markdown header.
  Include direct "quotations" (with author attribution) where appropriate.
  You MUST quote directly from users when crediting them, with double quotes.
  Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'

https://gist.github.com/simonw/a35f39b070978e703d9eb8b1aa7c0... - cost 2,684 input, 2,452 output (of which 896 were reasoning tokens) which is 12.492 cents.

Then again with o4-mini using the exact same content (hence the hash ID for -f):

  llm -m openai/o4-mini \
    -f f16158f09f76ab5cb80febad60a6e9d5b96050bfcf97e972a8898c4006cbd544 \
  -s 'Summarize the themes of the opinions expressed here.
  For each theme, output a markdown header.
  Include direct "quotations" (with author attribution) where appropriate.
  You MUST quote directly from users when crediting them, with double quotes.
  Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'

Output: https://gist.github.com/simonw/b11ba0b11e71eea0292fb6adaf9cd...

Cost 2,684 input, 2,681 output (of which 1,088 reasoning tokens) = 1.4749 cents

The above uses these two plugins: https://github.com/simonw/llm-openai-plugin and https://github.com/simonw/llm-hacker-news - taking advantage of new -f "fragments" feature I released last week: https://simonwillison.net/2025/Apr/7/long-context-llm/

Tyler Cowen seems convinced: https://marginalrevolution.com/marginalrevolution/2025/04/o3...

  • It can't solve this puzzle: https://i.imgur.com/AJqbqHJ.png

        Thought for 3m 51s
        Short answer → you can’t.
    

    The breathtaking thing is not the model itself, but that someone as smart as Cowen (and he's not the only one) is uttering "AGI" in the same sentence as any of these models. Now, I'm not a hater, and for many tasks they are amazing, but they are, as of now, not even close to AGI, by any reasonable definition.

    • I work for openai.

      o4-mini gets much closer (but I'm pretty sure it fumbles at the last moment): https://chatgpt.com/share/680031fb-2bd0-8013-87ac-941fa91cea...

      We're pretty bad at model naming and communicating capabilities (in our defense, it's hard!), but o4-mini is actually a _considerably_ better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.

      4 replies →

    •   I think it is AGI, seriously.  Try asking it lots of questions, and then ask yourself: just how much smarter was I expecting AGI to be?
      

      That's his whole argument!!!! This is so frustrating coming from a public intellectual. "You don't need rigorous reasoning to answer these questions, baybeee, just go with your vibes." Complete and total disregard for scientific thinking, in favor of confirmation bias and ideology.

  • Tyler Cowen is someone I take seriously. I think he is one of the most rational thought leaders.

    But I have to say, his views on LLMs seem a little premature. He definitely has a unique viewpoint of what "general intelligence" is, which might not apply broadly to most jobs. I think he "interviews" the models as if they were guests on his podcast and bases his judgement on how they compare to his other extremely smart guests.

The most striking difference to me is that o3 and o4 know when the web search tool is unavailable, and will tell you they can't answer a question that requires it. While 4o and (sadly) 4.1 will just make up a bunch of nonsense.

I'm simultaneously impressed that they can do that, and also wondering why the heck that's so impressive (isn't "is this tool in this list?" something GPT-3 was able to handle?) and why 4.1 still fails at it too—especially considering it's hyped as the agentic coder model!

That's pretty damning for the general intelligence aspect of it, that they apparently had to special-case something so trivial... and I say that as someone who's really optimistic about this stuff!

That being said, the new "enhanced" web search seems great so far, and means I can finally delete another stupid 10 line Python script from 2023 that I shouldn't have needed in the first place ;)

(...Now if they'd just put 4.1 in the Chat... why the hell do I need to use a 3rd party UI for their best model!)

A suggestion for OpenAI to create more meaningful model names:

{Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}

Where:

* Size is XS/S/M/L/XL/XXL to indicate overall capability level

* Quarter/Year like Q2-25

* Speed/Accuracy indicated as Fast/Balanced/Precise

* Optional specialty tag like Code/Vision/Science/etc

Example model names:

* L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)

* M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)

  • This is even more incomprehensible to users who don't understand what this naming scheme is supposed to mean. Right now, most power users are keeping track of all the models and know what they are like, so this naming wouldn't help them. Normal consumers don't really know the difference between the models, but this wouldn't help them either - all those letters and numbers aren't super inviting and friendly. They could try just having a linear slider for amount of intelligence and another one for speed.

  • I think they should name them after fictional characters. Bonus points if they're trademarked characters.

    "You gotta try Mickey, it beats the crap out of Gandalf in coding."

  • Thank god we don’t usually let engineers name stuff in the west.

    While this is entirely logical in theory, this is how you get LG-style naming like “THE ALL NEW LG-CFT563-X2”.

    I mean, it makes total sense, it tells you exactly the model, region, series and edition! Right??

  • What about using Marvel superhero names (with permission, of course)? The studio keeps giving us stronger and stronger examples...

This post[1] is highlighted by Techmeme:

>I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)

Genuinely intrigued by what kind of “psychological/emotional problem I've been dealing with for years” an AI could solve within hours of its release.

[1] https://x.com/carmenleelau/status/1912645771955962300

Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.

After refreshing the browser I see that the old o3-mini-high has gone now, so I continued my coding task conversation with o4-mini-high. In two separate conversations it butchered things in a way that I never saw o3-mini-high do. In one case it rewrote working code without reason, breaking it; in the other, I asked it to apply a code fix to a function and it instead refactored that function using a different, unrelated function that was part of an earlier bit of chat history.

I notice too that it employs a different style of code where it often puts assignment on a different line, which looks like it's trying to maintain an ~80 character line limit, but does so in places where the entire line of code is only about 40 characters.

  • Not saying it's for sure the case, but it might be that the model gets confused by OOD text from the other model, whereas it expects the earlier text in the conversation to have come from itself (particularly if the CoT is used as context for later conversations).

I’m having very mixed feelings about it. I’m using o3 to help me parse and understand a book about statistics and ML, it’s very dense in math.

On one hand the answers became a lot more comprehensive and deep. It’s now able to give me very advanced explanations.

On the other hand, it started overloading the answers with information. Entire concepts became single sentence summaries. Complex topics and theorems became acronyms. In a way I’m feeling overwhelmed by the information it’s now throwing at me. I can’t tell if it’s actually smarter or just too complicated for me to understand.

  • Pretty wild that we’re at the point that the human is the limitation

    • Surprise, the machine that interpolates from a database of maths books confuses a human who wants to learn about the contents of the books in that database.

The demo video is very impressive, and it shows what AI could be. Our current models are unreliable in research, but if they were reliable, then what's shown alone would be better than AGI.

There are 8 billion+ instances of general intelligence on the planet; there isn't a shortage. I'd rather see AI do data science and applied math at computer speeds. Those are the hard problems, a lot of the AGI problems (to human brains) are easy.

So what are they selling with the 200 dollar subscription? Only a model that has now caught up with their competitor who sells for 1/10 of their price?

The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user - I understand the different pricing on the API side helps people make decisions.

o4-mini is available on vs code. I've been playing with it for the last couple of hours. It's quite fast for a thinking model.

It's also super concise with code. Where claude 3.7 and gemini 2.5 will write a ton, o4-mini will write a tiny portion of it accomplishing the same task.

On the flip side, in its conciseness it's lazier about implementation than the other leading models, missing features.

For fixing very complex typescript types, I've previously found that o1 outperformed the others. o4-mini seems to understand things well here.

I still think gemini will continue to be my favorite model for code. It's more consistent and follows instructions better.

However, openAI's more advanced models have a better shot at providing a solution when gemini and claude are stuck.

Maybe there's a win here in having o4-mini or o3 do a first draft for conciseness, revise with gemini to fill in what's missed (but with a base that is not overdone), and then run fixes with o4-mini.
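
In practice that could be as simple as chaining three calls. A rough sketch of the idea, where `ask()` is a hypothetical helper that sends a prompt to whichever provider hosts the named model (model names and prompts are purely illustrative):

    def ask(model: str, prompt: str) -> str:
        """Hypothetical helper: send `prompt` to whichever API hosts `model`."""
        raise NotImplementedError

    def build_feature(spec: str) -> str:
        # 1. Concise first draft from o4-mini
        draft = ask("o4-mini", f"Write a minimal, concise implementation of:\n{spec}")
        # 2. Let Gemini fill in whatever the lean draft missed
        revised = ask("gemini-2.5-pro",
                      f"Spec:\n{spec}\n\nDraft:\n{draft}\n\nAdd anything missing, without bloating it.")
        # 3. Final bug-fix pass back on o4-mini
        return ask("o4-mini", f"Fix any bugs or type errors in this code:\n{revised}")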

Things are still changing quite quickly.

Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously anyway, but perhaps their input is still limited in resolution? Very cool, in any case, spooky progress as always.

  • There's just a certain amount of things the image encoder can process at once. It's pretty apparent when you give the models a big table in an image.

On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.

Both seem to be better at prompt following and have more up to date knowledge.

But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.

So far with my random / coding design question that I asked with o1 last week, it did substantially better with o3. It's more like a mid-level engineer and less like an intern.

FWIW, o4-mini-high does not feel better than o3-mini-high for working on fairly simple econ theory proofs. It does feel faster. And both make elementary mistakes.

So it looks like there's no increase in context window size, since it's not mentioned anywhere.

I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.

I have been using o4-mini-high today. Most of the time for a file longer than 100 lines it stops generating randomly and won't complete a file unless I re-prompt it with the end of the missing file.

As usual, it's a frustrating experience for anything more complex than the usual problems everyone else does.

The big step function here seems to be RL on tool calling.

Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" use cases well (agent in a loop, not in an agentic workflow scaffold[0]).

OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).

o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.

My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7

tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.

[0]https://www.anthropic.com/engineering/building-effective-age...

[1]https://github.com/1rgs/claude-code-proxy

[2]https://openai.com/index/openai-codex/
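
To make "agent in a loop" concrete: the model calls tools, the results go straight back into its own context, and it decides for itself when to stop; no scaffold dictates the steps. A minimal sketch using the chat-completions function-calling API (the single shell tool and the prompt are placeholders, and a real agent would sandbox the commands):

    import json
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "run_shell",  # placeholder tool; a real agent would expose more
            "description": "Run a shell command and return its combined output",
            "parameters": {
                "type": "object",
                "properties": {"cmd": {"type": "string"}},
                "required": ["cmd"],
            },
        },
    }]

    def run_shell(cmd: str) -> str:
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return out.stdout + out.stderr

    messages = [{"role": "user", "content": "Fix the failing test in this repo."}]

    while True:
        resp = client.chat.completions.create(model="o4-mini", messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:  # the model decided it's done
            print(msg.content)
            break
        for call in msg.tool_calls:  # otherwise, run every tool it asked for
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_shell(**args),
            })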

o3 failed the first test I gave it. I wanted it to create a bar chart using Python of the first 10 Fibonacci numbers (did this easily), and then use that image as input to generate an info-graphic of the chart with an animal theme. It failed in two ways. It didn't have access to the visual output from python and, when I gave it a screenshot of that output, it failed in standard GenAI fashion by having poor / incomplete text and not adhering exactly to bar heights, which were critical in this case.

So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.

Doesn't achieving AGI mean the beginning of the end of humanity's current economic model? I'm not sure I understand the presumption by many that achieving AGI is just another step in some company's offering.

  • Most days I feel the same.

    Other days I remember that humans like "handmade" furniture, and live performances, and unique styles, and human contact.

    Perhaps there's life in us still?

A very subtle mention of o3-pro, which I'd imagine is now the most capable programming model. Excited to see when I get access to that.

Good thing I stopped working a few hours ago

EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(

I find o4 very bad at coding. I tried to improve a script created by o3-mini-high using o4-mini-high, and it doesn't return nearly as good results as what I used to get from o3-mini-high.

I’m not sure I fully understand the rationale of having newer mini versions (eg o3-mini, o4-mini) when previous thinking models (eg o1) and smart non-thinking models (eg gpt-4.1) exist. Does anyone here use these for anything?

  • I use o3-mini-high in Aider, where I want a model to employ reasoning without having to put up with the latency of the non-mini o1.

  • o1 is a much larger model that is more expensive to operate on OpenAI's end. Having a smaller "newer" (roughly equating newer to more capable) model means that you can match the performance of larger, older models while reducing inference and API costs.

I noticed that OpenAI don't compare their models to third-party models in their announcement posts, unlike Google, Meta, and the others.

At this point, it's like comparing the iPhone 5s vs the iPhone 6. The upgrades are still noticeable, but it's nowhere near the huge jump between GPT 3.5 and GPT 4.

It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?

It has been getting better IMO.

o4 is doing a better job than o3 on my current project, and while this isn’t really a priority, its personality is somehow far more engaging now.

> Downloaded an untouched char.lgp from the current Steam build (1.0.9) to make sure the count reflects the shipping game rather than a modded archive.

How?

Any quick impressions of o3 vs o1? We've got one inference in our product that only o1 has seemed to handle well, wondering if o3 can replace it.

  • They are replacing o1 with o3 in the UI, at least for me, so they must be pretty confident it is a strict improvement.

o3 joins gemini-2.5-pro as the only other model that can pace long form creative writing properly when details about the story are provided.

I'm confused. I typically use o1 for all of my questions. Now it's disappeared. Is o3 a better model?

  • Yes, in almost all aspects if you do not use the o1-pro. o3-pro is not available yet.

The most annoying part of all this is they replaced o1 with o3 without any notices or warnings. This is why I hate proprietary models.

  • Meanwhile we have people elsewhere in the thread complaining about too many models.

    Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around. When they upgrade gpt-4o they don't let you use the old version, after all.

    • >Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around.

      Imagine if every time your favorite SaaS had an update, they renamed the product. Yesterday you were using Slack S7, and today you're suddenly using Slack 9S-o. That was fine in the desktop era, when new releases happened once a year - not every few weeks. You just can't keep up with all the versions.

      I think they should just stick with one brand and announce new releases as just incremental updates to that same brand/product (even if the underlying models are different): "the DeepSearch Update" or "The April 2025 Reasoning Update" etc.

      The model picker should be replaced entirely with a router that automatically detects which underlying model to use. Power users could have optional checkboxes like "Think harder" or "Code mode" as settings, if they want to guide the router toward more specialized models.
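
      A toy sketch of what that router could look like (the heuristics and model choices here are made up purely for illustration; a real router would presumably be learned rather than hard-coded):

          from openai import OpenAI

          client = OpenAI()

          def route(prompt: str, think_harder: bool = False, code_mode: bool = False) -> str:
              # Toy heuristics: pick a model from the user's toggles and the prompt itself
              if code_mode or "```" in prompt:
                  return "o4-mini"   # cheaper reasoning model for code
              if think_harder or len(prompt) > 2000:
                  return "o3"        # full reasoning model for hard or long prompts
              return "gpt-4.1"       # fast general-purpose default

          def chat(prompt: str, **toggles) -> str:
              resp = client.chat.completions.create(
                  model=route(prompt, **toggles),
                  messages=[{"role": "user", "content": prompt}],
              )
              return resp.choices[0].message.content

      The user would just see one "ChatGPT", plus at most those two checkboxes.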

Is there a non-obvious reason why using something like Python to solve queries that require calculation wasn't done with LLMs from day one?

  • Because it's not a feature of the LLM but of the product that is built around it (like ChatGPT).

    • It's true that product provides the tools, but the model still needs to be trained to use tools, or it won't use them well or at the right times.

Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.

This is a mess. I do follow AI news, and I do not know if this is “better/faster/cheaper” than 4.1.

Why are they doing this?

Oh god. I'm Brazilian and can't get the "Verification" using my passport or ID. This is a very frightening future.

The Codex CLI looks nice, but it's a shame I have to bring my own API key when I already subscribe to ChatGPT Plus

I feel like the only reason o3 is better than o1 is the tool usage. With tool use, o1 could be similar to o3.

I wish companies would adhere to a consistent naming scheme, like <name>-<params>-<cut-off-month>.

Still a knowledge cutoff of August 2023. That is a significant bottleneck to devs using it for AI stuff.

  • I've taken to pasting in the latest OpenAI API docs for their python library to each prompt (via API, I'm not pasting each time manually in ChatGPT) so that the AI can write code that uses itself! Like, I get it, the training data thing is hard, but - OpenAI changed their python library with breaking changes and their models largely still do not know about it! I haven't tried 4.1- series yet with their newer cutoff, but, the rest of the models like o3-mini (and I presume these new ones today) still write openai python library code in the old, broken style. Argh.
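
    For anyone wanting to do the same, it really is just prepending the docs to every request; a minimal sketch assuming the post-1.0 openai-python client (the docs file name and model choice are placeholders):

        from pathlib import Path
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        # Hypothetical local copy of the current openai-python docs/README
        docs = Path("openai_python_docs.md").read_text()

        def ask(question: str) -> str:
            resp = client.chat.completions.create(
                model="o4-mini",
                messages=[
                    {"role": "system",
                     "content": "Up-to-date openai-python documentation follows; "
                                "prefer it over anything you remember:\n\n" + docs},
                    {"role": "user", "content": question},
                ],
            )
            return resp.choices[0].message.content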

I wonder where o3 and o4-mini will land on the LMarena leaderboard. When might we see them there?

Anyone got Codex working? After installing it and setting up an API key I get this error:

    system
      OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.


I want to be excited about this but after chatting with 4.1 about a simple app screenshot and it continuously forgetting and hallucinating, I am increasingly sceptical of Open AI's announcements. (No coding involved, so the context window was likely < 10% full.)

If the AI is smart, why not have it choose the model for the user?

  • That's what GPT-5 was supposed to be (instead of a new base or reasoning model), the last time Sam updated his plans, I thought. Did those plans change again?

I have barely found time to gauge 4.1's capabilities, so at this stage I'd rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF has met its match would be an understatement.

Here are some notes I made to understand each of these models and when to use them.

# OpenAI Models

## Reasoning Models (o-series)

- All `oX` (o-series) models are reasoning models.
- Use these for complex, multi-step reasoning tasks.

## Flagship/Core Models

- All `x.x` and `Xo` models are the core models.
- Use these for one-shot results.
- Examples: 4o, 4.1

## Cost Optimized

- All `-mini` and `-nano` models are cheaper, faster models.
- Use these for high-volume, low-effort tasks.

## Flagship vs Reasoning (o-series) Models

- Latest flagship model = 4.1
- Latest reasoning model = o3
- The flagship models are general purpose, typically with larger context windows. They rely mostly on pattern matching.
- The reasoning models are trained with extended chain-of-thought and reinforcement learning. They work best with tools, code, and other multi-step workflows. Because tools are used, accuracy will be higher.

# List of Models

## 4o (omni)

- 128K context window
- Use: complex multimodal applications requiring the top level of reliability and nuance

## 4o-mini

- 128K context window
- Use: multimodal reasoning for math, coding, and structured outputs
- Use: cheaper than `4o`; use when you can trade off accuracy for speed/cost
- Don't use: when high accuracy is needed

## 4.1

- 1M context window
- Use: for large context ingest, such as full codebases
- Use: for reliable instruction following and comprehension
- Don't use: for high-volume/faster tasks

## 4.1-mini

- 1M context window
- Use: for large context ingest
- Use: when a tradeoff can be made between accuracy and speed

## 4.1-nano

- 1M context window
- Use: for high-volume, near-instant responses
- Don't use: when accuracy is required
- Examples: classification, autocompletion, short answers

## o3

- 200K context window
- Use: for the most challenging reasoning tasks in coding, STEM, and vision that demand deep chain-of-thought and tool use
- Use: agentic workflows leveraging web search, Python execution, and image analysis in one coherent loop
- Don't use: for simple tasks, where a lighter model will be faster and cheaper

## o4-mini

- 200K context window
- Use: high-volume needs where reasoning and cost should be balanced
- Use: for high-throughput applications
- Don't use: when accuracy is critical

## o4-mini-high

- 200K context window
- Use: when o4-mini results are not satisfactory, but before moving to o3
- Use: complex tool-driven reasoning where o4-mini results are not satisfactory
- Don't use: when accuracy is critical

## o1-pro-mode

- 200K context window
- Use: highly specialized science, coding, or reasoning jobs that benefit from extra compute for consistency
- Don't use: for simple tasks

## Models Sorted for Complex Coding Tasks (my opinion)

1. o3
2. Gemini 2.5 Pro
3. Claude 3.7
4. o1-pro-mode
5. o4-mini-high
6. 4.1
7. o4-mini

4o and o4 at the same time. Excellent work on the product naming, whoever did that.

  • It took me reading your comment to realize that they were different and this wasn’t deja vu. Maybe that says more about me than OpenAI, but my gut agrees with you.

  • Just wait until they announce oA and A0.

    They jokingly admitted that they’re bad at naming in the 4.1 reveal video, so they’re certainly aware of the problem. They’re probably hoping to make the model lineup clearer after some of the older models get retired, but the current mess was certainly entirely foreseeable.

    • Energy Intensive Exceptional Intelligence (Omni-domain), AKA E-I-E-I-O.

What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing - maybe to distract from a lack of progress? Honestly, I have no idea which model to use for simple everyday tasks anymore.

  • It really is bizarre. If you had asked me 2 days ago I would have said unequivocally that these models already existed. Surely, given the rate of change, a date-based numbering system would be more helpful?

  • I tend to look at the lmarena leaderboard to see what to use (or the aider polyglot leaderboard for coding)

  • Seems to me like they're somewhat trying to simplify now.

    GPT-N.m -> Non-reasoning

    oN -> Reasoning

    oN+1-mini -> Reasoning but speedy; cut-down version of an upcoming oN model (unclear if true or marketing)

    It would be nice if they actually stick to this pattern.

    • I suspect that "ChatGPT-4o" is the most confusing part. Absolutely baffling to go with that and then later "oN", but surely they will avoid any "No" models moving forward

    • But we have both 4o and 4.1 for non-reasoning. And it's still not clear to me which is better (the comparison on their page was from an older version of 4o).

    • Are the oN models built on top of GPT-N.m models? It would be nice to know the lineage there.

OpenAI be like:

    o1, o1-mini,
    o1-pro, o3,
    o4-mini, gpt-4,
    gpt-4o, gpt-4-turbo,
    gpt-4.5, gpt-4.1,
    gpt-4o-mini, gpt-4.1-mini,
    gpt-4.1-nano, gpt-3.5-turbo

I have doubts whether the live stream was really live.

During the live-stream the subtitles are shown line by line.

When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.

Line-by-line subtitles are shown when the uploader provides their own captions for an existing video; the only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.