Comment by mmsc

6 days ago

I understand that things are moving fast and all, but surely the... 8? models which are currently available are a bit... overwhelming for users who just want to get answers to their questions of life? What's the end goal with having so many models available?

I just can’t believe nobody at the company has enough courage to tell their leadership that their naming scheme is completely stupid and insane. Four is greater than three, and so four should be better than three. The point of a name is to describe something so that you don’t confuse your users, not to be cute.

  • The reason their naming scheme is so bad is that their initial attempts at GPT-5 failed in training. It was supposed to be done ~1 year ago. Because they'd promised that GPT-5 would be vastly more intelligent than GPT-4, they couldn't just name any random model "GPT-5", so they suddenly had to start naming things differently. So now there's GPT-4.5, GPT-4.1, the o-series, ...

    • Surely there's a less stupid way than naming two very different models o4 and 4o.

  • What’s worse is that the app doesn’t even have descriptions. As if I’m supposed to memorize the use case for each based on:

    GPT-4o

    o3

    o4-mini

    o4-mini-high

    GPT-4.5

    GPT-4.1

    GPT-4.1-mini

    • Even o3 can't figure out the naming scheme. When asked to generate names that would be easier for non-technical people to use, it assumed o3 and o4 are the smallest models:

      ```
      Below is one straightforward, user-friendly approach you could adopt. It keeps two
      dimensions only—generation and tier—and reserves an optional “optimisation” suffix
      for special-purpose variants (e.g. vision, coding, long-context).

      1. Core conventions

      Element            | Purpose                                                         | Example values
      Generation         | Major architectural release. Keep a whole number; use “.1”,    | 4, 4.1, 4.5
                         | “.2”… for mid-cycle improvements.                               |
      Tier               | Rough capability / cost band, easy to interpret.                | Lite, Standard, Pro, Ultra
      Suffix (optional)  | Special optimisation or domain specialisation.                  | -LongCtx, -Vision, -Code

      Why this works
      • No ambiguous letters or numerics – “o3” can be read as “03” or “oz”; avoid that entirely.
      • Self-explanatory language – non-technical users recognise “Lite” versus “Pro” instantly.
      • Scalable – new minor rev? bump the generation (4.2). Need a cheaper size? add a
        Nano tier without disturbing the rest.

      2. Applying it to your current list

      Current name  | Proposed new name                                         | Rationale
      GPT-4o        | GPT-4 Standard                                            | Baseline flagship of the 4-series.
      o3            | GPT-4 Lite                                                | Same generation, lowest tier.
      o4-mini       | GPT-4 Lite+ (or GPT-4 Lite LongCtx if that’s the point)   | Indicates “Lite” family but a bit more capable; “+” or a suffix clarifies how.
      o4-mini-high  | GPT-4 Standard LongCtx (or GPT-4 Lite Pro)                | Pick one dimension: either it’s still “Lite” but higher context, or it has moved into “Standard”.
      GPT-4.5       | GPT-4.5 Standard                                          | Mid-cycle architectural upgrade, default tier.
      GPT-4.1       | GPT-4.1 Standard                                          | Ditto.
      GPT-4.1-mini  | GPT-4.1 Lite                                              | Same generation, smaller/cheaper option.

      3. Quick style guide for future models
      1. Stick to two words (or two words + optional suffix) – GPT-5 Pro, GPT-5 Lite-Vision – still readable at a glance.
      2. Reserve extra punctuation for special cases only – hyphens or the “+” symbol should signal meaning, not be decorative.
      3. Publish a public matrix – a small table in docs or the dashboard that maps Generation × Tier → context length, cost, latency eliminates guesswork.

      One-line summary

      GPT-<Generation> <Tier> [-Specialisation] keeps names short, descriptive and
      future-proof—so even non-technical users can tell instantly which model suits their needs.
      ```

  • If you obfuscate the naming, you obfuscate the value proposition, and people become easier to mislead into choosing an overly expensive model. Same as with Intel CPUs, or many, many other hardware products.

  • At TechCrunch AI last week, the OpenAI guy started his presentation by acknowledging that OpenAI knows their naming is a problem and that they're working on it, but it won't be fixed immediately.

  • Came here to say this: the naming scheme is ridiculous and gets harder to follow by the day.

    For example, the other day they released a supposedly better model with a lower number..

    • I’d honestly prefer they just have 3 personas of varying cost/intelligence: Sam, Elmo and Einstein or something, then tack on the date (elmo-2025-1) and silently delete the old ones.

There's a humorous version of Poe's law that says "any sufficiently genuine attempt to explain the differences between OpenAI's models is indistinguishable from parody".

This is a much more expensive model to run and is only available to users who pay the most. I don't see an issue.

However, the "plus" plan absolutely could use some trimming.

Free users don't have this model selector, and probably don't care which model they get, so 4o is good enough. Paid users at $20/month get more models that are better, like o3. Paid users at $200/month get the best models, which also cost OpenAI the most to run, like o3-pro. I think they plan to unify them with GPT-5.

  • That doesn't help much when we're asymptotically approaching GPT-5. We're probably going to be at GPT-4.9999 soon.

    • Not necessarily true. GPT-4.1 was released after GPT-4.5-preview. The next model might be GPT-3.7.

  • I'd be curious what proportion of paid users ever switch models. I'd guess < 10%

    • I switch to o1-pro on occasion, but it is slow enough that I don't use it as much as some of the others. It is a reasonably effective last resort when I'm not getting the answer quality that I think should be achievable. It's the best available reasoning model from any provider by a noticeable margin.

      Sounds like o3-pro is even slower, which is fine as long as it's better.

      o4-mini-high is my usual go-to model if I need something better than the default GPT-4 du jour. I don't see much point in the others and don't understand why they remain available. If o3-pro really is consistently better, it will move o1-pro into that category for me.

I'd like one to do my test use case:

Port unix-sed from C to Java with a full test suite and all options supported.

Somewhere between "it answers questions of life" and "it beats PhDs at math questions", I'd like to see one LLM take this, IMO, rather "pure" language task and succeed.

It is complicated, but it isn't complex. It's string operations with a deep but not that deep expression system and flag set.

It is well described and documented on the internet, and presumably in training sets. It can be described succinctly, and virtually any programmer would understand what it entailed if it were assigned to them. It is drudge work, which is exactly the kind of opportunity for LLMs to show how they would improve true productivity.

GPT fails to do anything other than the most basic substitute operations. Claude was only slightly better, but to its detriment it hallucinated massively and wrote fake passing test cases that didn't even test the code.
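To be concrete about what "the most basic substitute operations" means, here is roughly the level the attempts reached. This is only an illustrative sketch (the class and helper names are invented for this comment, not taken from either model's output): it handles a bare s/// with the g and i flags, and everything a real port needs beyond that is noted in the comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the s/// subset of sed. A real port would also have
// to translate POSIX BRE syntax to Java regex syntax, map \1 backreferences
// to $1, and support addresses (e.g. "3,/foo/"), the hold space (h, H, g, G,
// x), branching (b, t), numeric flags like s///2, in-place editing, and more.
public class BasicSubstitute {

    // Apply a single "s/pattern/replacement/flags" command to one input line.
    static String applySubstitute(String line, String pattern,
                                  String replacement, String flags) {
        int opts = flags.contains("i") ? Pattern.CASE_INSENSITIVE : 0;
        Matcher m = Pattern.compile(pattern, opts).matcher(line);
        return flags.contains("g") ? m.replaceAll(replacement)
                                   : m.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        // echo "hello world world" | sed 's/world/there/'   -> hello there world
        System.out.println(applySubstitute("hello world world", "world", "there", ""));
        // echo "hello world world" | sed 's/world/there/g'  -> hello there there
        System.out.println(applySubstitute("hello world world", "world", "there", "g"));
    }
}
```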

The reaction I get to this test is ambivalence, but IMO if LLMs could help port entire software packages between languages with similar feature sets (similar in more than just being Turing complete), then software cross-use would explode, and maybe we could port "vulnerable" code to "safe" Rust en masse.

I get it; it's not what they are chasing customer-wise. They want to write (in n-gate terms) webcrap.

  • I have a very simple question, maybe 5 lines at best, that basically no model, reasoning or not, could grasp. For obvious reasons I'm not disclosing it here (I fear data contamination in the long run), but it reliably breaks the "reasoning" of these things. Unfortunately, I still can't try o3-pro because the API version is not easily available, and I'm certainly not willing to pay for it on the Pro plan, but when it comes to the Plus plan (if it comes) I'll try it. To this date, because of this question (and similar ones), I remain very unimpressed with these models; the marketing is a thousand times larger than the reality, and I suspect people in general are much less capable of detecting intelligence than they think.

    The normal o3 also managed to break 3 isolated Linux installations I was trying it on a few days ago. The task was very simple: set up Ubuntu with btrfs, Timeshift and grub-btrfs. It managed to fail every single time (even when searching the web), so it was not impressive either.

  • The massive real market here is enterprises that need to rewrite legacy code to modern platforms, retaining the business logic as-is but modernising the style.

    .NET Framework 4.x to .NET 10, Python 2 to 3, Java 8 to <current version>, etc...

    The advantage the LLMs have here is that staying within the same programming language and its paradigm is dramatically simpler than converting a "procedural" language like C to an object-oriented language like Java that has a wildly different standard library.

  • How does the latest Gemini 2.5 Pro Ultra Flash Max Hemi XLT release do on that task? It obviously demands a massive context window.

    • I'll check once the nitrous tanks and the aftermarket turbos I overnighted from Japan arrive.

Models are used for actual tasks where predictable behavior is a benefit. Models are also used on cutting-edge tasks where smarter/better outputs are highly valued. Some applications value speed and so a new, smaller/cheaper model can be just right.

I think the naming scheme is just fine and is very straightforward to anyone who pays the slightest bit of attention.

> users that just want to get answers to their questions of life

Those users go to chat.openai.com (or download the app), type text in the box and click send.