Comment by littlestymaar
5 days ago
> I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro.
This kind of expectation explains why there hasn't been a GPT-5 so far, and why we instead get a dumb numbering scheme for no good reason.
At least Anthropic eventually decided not to care anymore and released Claude 4, even if the jump from 3.7 isn't particularly spectacular. We're well into diminishing returns at this point, so it doesn't really make sense to postpone the major version bump; it's not like they're going to make a big leap again anytime soon.
I have tried Claude 4.0 for agentic programming tasks, and it really outperforms Claude 3.7 by quite a bit. I don't follow the benchmarks - I find them a bit pointless - but anecdotally, Claude 4.0 can help me in a lot of situations where 3.7 would just flounder, completely misunderstand the problem and eventually waste more of my time than it saves.
Besides, I do think that Google's Gemini 2.0, with its massively increased context window, was another "big leap". And that was released earlier this year, so I see no sign of development slowing down yet.
> We're well into the diminishing returns at this point
Scaling laws, by definition, have always had diminishing returns, because loss follows a power-law relationship with compute/params/data. But I'm assuming you mean diminishing beyond what the scaling laws predict.
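To make that concrete, here is a minimal sketch of what a pure power law implies, assuming a Kaplan-style compute exponent of about 0.05 (that exponent is my assumption for illustration, not a figure from this thread):

    # Relative loss under a pure power law L(C) ∝ C^(-alpha).
    # alpha ≈ 0.05 is an assumed, Kaplan-style compute exponent, not a measured value.
    alpha = 0.05

    for multiplier in (1, 10, 100, 1000):
        # Loss relative to the 1x baseline; the absolute constant cancels out.
        relative_loss = multiplier ** (-alpha)
        print(f"{multiplier:>5}x compute -> loss falls to {relative_loss:.3f} of baseline "
              f"({(1 - relative_loss) * 100:.1f}% reduction)")

Each additional 10x of compute buys roughly the same relative improvement, which is exactly the sense in which diminishing returns are baked into the scaling law itself.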
Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.
Because of that power-law relationship, it takes a lot of additional compute/params/data to see a big jump; the rule of thumb is that you have to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the convention of using major version numbers to denote when they more than 10x the training scale of the previous model.
* GPT-1 was 117M parameters.
* GPT-2 was 1.5B params (~10x).
* GPT-3 was 175B params (~100x GPT-2 and roughly 10x Turing-NLG, the biggest previous model).
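A quick arithmetic check on those multipliers (the parameter counts are the ones listed above; Turing-NLG's 17B is the only figure I'm adding, from memory):

    # Sanity check on the jump ratios quoted above.
    # Turing-NLG's 17B parameter count is added from memory, not from the thread.
    params = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9, "Turing-NLG": 17e9}

    print(f"GPT-2 / GPT-1:      {params['GPT-2'] / params['GPT-1']:.0f}x")       # ~13x
    print(f"GPT-3 / GPT-2:      {params['GPT-3'] / params['GPT-2']:.0f}x")       # ~117x
    print(f"GPT-3 / Turing-NLG: {params['GPT-3'] / params['Turing-NLG']:.1f}x")  # ~10x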
After that it becomes blurrier, as the field switched to MoEs (and stopped publishing parameter counts); the scaling laws for parameters apply to monolithic dense models, not really to MoEs.
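To illustrate why parameter counts stop being comparable once you go MoE, here is a toy sketch with entirely made-up numbers (nothing below reflects GPT-4's actual architecture):

    # Toy MoE vs dense comparison; all figures below are hypothetical.
    n_experts = 16          # experts per MoE layer
    top_k = 2               # experts activated per token
    expert_params = 100e9   # parameters living in the expert FFNs (total across layers)
    shared_params = 50e9    # attention, embeddings, etc., always active

    total_params  = shared_params + expert_params
    active_params = shared_params + expert_params * top_k / n_experts

    print(f"total parameters:  {total_params / 1e9:.1f}B")   # 150.0B
    print(f"active per token:  {active_params / 1e9:.1f}B")  # 62.5B

The dense-model scaling laws were fit to models where total and active parameters are the same thing, so which of those two numbers you plug in changes the prediction a lot.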
But looking at compute, we know GPT-3 was trained on ~10k V100s, while GPT-4 was reportedly trained on a ~25k A100 cluster. I don't know the training times, but we are looking at close to 10x the compute.
So to train a GPT-5-like model, we would expect ~250k A100s, or ~150k B200 chips, assuming the same training time. No one has a cluster of that size yet, but all the big players are currently building toward it.
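A back-of-the-envelope version of that compute comparison, using NVIDIA's published peak FP16 tensor throughput for the two chips (~125 TFLOPS for V100, ~312 TFLOPS for A100) and assuming equal training time and utilization:

    # Back-of-the-envelope cluster compute, assuming equal training time and utilization.
    # Per-chip numbers are NVIDIA's peak FP16 tensor-core specs, not measured MFU.
    V100_TFLOPS = 125
    A100_TFLOPS = 312

    gpt3_cluster = 10_000 * V100_TFLOPS   # ~1.25e6 TFLOPS peak
    gpt4_cluster = 25_000 * A100_TFLOPS   # ~7.8e6 TFLOPS peak

    print(f"GPT-4 / GPT-3 cluster compute: ~{gpt4_cluster / gpt3_cluster:.1f}x")  # ~6.2x

    # A further 10x over the GPT-4 cluster, in A100-equivalents:
    print(f"A100s needed for 10x GPT-4: ~{10 * 25_000:,}")  # ~250,000

That's ~6x from the hardware alone; a somewhat longer training run is what would get you to "close to 10x".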
So OpenAI might just be reserving the GPT-5 name for this 10x-GPT-4 model.
> but I am assuming you mean diminishing beyond what the scaling laws predict.
You're assuming wrong; in fact, focusing on scaling laws underestimates the rate of progress, since there is also a steady stream of algorithmic improvements.
But still, even though both hardware and software keep progressing, we are facing diminishing returns, which means there's no reason to believe we will see another leap as big as GPT-3.5 to GPT-4 in a single release. At least not until we stumble upon radically new algorithms that reset the game.
I don't think it makes any economic sense to wait until you have your “10x model” when you can release 2 or 3 incremental models in the meantime, at which point your “10x” becomes just another incremental improvement in itself.