Comment by Voloskaya
5 days ago
> We're well into the diminishing returns at this point
Scaling laws, by definition, have always had diminishing returns, because they describe a power-law relationship between loss and compute/params/data, but I am assuming you mean diminishing beyond what the scaling laws predict.
Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.
Because of that power-law relationship, you have to add a lot of compute/params/data to see a big jump; the rule of thumb is that you need to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the convention of using major version numbers to denote when they more-than-10x the training scale of the previous model.
* GPT-1 was 117M parameters.
* GPT-2 was 1.5B params (~10x).
* GPT-3 was 175B params (~100x GPT-2 and ~10x Turing-NLG, the largest previous model).
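To make the "diminishing returns by definition" point concrete, here's a minimal sketch using the Kaplan et al. (2020) parameter scaling law, L(N) = (Nc/N)^alpha. The constants are the published fits, used purely for illustration, not a claim about current frontier models:

```python
# Kaplan-style power law: loss as a function of parameter count.
# Nc and alpha are the fits reported in Kaplan et al. (2020); illustrative only.
Nc, alpha = 8.8e13, 0.076

def loss(n_params: float) -> float:
    return (Nc / n_params) ** alpha

# GPT-1, GPT-2, GPT-3, and a hypothetical 10x-GPT-3 dense model
for n in (117e6, 1.5e9, 175e9, 1.75e12):
    print(f"{n:9.3g} params -> predicted loss {loss(n):.2f}")

# Every 10x in parameters multiplies the loss by the same factor (10**-alpha ~ 0.84),
# so each successive 10x buys a smaller absolute improvement: diminishing returns
# are baked into the power law itself.
```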
After GPT-3 it becomes blurrier, since everyone switched to MoEs (and stopped publishing numbers); the parameter scaling laws apply to monolithic dense models, not really to MoEs.
But looking at compute, we know GPT-3 was trained on ~10k V100s, while GPT-4 was trained on a ~25k A100 cluster. I don't know the training times, but that's close to 10x the compute.
So to train a GPT-5-like model, we would expect ~250k A100s, or ~150k B200s, assuming the same training time. No one has a cluster of that size yet, but all the big players are currently building toward one.
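Rough arithmetic behind that estimate, using the GPT-4 cluster size above and the ~1.7x A100-to-B200 factor implied by those numbers (the real effective speedup depends on precision, interconnect, and utilization, so treat both factors as assumptions):

```python
# Back-of-envelope cluster sizing for a "10x GPT-4" training run,
# assuming equal training time and the rough factors below.
gpt4_cluster_a100 = 25_000   # reported GPT-4 training cluster (approx.)
scale_factor = 10            # one "major version" jump in training compute
b200_per_a100 = 1.7          # assumed effective per-chip speedup, A100 -> B200

gpt5_a100_equiv = gpt4_cluster_a100 * scale_factor   # ~250k A100s
gpt5_b200 = gpt5_a100_equiv / b200_per_a100          # ~150k B200s

print(f"~{gpt5_a100_equiv:,} A100s or ~{gpt5_b200:,.0f} B200s")
```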
So OpenAI might just be reserving the GPT-5 name for this 10x-GPT-4 model.
> but I am assuming you mean diminishing beyond what the scaling laws predict.
You're assuming wrong; in fact, focusing on scaling laws underestimates the rate of progress, since there is also a steady stream of algorithmic improvements.
But still, even with hardware and software progressing, we are facing diminishing returns, and that means there's no reason to believe we will see another leap as big as GPT-3.5 to GPT-4 in a single release; at least not until we stumble upon radically new algorithms that reset the game.
I don't think it makes any economic sense to wait until you have your “10x model” when you can release 2 or 3 incremental models in the meantime, at which point your “10x” becomes an incremental improvement in itself.