Comment by segmondy
12 hours ago
You do realize Claude Opus/GPT-5 are probably 1000B-2000B parameter models? So expecting a model that's < 60B to offer the same level of performance will be a miracle...
I don't buy this. I've long wondered whether the larger models, while exhibiting more useful knowledge, are actually rather wasteful as we greedily explore the frontier of "bigger is getting us better results, so make it bigger". Qwen3-Coder-Next seems to be a point in favor of that thought: we need to spend some time exploring what smaller models are capable of.
Perhaps I'm grossly wrong -- I guess time will tell.
You are not wrong: small models can be trained for niche use cases, and there are lots of people and companies doing that. The problem is that you need one of those for each use case, whereas the bigger models can cover a larger problem space.
There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.
Not an AI researcher and I don't really know, but intuitively it makes a lot of sense to me.
To do well as an LLM, you want to end up with the weights that get furthest in the direction of "reasoning".
So assume that with just one language there's a possibility of getting stuck in local optima: weights that do well on the English test set but don't reason well.
If you then take the same model size but require it to learn several languages with the same number of weights, that would eliminate a lot of those local optima, because unless you manage to get the weights into a regime where real reasoning/deeper concepts are "understood", it's not possible to do well across several languages with that weight budget.
And if you speak several languages, that naturally brings in more abstraction: the concept of "cat" is different from the word "cat" in any given language, and so on.
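A toy sketch of that argument (not from the thread; the synthetic "languages", names, and data are made up purely for illustration): with a fixed parameter budget, summing the loss over several surface encodings of the same underlying concepts means weights that only fit one language's surface form can't drive the total loss down, so the shared weights are pushed toward the concepts themselves.

    # Toy illustration only: several "languages" are different random surface
    # encodings of the same concepts; one shared model with a fixed parameter
    # budget must fit all of them at once.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n_concepts, n_langs, dim = 16, 3, 8

    # Each "language" maps the same 16 concepts to a different random encoding.
    surface_vocab = [torch.randn(n_concepts, dim) for _ in range(n_langs)]

    # One shared model, same weight count regardless of how many languages it sees.
    shared = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, n_concepts))
    opt = torch.optim.Adam(shared.parameters(), lr=1e-2)

    concepts = torch.arange(n_concepts)
    for step in range(500):
        opt.zero_grad()
        # Loss summed over all languages: a shortcut that only works for
        # language 0's surface form cannot minimize this.
        loss = sum(
            nn.functional.cross_entropy(shared(vocab[concepts]), concepts)
            for vocab in surface_vocab
        )
        loss.backward()
        opt.step()

    print(f"final multi-language loss: {loss.item():.3f}")

Obviously a few linear layers aren't an LLM; the point is just that adding languages without adding weights constrains which solutions remain reachable.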
Is that counterintuitive? If I had a model trained on 10 different programming languages, including my target language, I would expect it to do better than a model trained only on my target language, simply because it has access to so much more code/algorithms/examples than in my target language alone.
I.e., there is a lot of commonality between programming languages, just as there is between human languages, so training on one language is beneficial to competency in others.
Eventually we will have smarter small models, but as of now, larger models are smarter by far. Time and experience have already answered that.
Eventually we might have smaller but just-as-smart models. There is no guarantee. There are information limits on smaller models, of course.
There is (there must be, by information theory) a size/capacity efficiency frontier. There is no particular reason to think we're anywhere near it right now.
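A back-of-envelope sketch of the "information limits" point (my own arithmetic, not from the thread): some capacity-scaling studies estimate on the order of ~2 bits of recallable knowledge per parameter, and treating that figure as a loose assumption gives a rough ceiling for the model sizes mentioned above.

    # Pure arithmetic under an assumed ~2 bits of storable knowledge per
    # parameter (a rough order-of-magnitude estimate, not a law).
    BITS_PER_PARAM = 2.0  # assumption

    def capacity_gb(params_billion: float, bits_per_param: float = BITS_PER_PARAM) -> float:
        """Rough knowledge-capacity ceiling in gigabytes for a given parameter count."""
        bits = params_billion * 1e9 * bits_per_param
        return bits / 8 / 1e9

    for size in (60, 1000, 2000):  # sizes mentioned earlier in the thread
        print(f"{size:>5}B params -> ~{capacity_gb(size):,.0f} GB ceiling")

That gives roughly 15 GB vs. 250-500 GB of raw ceiling, but says nothing about how far below the ceiling current training actually lands, which is exactly the open question about the efficiency frontier.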
Aren't both the latest Opus and Sonnet smaller than the previous versions?