Comment by culi
1 day ago
I feel like they're actually dropping slower. Chinese models are dropping right before Lunar New Year, which seems to be an emerging tradition.
A couple of Western models have dropped around the same time too, but I don't think the "strides on benchmarks" are that impressive when you consider how many tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1% buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements, and similar for the Claude model improvements.
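To make the arithmetic concrete, here's a rough sketch using only the numbers quoted above. The old cost per task is normalized to 1.0 (an assumption, since only the 4.2x multiplier is given), and "score per unit of cost" is just an illustrative ratio, not an official metric. Benchmark points also aren't necessarily linear in difficulty, so treat this as a back-of-envelope comparison.

```python
# Figures quoted in the comment above.
old_score = 33.6        # ARC-AGI-2 score before, in percent
new_score = 77.1        # ARC-AGI-2 score after, in percent
cost_multiplier = 4.2   # increase in "cost per task"

# Assumption: normalize the old cost per task to 1.0 unit,
# because the absolute costs are not given.
old_cost = 1.0
new_cost = old_cost * cost_multiplier

# How much the score grew versus how much the cost grew.
score_ratio = new_score / old_score

# Illustrative ratio: percentage points per unit of (normalized) cost.
old_points_per_cost = old_score / old_cost
new_points_per_cost = new_score / new_cost

print(f"score grew {score_ratio:.2f}x, cost grew {cost_multiplier:.1f}x")
print(f"points per unit cost: {old_points_per_cost:.1f} -> {new_points_per_cost:.1f}")
```

By this crude measure the score grew about 2.3x while the cost grew 4.2x, so points per unit of cost dropped from 33.6 to roughly 18.4.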
I'm not convinced there's been any substantial jump in capabilities. More likely, these companies have scaled their datacenters to allow for more token usage.