Comment by spoaceman7777
3 hours ago
Wow. They must have had some major breakthrough. Those scores are truly insane. O_O
Models have begun to fairly thoroughly saturate the "knowledge" benchmarks, though there are still considerable bumps there.
But the _big news_, and the demonstration of their achievement here, is the set of incredible scores they've racked up on what's needed for agentic AI to become widely deployable: t2-bench, visual comprehension, computer use, Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool and into the realm where it can actually handle complex tasks the way businesses need in order to reap rewards from deploying AI tech.
Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.
And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD
The problem is that we know the benchmarks in advance. Take Humanity's Last Exam, for example: it's way easier to optimize your model when you've seen the questions before.
It's the other way around too: HLE questions were selected adversarially to reduce scores. I'd guess that even if the questions had never been released, the scores would still improve as new training data was introduced.
This. A lot of boosters point to benchmarks as justification for their claims, but any gamer who has spent time in the benchmark trenches knows full well that vendors game known tests for better scores, and that those scores aren't necessarily indicative of superior performance. There's not a doubt in my mind that AI companies are doing the same.
Shouldn't we expect that all of the companies are doing this optimization, though? So we're back to a level playing field.