Comment by geraneum
3 hours ago
> Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors.
The model improvements being beyond human comprehension is one of the more ridiculous statements I’ve heard in the last couple of days about AI. We could reason about Higgs bosons and gravitational waves, but we supposedly have no ability to quantify or reason about the difference between Opus 4.7 and 4.8.
I definitely believe that you can discern differences between Opus 4.6, 4.7, and 4.8. I might also believe that you believe that you can discern improvements between Opus 4.6, 4.7, and 4.8. But conclusively, consistently, scientifically, and blindly discerning improvement is at this point restricted to problem domains that represent a vanishingly small amount of global token usage, like Erdős problems, superhuman evals, and the like. The idea that typical line-of-business use cases have seen broad and measurable improvements since Opus 4.6, and perhaps even since Opus 4.5, is mostly an illusion. It mistakes improvements in the harness for improvements in the model, and it mistakes "it's different" for "it's better".
To be clear, again, cannot stress this enough: I am NOT saying that the models have hit a limit. I am saying that the complexity of the problems most businesses throw at them has always had a limit. The models are now so intelligent that we have not, as of yet, adapted our business use-cases to make use of the new levels of intelligence. Maybe we will.
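For anyone who wants to make "blindly discerning improvement" concrete, here is a minimal sketch of one way to check it: an exact two-sided sign test on blind pairwise preferences. It is only an illustration under stated assumptions, not a method anyone in this thread has used. It assumes raters see two unlabeled outputs per task in random order and pick the better one, with ties dropped. The function name `sign_test_p_value` and the tallies in the usage lines are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_new: int, wins_old: int) -> float:
    """Exact two-sided sign test p-value for blind pairwise preferences.

    Null hypothesis: raters are equally likely to prefer either model.
    Ties are assumed to have been dropped before counting.
    """
    n = wins_new + wins_old
    if n == 0:
        return 1.0  # no decisive comparisons, so no evidence either way
    k = max(wins_new, wins_old)
    # Probability of a split at least this lopsided in one direction
    # if every preference were a fair coin flip.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test; cap at 1 for near-even splits.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical tallies, purely illustrative (not real measurements).
    print(sign_test_p_value(wins_new=60, wins_old=40))
    print(sign_test_p_value(wins_new=52, wins_old=48))
```

A small p-value only says raters could tell the outputs apart in a consistent direction on that task set. To attribute the difference to the model, the harness (prompts, tools, scaffolding) would have to be held fixed across both arms.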