Comment by 827a

3 hours ago

I definitely believe that you can discern differences between Opus 4.6, 4.7, and 4.8. I might also believe that you believe that you can discern improvements between Opus 4.6, 4.7, and 4.8. But conclusively, consistently, scientifically, and blindly discerning improvement is at this point restricted to problem domains that represent a vanishingly small share of global token usage, like Erdős problems, superhuman evals, and the like. The idea that typical line-of-business use cases have seen broad and measurable improvements since Opus 4.6 certainly, and arguably even since Opus 4.5, is mostly an illusion that mistakes improvements in the harness for improvements in the model, and mistakes "it's different" for "it's better".

To be clear, and I cannot stress this enough: I am NOT saying that the models have hit a limit. I am saying that the complexity of the problems most businesses throw at them has always had a limit. The models are now so intelligent that we have not, as of yet, adapted our business use cases to make use of the new levels of intelligence. Maybe we will.