Comment by iLoveOncall
16 hours ago
Given that users preferred it to Sonnet 4.5 in "only" 70% of cases (according to their blog post), I highly doubt that this is representative of real-life usage. Benchmarks are just completely meaningless.
For cases where 4.5 already met the bar, I would expect a 50% preference each way. That makes it kind of hard to make any sense of the 70% number without a bunch more details.
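To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a simple two-bucket model that is not from the blog post: some share of cases are toss-ups (50/50), and on the rest the new model wins at some fixed rate. The function names and the model itself are made up for illustration.

```python
def overall_preference(tie_share: float, win_rate_elsewhere: float) -> float:
    """Overall preference rate under the assumed two-bucket model.

    tie_share: fraction of cases where the old model already met the bar,
               so preference is assumed to split 50/50.
    win_rate_elsewhere: assumed preference rate for the new model on the rest.
    """
    return 0.5 * tie_share + win_rate_elsewhere * (1.0 - tie_share)


def tie_share_for(overall: float, win_rate_elsewhere: float) -> float:
    """Solve overall = 0.5*t + w*(1 - t) for the tie share t (requires w != 0.5)."""
    return (win_rate_elsewhere - overall) / (win_rate_elsewhere - 0.5)


if __name__ == "__main__":
    # Which mixes of toss-ups and real wins are consistent with a 70% headline?
    for w in (1.0, 0.9, 0.8):
        t = tie_share_for(0.7, w)
        print(f"win rate {w:.0%} on hard cases -> {t:.0%} of cases are toss-ups")
```

Under these assumptions, 70% overall is consistent with 60% of cases being toss-ups and the new model winning every remaining case, but also with far fewer toss-ups and a weaker edge elsewhere. The headline number alone can't tell those apart.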
Good point. So much functionality gets commoditized that we have to move the goalposts more or less constantly.