Comment by epolanski

13 hours ago

Yet this is how virtually everybody is benchmarking and fine-tuning.

Since Opus 4.6, I've seen later Anthropic models become more and more capable on the one hand, but also less useful on multi-turn, open-ended tasks.

It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.

I can't help but blame it on benchmarks and fine-tuning built around prompt-to-solution work.