Comment by epolanski
13 hours ago
Yet this is how virtually everybody is benchmarking and fine-tuning.
Since Opus 4.6, I've seen later Anthropic models become more and more capable on one hand, but also less useful on multi-turn, open-ended tasks.
It feels like with each release they are more and more prone to go "their own way" and jump into the implementation as soon as they can.
I can't help but blame it on benchmarks and fine-tuning around prompt-to-solution work.