Comment by crowcroft
5 months ago
Sometimes I wonder if there is overfitting to benchmarks (DeepSeek is the worst for this, to me).
Claude is pretty consistently the chat I keep going back to; its responses just subjectively seem better to me, regardless of where the model actually lands on benchmarks.
> Sometimes I wonder if there is overfitting to benchmarks
There absolutely is, even when it isn't intended.
The gap between what the model is fitted to and the reality it is used in is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.
(Ok, not every problem, there's also sample efficiency, and…)
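To make the mismatch concrete, here's a toy sketch (hypothetical data, scikit-learn, nothing to do with any actual benchmark): a classifier fit against one fixed distribution scores near-perfectly on more of the same and noticeably worse once the inputs shift, which is the benchmark-vs-reality gap in miniature.

```python
# Toy illustration of benchmark overfitting via distribution shift (made-up data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Labels depend on x0 + shift * x1; `shift` stands in for the gap
    # between benchmark-style inputs and what users actually ask.
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + shift * X[:, 1] > 0).astype(int)
    return X, y

X_train, y_train = make_data(2000, shift=0.0)  # benchmark-style training data
X_bench, y_bench = make_data(2000, shift=0.0)  # fresh draw from the same benchmark
X_real, y_real = make_data(2000, shift=1.5)    # shifted "real world" usage

model = LogisticRegression().fit(X_train, y_train)
print("benchmark accuracy: ", model.score(X_bench, y_bench))  # near 1.0
print("real-world accuracy:", model.score(X_real, y_real))    # roughly 0.7
```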
Ya, Claude crushes the smell test