Comment by crowcroft
5 months ago
Sometimes I wonder if there is overfitting to benchmarks (DeepSeek is the worst for this, to me).
Claude is pretty consistently the chat I keep going back to; its responses just subjectively seem better to me, regardless of where the model actually lands on benchmarks.
> Sometimes I wonder if there is overfitting to benchmarks
There absolutely is, even when it isn't intended.
The gap between what the model is fitted to and the reality it is used in is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.
(Ok, not every problem, there's also sample efficiency, and…)
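To make the mismatch concrete, here's a toy sketch (hypothetical data, scikit-learn, nothing to do with any actual benchmark): a classifier fit against one fixed distribution scores near-perfectly on more of the same and noticeably worse once the inputs shift, which is the benchmark-vs-reality gap in miniature.

```python
# Toy illustration of benchmark overfitting via distribution shift (made-up data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Labels depend on x0 + shift * x1; `shift` stands in for the gap
    # between benchmark-style inputs and what users actually ask.
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + shift * X[:, 1] > 0).astype(int)
    return X, y

X_train, y_train = make_data(2000, shift=0.0)  # benchmark-style training data
X_bench, y_bench = make_data(2000, shift=0.0)  # fresh draw from the same benchmark
X_real, y_real = make_data(2000, shift=1.5)    # shifted "real world" usage

model = LogisticRegression().fit(X_train, y_train)
print("benchmark accuracy: ", model.score(X_bench, y_bench))  # near 1.0
print("real-world accuracy:", model.score(X_real, y_real))    # roughly 0.7
```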
Ya, Claude crushes the smell test