Comment by hamdouni

21 hours ago

Maybe this can help

https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

It doesn't. I get that it's _a_ benchmark; it's just not a good or insightful one, and seeing it posted so often on HN feels like low-quality spam at this point.

  • The issue is that benchmarks that look insightful quickly end up being gamed by labs (Goodhart's law).

    The best LLM benchmarks test around the margins of those behaviors: tasks that are difficult and correlate with usefulness, while being obscure enough to stay unpolluted by training-time optimization.