Comment by avereveard
7 hours ago
the ecosystem obsession with public benchmarks comes from the fact that running benchmark costs, and labs don't test on any given private benchmark
but yeah you're correct anyone optimizing for public-bench rank instead of their own task-distribution eval has been pointing at the wrong thing for a while
still I guess useful signal to know which one model to consider, negative signal is still signal, assuming everyone is gaming benchmark in certain ways, lack of performance do result in a real workload effect
No comments yet
Contribute on Hacker News ↗