VibeBench: Measuring 1k Engineers' Opinions of New Models

3 days ago (vibebench.standardagents.ai)

"Published benchmarks are gamed, optimized, and overfit, and no longer yield a useful signal."

Is this true?

Either way, I love this concept!