Comment by fooker

7 hours ago

Yeah, these benchmarks are bogus.

Every new model overfits to the latest overhyped benchmark.

Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.

But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.

  • Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.

    • Yes, because there’s value in a common reference for comparison. It helps to shed light on different models’ relative strengths and weaknesses. And, just like with performance benchmarks, you can learn to spot and read past the ways that people game their results. The danger is really more in when people who are less versed in the subject matter take what are ultimately just a semi tamed genre of sales pitch at face value.

      When such benchmarks aren’t available what you often get instead is teams creating their own benchmark datasets and then testing both their and existing models’ performance against it. Which is eve worse because they probably still the rest multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter tune their own model for the dataset but reuse previously published hyperparameters for the other models. Which gives them an unfair advantage because those hyperparameters were tuned to a doffeeent dataset and may not have even been optimizing for the same task.

> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks, there's also too many degrees of freedom in how a model is tested (harness, etc). Until there's standardized documentation enabling independent replication, it's all just benchmarketing .

  • For the current state of AI, the harness is unfortunately part of the secret sauce.