Comment by yorwba

10 hours ago

There are objective ways to compare models. They involve repeated sampling and statistical analysis to determine whether the results are likely to hold up in the future or whether they're just a fluke. If you fine-tune each model to achieve its full potential on the task you expect to be giving it, the rankings produced by different benchmarks even agree to a high degree: https://arxiv.org/abs/2507.05195

The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then they chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.

LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.
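The coin-flip point can be made concrete with a minimal Monte Carlo sketch (the 0.6 pass rate is an assumed value, purely for illustration): give two *identical* models the same true pass rate and see how often one appears to "win" anyway.

```python
import random

random.seed(0)
PASS_RATE = 0.6  # assumption: both models have the same true pass rate

def gap_rate(runs, gap, trials=10_000):
    """Fraction of experiments in which two identical models
    differ by at least `gap` passes, by chance alone."""
    hits = 0
    for _ in range(trials):
        a = sum(random.random() < PASS_RATE for _ in range(runs))
        b = sum(random.random() < PASS_RATE for _ in range(runs))
        if abs(a - b) >= gap:
            hits += 1
    return hits / trials

# One run each: a "winner" appears ~48% of the time by pure chance.
print(gap_rate(runs=1, gap=1))
# 100 runs each: even a 10-point lead still occurs by chance fairly often.
print(gap_rate(runs=100, gap=10))
```

So a single run per model, as in TFA, produces a ranking that is close to noise; only repeated runs shrink the chance gap enough to distinguish models.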

Those are objective metrics, not an objective way to compare. The comparison still depends on which metrics you choose to include, and that selection is subjective.

  • That's exactly why there's a ton of different benchmarking suites used for evaluating hardware performance.

    I reckon we'll have similar suites comparing different aspects of models.

    And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, as happened before with hardware. Some say that's already happening with the pelican test.

    • > I reckon we'll have similar suites comparing different aspects of models.

      The problem is that hardware benchmarks are harder to game. Yes, a hardware manufacturer can make driver tweaks so that, say, a particular game runs better, but the benchmark is still representative of the workload users actually face, and they can't change the most important part, the hardware itself: there's no benchmark-gimmicking your way through hardware design.

      Meanwhile, in LLM land the game is to tune for whatever set of benchmarks is currently popular, while the user experience is only vaguely related to those results.

Fine-tuning for a specific task is even less realistic than the benchmarks shown in TFA.

Most people who have computers could run inference for even the biggest LLMs, albeit very slowly when they do not fit in fast memory.

On the other hand, training or even fine-tuning requires both more capable hardware and more competent users. Moreover, the effort may not be worthwhile when diverse tasks must be performed.

Instead of attempting fine-tuning, a much simpler and more feasible strategy is to keep multiple open-weights LLMs and run them all for a given task, then choose the best solution.

This can be done at little cost with open-weights models, but it can be prohibitively expensive with proprietary models.
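The keep-several-models strategy is simple to sketch. Here is a minimal, hypothetical version: the stub "models" and the `score` function are placeholders standing in for real local inference calls and whatever quality check you apply, not real APIs.

```python
from typing import Callable

def best_of_models(prompt: str,
                   models: dict[str, Callable[[str], str]],
                   score: Callable[[str, str], float]) -> tuple[str, str]:
    """Run every model on the prompt, score each output,
    and return (model_name, best_output)."""
    outputs = {name: run(prompt) for name, run in models.items()}
    return max(outputs.items(), key=lambda kv: score(prompt, kv[1]))

# Toy usage with stub "models" (real use would call local inference instead):
models = {
    "model-a": lambda p: p.upper(),
    "model-b": lambda p: p * 2,
}
name, out = best_of_models("abc", models, score=lambda p, o: len(o))
print(name, out)  # model-b abcabc
```

With open-weights models the only extra cost is inference time; with per-token proprietary APIs, every additional model in the dict multiplies the bill.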