Comment by camdenreslink

1 day ago

> The current benchmarks are good for comparing between models, but not for measuring absolute ability.

Not even that; see LMArena. They vaguely gesture in the general direction of a model being good, but between contamination and issues with scoring they're little more than a vibe check.

But if the test metrics are fundamentally flawed, they might not be useful even for relative comparisons. Like if I told you that Model A scores 10x as many blork points as Model B, I don't know how you translate that into insights about performance on real-world scenarios.

I don't really buy that they're even necessarily useful for comparing models. In the example from the article, if Model A says "48 + 6 minutes" and gets marked correct, and Model B says "63 minutes" (the correct answer) and also gets marked correct, the test will report them as equivalent on that axis, when in fact one gave a completely nonsense answer.
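
To make that failure mode concrete: the article doesn't show the benchmark's actual grading code, but here's a minimal sketch of one plausible lenient grader based on fuzzy string matching (the grade function and the 0.7 threshold are hypothetical, purely for illustration):

    # Hypothetical lenient grader -- not the benchmark's real code.
    # Fuzzy string matching is one common way nonsense answers slip through.
    from difflib import SequenceMatcher

    def grade(response: str, gold: str, threshold: float = 0.7) -> bool:
        """Mark a response correct if it's 'similar enough' to the gold answer."""
        ratio = SequenceMatcher(None, response.lower(), gold.lower()).ratio()
        return ratio >= threshold

    gold = "63 minutes"
    for response in ["63 minutes", "48 + 6 minutes", "54 minutes"]:
        verdict = "correct" if grade(response, gold) else "incorrect"
        print(f"{response!r}: {verdict}")

    # '63 minutes': correct
    # '48 + 6 minutes': correct   <- nonsense, but shares enough characters
    # '54 minutes': correct       <- flatly wrong; the shared "minutes" dominates

The shared unit string carries most of the similarity score, so wildly different numeric answers all clear the threshold, and the leaderboard can't tell them apart.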