
Comment by simonw

1 day ago

I set it up as a joke, to make fun of all of the other benchmarks. It ended up being a surprisingly good measure of the quality of the model for other tasks (up to a certain point, at least), though I've never seen a convincing argument as to why.

I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

It should not be treated as a serious benchmark.

Reply:

How can you say "it ended up being a surprisingly good measure of the quality of the model for other tasks" and also "It should not be treated as a serious benchmark" in the same comment?

If it is indeed a good measure of the quality of the model (hint: it's not), then, logically, it should be taken seriously.

This is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes, whether you like it or not, Simon, that is what you are now) are all too capable of.

Reply:

What it has going for it is human interpretability.

Anyone can look at the output and decide whether it's a good picture or not. A numeric benchmark, by contrast, doesn't tell you much unless you're already familiar with how it's constructed.
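
To make the interpretability point concrete, here is a minimal sketch of how such an eyeball benchmark can be run: prompt a model for an SVG, save the raw output, and let a human judge the picture. This assumes the official `openai` Python client; the model name is a placeholder, and the prompt is the pelican-riding-a-bicycle example from the linked talk, not simonw's exact harness.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Example prompt from the linked talk; swap in any drawing task you like.
prompt = "Generate an SVG of a pelican riding a bicycle"

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: use whichever model you're evaluating
    messages=[{"role": "user", "content": prompt}],
)

# Write the raw output to disk so a human can open it and judge the picture.
with open("pelican.svg", "w") as f:
    f.write(response.choices[0].message.content or "")
```

In practice you'd also strip any markdown fences the model wraps around the SVG before opening the file in a browser; the point is that the final judgment is a human looking at a picture, not a score.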