Comment by thatwasunusual

12 hours ago

> It's not necessarily the best benchmark, it's a popular one, probably because it's funny.

> Yes it's like the wine glass thing.

No, it's not!

That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) with regard to something that is realistic and something that is unrealistic?

I just don't get it.

> the wine glass scenario is a _realistic_ scenario

It is unrealistic because if you go to a restaurant, you don't get served a glass like that. Filling a wine glass that way is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying).

A pelican riding a bike, on the other hand, is a realistic scenario because of children's TV. An example is a 1950s animation/comic involving a pelican [1].

[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...

  • A better reason why wine glasses are not filled like that is that wine glasses are designed to capture the aroma of the wine.

    Since people look at a glass of wine and judge how much "value" they got based partly on how much wine it looks like, many bars and restaurants choose bad wine glasses (for the purpose of enjoying wine) that are smaller and thus can be filled more.

If the thing we're measuring is the ability to write code, reason visually, and extrapolate to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code to generate a strange image that it wouldn't have seen in its training data?