Comment by wisty

2 months ago

It's not necessarily the best benchmark, it's a popular one, probably because it's funny.

Yes it's like the wine glass thing.

Also it's kind of got depth. Does it draw the pelican and the bicycle? Can the pelican reach the pedals? How?

I can imagine a really good AI finding a funny or creative or realistic way for the pelican to reach the pedals.

A slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.

An OK AI will draw a pelican on top of a bicycle and just call it a day.

It's not as binary as the wine glass example.

4 comments

thatwasunusual 2 months ago

> It's not necessarily the best benchmark, it's a popular one, probably because it's funny.

> Yes it's like the wine glass thing.

No, it's not!

That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) with regard to something that is realistic and something that is unrealistic?

I just don't get it.

Fnoord 2 months ago

> the wine glass scenario is a _realistic_ scenario

It is unrealistic because if you go to a restaurant, you don't get served a glass like that. It is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying) to fill a wine glass like that.

A pelican riding a bike, on the other hand, is a realistic scenario because of children's TV. Example: a 1950s animation/comic involving a pelican [1].

[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...
- mzl 2 months ago
  
  A better reason why wine glasses are not filled like that is that wine glasses are designed to capture the aroma of the wine.
  Since people look at a glass of wine and judge how much "value" they got based partly on how much wine it looks like, many bars and restaurants choose bad wine glasses (for the purpose of enjoying wine) that are smaller and thus can be filled more.
vikramkr 2 months ago

If the thing we're measuring is the ability to write code, visually reason, and handle extrapolating to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code to generate a strange image that it wouldn't have seen in its training data?