Comment by sempron64

6 days ago

The pelican has looked very same-y across all frontier models: same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways that mirror existing AI pelicans on the internet.

Was it ever a good test? How do you even objectively assess what a good pelican on a bike is anyway?

  • SVG generation is a good test because it's extremely easy to assess subjectively using visual reasoning, where humans are strong. However, pelican on a bike specifically may be overused at this point.

The "big beak!" comment in the SVG source makes me think it's definitely a gamed "benchmark" at this point.

Do you think the models are ready for the next level? I believe that would be: Pelican feeding Spaghetti to Will Smith.

Variations of this comment have been posted for over a year. The pelican has now morphed into part of HN culture rather than a legitimate benchmark, but it's still valuable as a meme.

I'd be very surprised if this is in the training data given that most models mess it up to this day. E.g. look at the ones from Opus.
