Comment by sempron64
6 days ago
The pelican has looked very same-y across all frontier models, same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways mirroring existing AI pelicans on the internet.
Was it ever a good test? How do you even objectively assess what a good pelican on a bike is anyway?
SVG generation is a good test because it's extremely easy to subjectively assess with visual reasoning where humans are strong. However, pelican on a bike specifically may be overused at this point.
The "big beak!" comment in the SVG source makes me think it's definitely a gamed "benchmark" at this point.
Do you think the models are ready for the next level? I believe that would be: Pelican feeding Spaghetti to Will Smith.
Variations of this comment have been posted for over a year. The pelican has now morphed into part of HN culture rather than a legitimate benchmark, but it's still valuable as a meme.
It is more an example of gaming (the HN system) than a meme.
I'd be very surprised if this is in the training data given that most models mess it up to this day. E.g. look at the ones from Opus.
I really don't understand what's interesting about this test or why it is always on top.
It's funny.
Same reason you would always see the same top comments on reddit during a certain era.
It has become a funny meme, much like "My hovercraft is full of eels!"
Because you still can't ask LLMs to port DOOM to hardware X or Y.
It's a meme, and HN loves upvoting memes. Just like Reddit!
The ultimate measure of an LLM is whether it can produce a capable image of a pelican riding a bicycle. All other use cases are but a distraction!
Do you seriously have a dedicated “bad takes on AI” HN account?
yeah, although I do combine it with "replies to snarky questions" for efficiency
True that