Comment by shepherdjerred

6 days ago

> and there’s zero chance any AI lab would train a model for such a ridiculous task.

I'm not sure that's true anymore, considering how popular Simon's blog is.

shepherdjerred

> So maybe the AI labs have been paying attention after all!

> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.

As acknowledged in the article.

kzrdude 6 days ago
Gemini 3.1 basically takes it home on that benchmark anyway, so it's done.
- sunaookami 5 days ago
  
  Gemini is heavily benchmaxxed and sucks at agentic coding, so no surprise.

nickvec 6 days ago

Simon mentions further along in his article that, given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!

aaronbrethorst 6 days ago

Banana man on the Segway

simonw 6 days ago

That bit probably works better in the talk, it was a setup for a joke later on.

muzani 6 days ago

It's practically a benchmark now. Some friends have been specifically training models to count the R's in "strawberry".