Comment by isx726552

18 hours ago

> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.

> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.

Yeah, this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.

Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations, I will consider that a huge personal win.