The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
Would it not be better to have 100 such tests "Pelican on bicycle", "Tiger on stilts"..., and generate them all for every new model but only release a new one each time. That way you could show progression across all models, attempts at benchmaxxing would be more obvious.
Given the crazy money and vying for supremacy among AI companies right now it does seem naive to belive that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards" but without a back catalog of ocelots on skateboards to publish its one datapoint and leaves the AI companies with too much plausible deniability.
The pelicans-on-bicycles is a bit of fun for you (and us!) but it has become a measure of the quality of models so its serious business for them.
There is an assymetry of incentives and high risk you are being their useful idiot. Sorry to be blunt.
Or indeed do the Markov chain conceptual slip. Pelican on bicycle, badger on stool, tiger on acid. Pelican on bicycle is definitely cooked, though: people know it and it's talked about in language.
I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.
The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
10 replies →
Would it not be better to have 100 such tests "Pelican on bicycle", "Tiger on stilts"..., and generate them all for every new model but only release a new one each time. That way you could show progression across all models, attempts at benchmaxxing would be more obvious.
Given the crazy money and vying for supremacy among AI companies right now it does seem naive to belive that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards" but without a back catalog of ocelots on skateboards to publish its one datapoint and leaves the AI companies with too much plausible deniability.
The pelicans-on-bicycles is a bit of fun for you (and us!) but it has become a measure of the quality of models so its serious business for them.
There is an assymetry of incentives and high risk you are being their useful idiot. Sorry to be blunt.
Or indeed do the Markov chain conceptual slip. Pelican on bicycle, badger on stool, tiger on acid. Pelican on bicycle is definitely cooked, though: people know it and it's talked about in language.