For every combination of animal and vehicle? Very unlikely.
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
Why would they train on that? Why not just hire someone to make a few examples?
I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
More likely you would just train the model to emit SVG for a description of a scene, and create the training data from raster images.
You can easily build an RLAIF loop:
- Take a list of n animals * m vehicles
- Ask an LLM to generate SVG for each of these n*m options
- Render a PNG from each SVG
- Ask a model with vision to grade the result
- Update your weights accordingly
No need for a human to draw the dataset, no need for a human to evaluate.
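A rough sketch of the data-collection half of that loop, in Python. Everything here is an assumption for illustration: `build_prompts`, `collect_rewards`, and the three callables (`generate_svg`, `render_png`, `grade_image`) are hypothetical names standing in for whatever LLM, rasterizer, and vision model you actually use. The weight update is left out because it depends entirely on your training setup.

```python
from itertools import product
from typing import Callable, List, Tuple


def build_prompts(animals: List[str], vehicles: List[str]) -> List[str]:
    """Cross every animal with every vehicle, giving n * m prompts."""
    return [
        f"Generate an SVG of this scene: {animal} on a {vehicle}"
        for animal, vehicle in product(animals, vehicles)
    ]


def collect_rewards(
    prompts: List[str],
    generate_svg: Callable[[str], str],          # hypothetical: LLM call returning SVG text
    render_png: Callable[[str], bytes],          # hypothetical: SVG -> PNG rasterizer
    grade_image: Callable[[str, bytes], float],  # hypothetical: vision model score
) -> List[Tuple[str, str, float]]:
    """Run generate -> render -> grade once per prompt."""
    results = []
    for prompt in prompts:
        svg = generate_svg(prompt)
        png = render_png(svg)
        reward = grade_image(prompt, png)
        results.append((prompt, svg, reward))
    return results
```

The returned `(prompt, svg, reward)` tuples are what you would feed into whatever update step you use. Passing the model calls in as plain functions keeps the loop testable with stubs before any real model is wired in.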
I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.
You can always ask for a tyrannosaurus driving a tank.