← Back to context

Comment by NitpickLawyer

24 days ago

It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...

Sure, I agree! I did not mean to see better results because LLMs improved significantly in their visual-spatial reasoning, but simply because I expected more people drawing SVGs of pelicans on bikes and having more LLMs ingesting them. This is what I find a bit surprising.