Comment by kennykartman

25 days ago

Ha ha, I was curious about that! I wonder if (when? if not already) some company is using some version of this in their training set. I'm still impressed by the fact that this benchmark has been out for so long and yet still produces this kind of (ugly?) result.

14 comments

NitpickLawyer 25 days ago

It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...
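A minimal sketch of that "x _verb_ on y" idea, assuming nothing beyond the Python standard library: enumerate subject/verb/vehicle combinations so a model can be probed with prompts it is unlikely to have been tuned on. The word lists and the prompt wording here are illustrative assumptions, not a fixed benchmark.

```python
from itertools import product

# Hypothetical word lists; any unusual subjects, verbs and vehicles would do.
# Articles are stored with the nouns to avoid "a owl" style mistakes.
SUBJECTS = ["a pelican", "an owl", "a crow"]
VERBS = ["riding"]
VEHICLES = ["a bicycle", "a lawnmower", "a unicycle"]


def variant_prompts(subjects=SUBJECTS, verbs=VERBS, vehicles=VEHICLES):
    """Return one SVG prompt per (subject, verb, vehicle) combination."""
    return [
        f"Generate an SVG of {subject} {verb} {vehicle}"
        for subject, verb, vehicle in product(subjects, verbs, vehicles)
    ]


prompts = variant_prompts()
for prompt in prompts:
    print(prompt)
```

If a model only looks good on the pelican/bicycle pair and falls apart on the other combinations, that gap is the signal.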

kennykartman 24 days ago

Sure, I agree! I didn't expect better results because LLMs had improved significantly in their visual-spatial reasoning, but simply because I expected more people to be drawing SVGs of pelicans on bikes and more LLMs to be ingesting them. This is what I find a bit surprising.

Sharlin 25 days ago

It could still be special-case RLHF trained, just not up to perfection.

saberience 25 days ago

Because no one cares about optimizing for this because it's a stupid benchmark.

It doesn't mean anything. No frontier lab is trying hard to improve the way its model produces SVG format files.

I would also add, the frontier labs are spending all their post-training time working on the shit that is actually making them money, i.e. writing code and improving tool calling.

The Pelican on a bicycle thing is funny, yes, but it doesn't really translate into more revenue for AI labs so there's a reason it's not radically improving over time.

obidee2 25 days ago

Why stupid? Vector images are widely used and extremely useful, both directly and for rendering raster images at different scales. It’s also highly connected with spatial and geometric reasoning and precision, which would open up a whole new class of problems these models could tackle. Sure, it’s secondary to raster image analysis and generation, but I'm curious why it would be stupid to pursue?
simonw 25 days ago
+1 to "it's a stupid benchmark".
- esafak 24 days ago
  
  You can always suggest a new one ;)
lofaszvanitt 25 days ago
It shows that these are nowhere near anything resembling human intelligence. You wouldn't have to optimize for anything if it were a general intelligence of sorts.
- CamperBob2 25 days ago
  
  Here's a pencil and paper. Let's see your SVG pelican.
storystarling 25 days ago

I suspect there is actually quite a bit of money on the table here. For those of us running print-on-demand workflows, the current raster-to-vector pipeline is incredibly brittle and expensive to maintain. Reliable native SVG generation would solve a massive architectural headache for physical product creation.

derefr 25 days ago

It’d be difficult to use in any automated process, since judging how good one of these renditions is remains very qualitative.

You could try to rasterize the SVG and then use an image2text model to describe it, but I suspect it would just “see through” any flaws in the depiction and describe it as “a pelican on a bicycle” anyway.