← Back to context

Comment by mormegil

1 day ago

Did the testing prompt for LLMs include a clause forbidding the use of any tools? If not, why are you adding it here?

The way I run the pelican on a bicycle benchmark is to use this exact prompt:

  Generate an SVG of a pelican riding a bicycle

And execute it via the model's API with all default settings, not via their user-facing interface.

Currently none of the model APIs enable tools unless you ask them to, so this method excludes the use of additional tools.

The models that are being put under the "Pelican" testing don't use a GUI to create SVGs (either via "tools" or anything else), they're all Text Generation models so they exclusively use text for creating the graphics.

There are 31 posts listed under "pelican-riding-a-bicycle" in case you wanna inspect the methodology even closer: https://simonwillison.net/tags/pelican-riding-a-bicycle/