Comment by minimaxir

16 hours ago

So during my Nano Banana Pro experiments I wrote a very fun prompt that tests the ability for these image generation models to follow heuristics, but still requires domain knowledge and/or use of the search tool:

    Create a 8x8 contiguous grid of the Pokémon whose National Pokédex numbers correspond to the first 64 prime numbers. Include a black border between the subimages.

    You MUST obey ALL the FOLLOWING rules for these subimages:
    - Add a label anchored to the top left corner of the subimage with the Pokémon's National Pokédex number.
      - NEVER include a `#` in the label
      - This text is left-justified, white color, and Menlo font typeface
      - The label fill color is black
    - If the Pokémon's National Pokédex number is 1 digit, display the Pokémon in a 8-bit style
    - If the Pokémon's National Pokédex number is 2 digits, display the Pokémon in a charcoal drawing style
    - If the Pokémon's National Pokédex number is 3 digits, display the Pokémon in a Ukiyo-e style

The NBP result is here, which got the numbers, corresponding Pokemon, and styles correct, with the main point of contention being that the style application is lazy and that the images may be plagiarized: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxaerni...

Running that same prompt through gpt-2-image high gave an...interesting contrast: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxaerni...

It did more inventive styles for the images that appear to be original, but:

- The style logic is by row, not raw numbers and are therefore wrong

- Several of the Pokemon are flat-out wrong

- Number font is wrong

- Bottom isn't square for some reason

Odd results.

Prompts like this feel like it's using the wrong abstraction. The "obvious" thing to do with something like this would be to generate some code that generates the image and then run that code.

Inspired by this, I tried something much simpler. I asked it to draw 12 concentric circles. With three tries it always drew 10 instead. https://chatgpt.com/share/69e87d08-5a14-83eb-9a3b-3a8eb14692...

  • I think prompts like this are where agentic workflows come in to play. If you asked it to do generate the first 64 prime numbers, AI tools could do that. If you asked it to draw a charcoal image of Pokemon 13, it could do that. If you asked it to add a white Menlo 13 on a black background to the top left corner of that image, it could do that. If you asked it to do that 63 more times, it could do those things, and if you asked it to assemble those into a grid, it could.

    It can't get that in a one-shot. Perhaps, though, it could figure out when it needs to break a problem into individual tasks to delegate to itself and assemble them at the end.

This is an amazing test and it's kinda' funny how terrible gpt-2-image is. I'd take "plagiarized" images (e.g. Google search & copy-paste) any day over how awful the OpenAI result is. Doesn't even seem like they have a sanity checker/post-processing "did I follow the instructions correctly?" step, because the digit-style constraint violation should be easily caught. It's also expensive as shit to just get an image that's essentially unusable.

For what it's worth, NBP made some mistakes too.

Artistic oddities aside (why are the 8-bit sprites 16-bit, why do the charcoal drawings have colour, why does the art of specifically the Gen 1 Pokemon look so off.), 271 is Lombre, not Lotad.

Why would you consider this a good prompt?

  • Because both Nano Banana Pro and ChatGPT Images 2.0 have touted strong reasoning capabilities, and this particular prompt has more objective, easy-to-validate criteria as opposed to the subjective nature of images.

    I have more subjective prompts to test reasoning but they're your-mileage-may-vary (however, gpt-2-image has surprisingly been doing much better on more objective criteria in my test cases)

  • [flagged]

banana Pro gets the logic and punts on the art; gpt-2-image gets the art and punts on the logic. Feels like instruction-following and creativity sit on opposite ends of the same slider.

Even a few months ago, ChatGPT/Sora's image generation performed better than Gemini/Nano Banana for certain weird prompts:

Try things like: "A white capybara with black spots, on a tricycle, with 7 tentacles instead of legs, each tentacle is a different color of the rainbow" (paraphrased, not the literal exact prompt I used)

Gemini just globbed a whole mass of tentacles without any regards to the count

Prob a very unscientific way to test an image model. This would me likely because they have the reasoning turned down and let its instant output takeover

  • There's no good scientific way to test a closed-source model with both nondeterministic and subjective output.

    This example image was generated using the API on high, not the low reasoning version. (it is slow and takes 2 minutes lol)

  • If the results are quantifiable/objective and repeatable it's scientific, how is it not scientific?

    The reasoning amount is part of the evaluation isn't it?