Comment by simonw

20 hours ago

I've been trying out the new model like this:

  OPENAI_API_KEY="$(llm keys get openai)" \
    uv run https://tools.simonwillison.net/python/openai_image.py \
    -m gpt-image-2 \
    "Do a where's Waldo style image but it's where is the raccoon holding a ham radio"

Code here: https://github.com/simonw/tools/blob/main/python/openai_imag...

Here's what I got from that prompt. I do not think it included a raccoon holding a ham radio (though the problem with Where's Waldo tests is that I don't have the patience to solve them for sure): https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a...

I just got a much better version using this command instead, which uses the maximum image size according to https://github.com/openai/openai-cookbook/blob/main/examples...

  OPENAI_API_KEY="$(llm keys get openai)" \
    uv run 'https://raw.githubusercontent.com/simonw/tools/refs/heads/main/python/openai_image.py' \
    -m gpt-image-2 \
    "Do a where's Waldo style image but it's where is the raccoon holding a ham radio" \
    --quality high --size 3840x2160

https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a... - I found the raccoon!

I think that image cost 40 cents.
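For reference, a minimal Python sketch of roughly what a script like this would send through the OpenAI SDK: it just packages the model, size, and quality values from the commands above into keyword arguments for an `images.generate` call. The helper function is illustrative, and whether the live API accepts this exact model name and size is an assumption taken from the commands, so the actual call is left commented out:

```python
# Hedged sketch: assemble request parameters for an image-generation call.
# "gpt-image-2" and "3840x2160" are taken from the commands above; whether
# the live API accepts them is an assumption, not something verified here.

def image_request_params(prompt, model="gpt-image-2",
                         size="3840x2160", quality="high"):
    """Build the keyword arguments for client.images.generate(...)."""
    return {"model": model, "prompt": prompt, "size": size, "quality": quality}

params = image_request_params(
    "Do a where's Waldo style image but it's "
    "where is the raccoon holding a ham radio"
)

# With the openai package installed, these would be passed straight through:
#   from openai import OpenAI
#   client = OpenAI()
#   result = client.images.generate(**params)
```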

> though the problem with Where's Waldo tests is that I don't have the patience to solve them for sure

I see an opportunity for a new AI test!

  • There have already been several attempts to procedurally generate Where’s Waldo? style images since the early Stable Diffusion days, including experiments that used a YOLO filter on each face and then processed them with ADetailer.

    It's a difficult test for GenAI to pass. As I mentioned in a different thread, it requires a holistic understanding (in that there can only be one Waldo, Highlander-style), while also holding up to scrutiny when you examine any individual, ordinary figure.

  • I've actually been feeding them into Claude Opus 4.7 with its new high-resolution image inputs, with mixed results: in one case there was no raccoon, but the model was SURE there was one, told me it was definitely there, and yet couldn't find it.

Really hard to look at these images given how un-human-like the humans are. A few are OK, but a lot are disfigured or missing parts, and it's hard to find a raccoon in here.

Thanks for the image, I will see their faces in my nightmares.

  • This happens all too frequently when you ask a GenAI model to create an image with a large crowd, especially a "Where's Waldo?"-style scene, where by definition you're going to be examining individual faces very closely.

Like... this has things that AI will seemingly always be terrible at?

At some point the level of detail is utter garbo and always will be. A thoughtful artist might make some mistakes, but someone who put that much time into a drawing wouldn't have:

- Nightmarish screaming faces on most people

- A sign that seemingly points in both directions, or in the wrong one, for a lake, plus a first aid tent that doesn't exist

- A dog in the bottom left and another near the lake that look like some sort of fuzzy monstrosity...

It looks SO impressive before you try to take in any detail. The hand-selected images for the preview have the same shit. The view of musculature has a sternocleidomastoid with no clavicle attachment. The periodic table seems good until you take a look at the metals...

We're reconfiguring all of our RAM & GPUs and wasting so much water and electricity for crappier Where's Waldos??

  • AI will seemingly always be ...

    You do realize that the whole image generation field is barely 10 years old?

    I remember how I was able to generate MNIST digits for the first time about 10 years ago - that seemed almost like magic!

Damn. There’s a fun game app to make here ^^

  • Is there? The moment you look closely at the puzzle (which is... the whole point of Where's Waldo), you notice all the deformities and errors.

    • Yes, it's not there yet. But nothing unsolvable. The first thing that comes to mind would be generating a smaller portion at the same resolution, then expanding through tiling (although one might need to use another service & model for this), like we used to do with Stable Diffusion years ago.

      Another option would be generating these large images, splitting them into grids, and using inpainting on each "tile" to improve the details. Basically the reverse of the first one.

      Both significantly increase costs, but for the second approach, having what Images 2.0 can produce as an input could significantly improve the overall coherence.
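As a rough sketch of the grid-splitting step described above: the function below computes overlapping tile boxes for a 3840x2160 image, each of which could then be cropped (e.g. with Pillow), inpainted individually, and pasted back with blending over the overlap region. The tile size and overlap values are arbitrary assumptions, not anything a particular service requires:

```python
# Hedged sketch of the tile-then-inpaint idea: cover a large image with
# overlapping tiles so each one can be re-rendered at full detail, then
# pasted back. Tile and overlap sizes here are illustrative choices.

def grid_tiles(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes covering the image,
    stepped so adjacent tiles share an `overlap`-pixel seam for blending."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes

# The 4K size from the earlier command splits into a 5x3 grid of tiles.
tiles = grid_tiles(3840, 2160)
```

Each box would be cropped out, sent to an inpainting model, and composited back, ideally with a feathered mask across the overlap so seams don't show.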

5.4 thinking says "Just right of center, immediately to the right of the HAM RADIO shack. Look on the dirt path there: the raccoon is the small gray figure partly hidden behind the woman in the red-and-yellow shirt, a little above the man in the green hat. Roughly 57% from the left, 48% from the top."

(I don't think it's right).
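One way to sanity-check a claim like that is to convert the percentages into pixel coordinates and crop out that spot for a closer look. A small sketch, assuming the 3840x2160 image from the earlier command; the 200-pixel window is an arbitrary choice:

```python
# Convert the model's claimed "57% from the left, 48% from the top" into
# pixel coordinates on a 3840x2160 image, and build a crop box around it.
# The 200px half-window is an arbitrary zoom level, not from the thread.

def percent_to_crop(px, py, width, height, window=200):
    """Return a (left, top, right, bottom) crop box centred on the
    (px, py) fractional position, clamped to the image bounds."""
    cx, cy = int(px * width), int(py * height)
    return (max(cx - window, 0), max(cy - window, 0),
            min(cx + window, width), min(cy + window, height))

box = percent_to_crop(0.57, 0.48, 3840, 2160)
# box could be passed to Pillow's Image.crop() to eyeball the claimed spot.
```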