Comment by simonw
18 hours ago
I've been trying out the new model like this:
OPENAI_API_KEY="$(llm keys get openai)" \
uv run https://tools.simonwillison.net/python/openai_image.py \
-m gpt-image-2 \
"Do a where's Waldo style image but it's where is the raccoon holding a ham radio"
Code here: https://github.com/simonw/tools/blob/main/python/openai_imag...
Here's what I got from that prompt. I do not think it included a raccoon holding a ham radio (though the problem with Where's Waldo tests is that I don't have the patience to solve them for sure): https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a...
I just got a much better version using this command instead, which uses the maximum image size according to https://github.com/openai/openai-cookbook/blob/main/examples...
https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a... - I found the raccoon!
I think that image cost 40 cents.
Fed it into a fresh Claude Code max-effort session with: "Inspect waldo2.png, and give me the pixel location of a raccoon holding a ham radio." It sliced the image into small sections and gave:
"Found the raccoon holding a ham radio in waldo2.png (3840×2160).
Which is correct!
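Claude's slice-into-sections trick is easy to reproduce locally. Here's a minimal sketch of the tiling math (pure Python; the `tile` and `overlap` sizes are arbitrary assumptions, and the overlap is there so a raccoon sitting on a seam isn't cut in half):

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Compute (left, top, right, bottom) crop boxes that cover a large
    image with overlapping tiles. Pass each box to PIL's Image.crop()
    and send the resulting tile to a vision model at full resolution."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width), min(top + tile, height)))
    return boxes

# The 3840x2160 image above splits into a 5x3 grid of 15 overlapping tiles.
print(len(tile_boxes(3840, 2160)))  # → 15
```

Each tile can then be searched independently, and any hit mapped back to full-image pixel coordinates by adding that tile's (left, top) offset.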
I had one problem: finding the raccoon. Now I have two: finding the red-and-white striped souvenir umbrella, and finding the raccoon.
4 replies →
We would need a larger sample size than just myself, but the raccoon was in the very first spot I looked. Found it literally immediately, as if that's where my eyes naturally gravitated to first. Hopefully that's just luck and not an indictment of the image-creating ability, as if there is some element missing from this "Where's Waldo" image, that would normally make Waldo hard to find.
1 reply →
Funny how it can look convincing from far away but once you zoom in you find out most characters have a mix of leprosy and skin cancer.
A startling number of people either have no arms, one arm, a half of an arm, or a shrunken arm; how odd!
To be fair, the average person has fewer than two arms.
2 replies →
There is a leg that sprouts into part of a bush; perhaps that's where people's legs are disappearing to.
This is why they're congregating around the first aid tent and the lost and found.
Finding the raccoon was instant. Finding all the weird AI artifacts is more fun. It's quite fascinating really. As usual it looks impressive at a glance but completely falls apart on closer inspection. I also didn't find any jokes, unless maybe the bridge to nowhere or finger posts pointing both ways counts?
The faces... that's nice, it turned a kids' book into an abomination.
By image generation standards this is a ridiculously good result. No surprise that people instantly find the new limits, but they are new limits.
2 replies →
It's interesting that the raccoon is well defined because it was part of the request, but none of the other fauna are.
It's interesting: zoomed out it kind of looks OK; zoomed in... oh my.
The real NFTs were the images we generated along the way
The people in this image remind me of the early days of This Person Does Not Exist, in the best way
fair point, also "this raccoon does not exist"
I tried it on the ChatGPT web UI and it also worked, although the ham radio looks like a handbag to me.
https://postimg.cc/wyxgCgNY
Nice, enjoyed the image as someone who has been to the events. But also easy raccoon placement :)
mmmm yummy OSLS?
Can it generate a non-Halloween version though?
This lower-is-better, danse-macabre, nightmare-inducing ratio feels like an interesting proxy for model capability.
I found it on the 2nd image! On the 1st one not yet...
Cost me < 1 cent - https://elsrc.com/elsrc/waldo/wojak.jpg
And this medium-quality, high-resolution one, https://elsrc.com/elsrc/waldo/10_wojaks.jpg, was 13 cents
p.s. aaaand that's a soft launch of my SaaS above: you can replace wojak.jpg with anything you want and it will paint that. It basically appends to a prompt defined in elsrc's dashboard. Hopefully a more sane way to manage genai content. Be gentle to my server, HN!
> I think that image cost 40 cents.
Kinda made me sad assuming the author didn't license anything to OpenAI.
I recognize it could revert (99% of?) progress if all the labs moved to consent-based training sets exclusively, but I can't think of any other fair way.
$.40 does not represent the appropriate value to me considering the desirability of the IP and its earning potential in print and elsewhere. If the world has to wait until it’s fair, what of value will be lost? (I suppose this is where the big wrinkle of foreign open weight models comes in.)
License what? The concept of a hidden object search? The only stylistic similarity here is the viewing angle. Where’s Waldo comics are flat, brightly colored line drawings that look nothing like this at all.
1 reply →
> though the problem with Where's Waldo tests is that I don't have the patience to solve them for sure
I see an opportunity for a new AI test!
There have already been several attempts to procedurally generate Where’s Waldo? style images since the early Stable Diffusion days, including experiments that used a YOLO filter on each face and then processed them with ADetailer.
It's a difficult test for genai to pass. As I mentioned in a different thread, it requires a holistic understanding (in that there can only be one Waldo Highlander style), while also holding up to scrutiny when you examine any individual, ordinary figure.
I've actually been feeding them into Claude Opus 4.7 with its new high resolution image inputs, with mixed results - in one case there was no raccoon but it was SURE there was and told me it was definitely there but it couldn't find it.
Really hard to look at these images given how inhuman the humans look. A few are OK, but a lot are disfigured or missing parts, and it's hard to find a raccoon in here.
Thanks for the image, I will see their faces in my nightmares.
This happens all too frequently when you ask a GenAI model to create an image with a large crowd, especially a “Where’s Waldo?” style scene, where by definition you’re going to be examining individual faces very closely.
What about the faces of the people ChatGPT killed?
Like... this has things that AI will seemingly always be terrible at?
At some point the level of detail is utter garbo and always will be. An artist who was thoughtful could have some mistakes but someone who put that much time into a drawing wouldn't have:
- Nightmarish screaming faces on most people
- A sign that seemingly points in both directions, or in the incorrect one, for a lake and a first aid tent that doesn't exist
- A dog in bottom left and near lake which looks like some sort of fuzzy monstrosity...
It looks SO impressive before you try to take in any detail. The hand selected images for the preview have the same shit. The view of musculature has a sternocleidomastoid with no clavicle attachment. The periodic table seems good until you take a look at the metals...
We're reconfiguring all of our RAM & GPUs and wasting so much water and electricity for crappier where's Waldos??
> AI will seemingly always be ...
You do realize that the whole image generation field is barely 10 years old?
I remember how I was able to generate mnist digits for the first time about 10 years ago - that seemed almost like magic!
The second 4K image definitely has a raccoon on the left there! Nice.
That is a devilishly difficult prompt for current diffusion tasks. Kudos.
Haha, took me a while to notice that one of the buildings is labelled 'Ham radio'
Damn. There’s a fun game app to make here ^^
Is there? The moment you look closely at the puzzle (which is... the whole point of Where's Waldo), you notice all the deformities and errors.
Yes, it’s not there yet, but nothing unsolvable. The first thing that comes to mind would be generating a smaller portion at the same resolution, then expanding through tiling (although one might need to use another service & model for this), like we used to do with Stable Diffusion years ago.
Another option would be generating these large images, splitting them into grids, and using inpainting on each "tile" to improve the details. Basically the reverse of the first one.
Both significantly increase costs, but for the second one having what Images 2.0 can produce as an input could help significantly improve the overall coherence.
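For the first option (generate small, then grow the canvas by outpainting), here's a rough way to budget the number of passes, assuming a hypothetical fixed outpainting margin added per side per pass (real services differ):

```python
import math

def expansion_steps(start, target, margin=256):
    """Passes needed to grow a (w, h) canvas to the target size when each
    outpainting pass adds `margin` pixels to every side. The 256px margin
    is an arbitrary assumption."""
    steps = 0
    for s, t in zip(start, target):
        steps = max(steps, math.ceil(max(t - s, 0) / (2 * margin)))
    return steps

# Growing a 1024x1024 seed to the 3840x2160 size mentioned above:
print(expansion_steps((1024, 1024), (3840, 2160)))  # → 6
```

Width dominates here (2816 extra pixels at 512 per pass), which is part of why tiled outpainting of very wide scenes gets expensive quickly.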
Yes, it sounds more like a fun research project.
I see the raccoon
5.4 thinking says "Just right of center, immediately to the right of the HAM RADIO shack. Look on the dirt path there: the raccoon is the small gray figure partly hidden behind the woman in the red-and-yellow shirt, a little above the man in the green hat. Roughly 57% from the left, 48% from the top."
(I don't think it's right).
I tried
> please add a giant red arrow to a red circle around the raccoon holding a ham radio or add a cross through the entire image if one does not exist
and got this. I'm not sure I know what a ham radio looks like, though.
https://i.ritzastatic.com/static/ffef1a8e639bc85b71b692c3ba1...
Also, the raccoon it circled isn't in the original.
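An alternative that avoids regenerating (and hallucinating) anything: keep the original image untouched and draw the annotation locally. Here's a sketch that turns a "X% from the left, Y% from the top" guess, like the one quoted above, into the bounding box Pillow's `ImageDraw.ellipse()` expects (the radius is an arbitrary assumption):

```python
def circle_box(width, height, pct_x, pct_y, radius=80):
    """Convert fractional 'from the left / from the top' coordinates into
    a pixel center plus the (left, top, right, bottom) bounding box that
    ImageDraw.ellipse() takes."""
    cx, cy = round(width * pct_x), round(height * pct_y)
    return (cx, cy), (cx - radius, cy - radius, cx + radius, cy + radius)

# "57% from the left, 48% from the top" on the 3840x2160 image:
center, box = circle_box(3840, 2160, 0.57, 0.48)
print(center)  # → (2189, 1037)
```

With Pillow installed, `ImageDraw.Draw(img).ellipse(box, outline="red", width=8)` then renders the circle without touching any other pixel.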
4 replies →
That's excellent. I added it to my post: https://simonwillison.net/2026/Apr/21/gpt-image-2/#update-as...
Hilarious - I tried and got the same thing.
There was a very large bear in the first image; when asked to circle the raccoon, it just turned the bear into a giant raccoon and circled it.