Comment by vunderba
19 hours ago
OpenAI’s gpt-image-1.5 and Google’s NB2 have been pretty much neck and neck on my comparison site, which focuses heavily on prompt adherence, with both hovering around a 70% success rate across the generative and editing prompts. The caveat is that Gemini has always had the edge in visual fidelity.
That being said, gpt-image-1.5 was a big leap in visual quality for OpenAI and eliminated most of the classic issues of its predecessor, including things like the “piss filter.”
I’ll update this comment once I’ve finished running gpt-image-2 through both the generative and editing comparison charts on GenAI Showdown.
Since the advent of NB, I’ve had to ratchet up the difficulty of the prompts, especially in the text-to-image section. The best models now score around 70%, successfully completing 11 out of 15 prompts.
For reference, here’s a comparison of ByteDance, Google, and OpenAI on editing performance:
https://genai-showdown.specr.net/image-editing?models=nbp3,s...
And here’s the same comparison for generative performance:
https://genai-showdown.specr.net/?models=s4,nbp3,g15
UPDATES:
gpt-image-2 has already managed to overcome one of the so‑called “model killers” on the test suite: the nine-pointed star.
Results are in for the generative (text-to-image) capabilities: gpt-image-2 scored 12 out of 15 on the text-to-image benchmark, edging out the previous best models by a single point. It still fails on the following prompts:
- A photo of a brightly colored coral snake but with the bands of color red, blue, green, purple, and yellow repeated in that exact order.
- A twenty-sided die (D20) with the first twenty prime numbers (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71) on the faces.
- A flat earth-like planet which resembles a flat disc is overpopulated with people. The people are densely packed together such that they are spilling over the edges of the planet. Cheap "coastal" real estate property available.
All Models:
https://genai-showdown.specr.net
Just Gpt-Image-1.5, Gpt-Image-2, Nano-Banana 2, and Seedream 4.0
Very useful website. Would you have insight into what models are best at editing existing images?
I often have to make very specific edits while keeping the rest of the image intact and haven't yet found a good model. These are typically abstract images for experiments.
I asked gpt-image-2 to recolor specific scales of your Seedream 4 snake and change the shape of others. It did very poorly.
OpenAI actually has really good adherence, but it occasionally introduces something almost equivalent to its own "tone mapping", which makes hyper-localized edits frustrating.
I don’t know how much work it is for you, but one thing a lot of people do, myself included, is take the original image, make a change to it using something like NB, then paste that as the topmost layer in something like Krita/Pixelmator. After that, we mask and feather in only the parts we actually want to change. It doesn’t always work if the model changes the overall color balance or filters out certain hues, and it can be a real pain, but it does the job in some cases.
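If you’d rather script that workflow than do it by hand, here’s a minimal sketch of the same idea using Pillow. The filenames, the edit region, and the feather radius are all placeholders I made up, not anything from the site above.

```python
# Minimal sketch of the mask-and-feather compositing described above,
# using Pillow instead of Krita/Pixelmator. Filenames, the edit region,
# and the feather radius are hypothetical placeholders.
from PIL import Image, ImageDraw, ImageFilter

original = Image.open("original.png").convert("RGB")
edited = Image.open("nb_edit.png").convert("RGB").resize(original.size)

# White areas of the mask take pixels from the edited image;
# black areas keep the original untouched.
mask = Image.new("L", original.size, 0)
draw = ImageDraw.Draw(mask)
draw.ellipse((220, 140, 480, 360), fill=255)  # region we actually want changed

# Feather the mask edge so the pasted region blends in instead of showing a seam.
mask = mask.filter(ImageFilter.GaussianBlur(radius=12))

result = Image.composite(edited, original, mask)
result.save("composited.png")
```

As noted, this only works cleanly when the model hasn’t shifted the global color balance; otherwise the masked region won’t match the surrounding pixels no matter how much you feather.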
The Flux models (like Kontext) are actually surprisingly good at making very minimal changes to the rest of the image, but unfortunately their understanding of complex prompts is much weaker than the closed, proprietary models.
I will say that I’ve found Gemini 3.0 (NB Pro) does a relatively decent job of avoiding unnecessary changes, sometimes even exceeding the more recent NB2, and it scored quite well on the comparative image-editing benchmarks:
https://genai-showdown.specr.net/image-editing
Thanks. I will try this! I need to read up on how to work with vision models for both generation and understanding.
Why does Gemini 3.1 get a pass on the flat earth one for the same reasons gpt-image-2 gets a fail? Gemini has all sorts of random body parts and limbs, etc.
That's a mistake~ None of the models successfully passed the Flat Earth composition test. I've updated the passing criteria to be more explicit as well. Thanks for catching that!
It'd be interesting if you could add HunyuanImage-3 to the competition. It's better than Z-Image at almost everything I've thrown at it.
It can be (slowly) run at home, but needs 96GB RTX 6000-level hardware so it is not very popular.
I’ll have to give it another try. Its predecessor, Hunyuan Image 2.0, scored pretty poorly when I tested it last year (2 out of 15), so it'll be interesting to see how much it has improved.
Here's ZiT, Gpt-Image-2, and Hunyuan Image 2 for reference:
https://genai-showdown.specr.net/?models=hy2,g2,zt
Note: It won't show up in some of the newer image comparisons (Angelic Forge, Flat Earth, etc.) because it's been deprecated for a while, but in the tests where it was used (Yarrctic Circle, Not the Bees, etc.) it's pretty rough.
It does quite a bit better than 2.0, I think. Or at least it may be stylistically different enough to justify a rematch against the others.
Ring toss: https://i.imgur.com/Zs6UNKj.png (arguably a pass)
9-pointed star: https://i.imgur.com/SpcSsSv.png (star is well-formed but only has 6 points)
Mermaid: https://i.imgur.com/R6MbMPX.png (fail, and I can't get Imgur to host it for some reason even though it's SFW)
Octopus: https://i.imgur.com/JTVH7xy.png (good try, almost a pass, but socks don't cover the ends of all the tentacles)
Above are one-shot attempts with seed 42.
Where can I see the actual prompts and follow ups you fed each model?
So the prompts are tuned and adjusted on a per-model basis. If you look at the number of attempts, each model receives a specific prompt variation. This honestly isn't as much of an issue these days because SOTA models' natural language parsing (particularly the multimodal ones) has eliminated a lot of the byzantine syntax requirements of the SD/SDXL days.
The template prompt seen in each comparison gets adjusted by a guided LLM with a fine-tuned system prompt for rewriting prompts. The goal is to foster greater diversity while preserving intent, so the image model has a better chance of getting the image right.
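To make that concrete, here's a rough sketch of what a rewrite step like that could look like. The OpenAI client, the model name, and the system prompt are my assumptions for illustration, not the actual pipeline behind the site.

```python
# Hypothetical sketch of a guided prompt-rewriting step like the one described
# above. The OpenAI client, model choice, and system prompt are assumptions,
# not the site's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_SYSTEM_PROMPT = (
    "You rewrite image-generation prompts. Vary the wording and structure to "
    "add diversity, but preserve every constraint in the original prompt "
    "exactly: counts, colors, ordering, and spatial relationships."
)

def rewrite_prompt(template_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": REWRITE_SYSTEM_PROMPT},
            {"role": "user", "content": template_prompt},
        ],
    )
    return response.choices[0].message.content

print(rewrite_prompt("A nine-pointed star drawn in gold ink on black paper."))
```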
As for your suggestion to post all the raw prompts: that's actually a great idea, and I wish I'd thought of it sooner. If you multiply it out, there are 15 distinct test cases against 22 models at this point, each with an average of about 8 attempts, so we're talking about thousands of prompts, many of which are scattered across my hard drive. I might try to do this as a future follow-up.
Shouldn’t every model get the same prompt? Seems a bit weird, especially when you can’t see the prompts that were used.