Comment by vunderba

3 months ago

Alright, results are in! I've re-run all my editing-based, adherence-related prompts through Nano Banana Pro. NB Pro managed to pass SHRDLU, the M&M Van Halen test (as verified independently by Simon), and the Scorpio street test, all of which the original NB failed.

  Model results
  1. Nano Banana Pro: 10 / 12
  2. Seedream4: 9 / 12
  3. Nano Banana: 7 / 12
  4. Qwen Image Edit: 6 / 12
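
For a quick sense of scale, the tallies above can be turned into pass rates. This is a minimal sketch that uses only the four scores listed; the `results` dict and the `pass_rate` helper are illustrative names, not anything taken from the site.

```python
# Tallies copied from the list above: (passed, total) per model.
results = {
    "Nano Banana Pro": (10, 12),
    "Seedream4": (9, 12),
    "Nano Banana": (7, 12),
    "Qwen Image Edit": (6, 12),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the share of prompts passed, as a percentage."""
    return 100.0 * passed / total


# Print models from most to fewest passes.
for model, (passed, total) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{model:<16} {passed:>2} / {total}  ({pass_rate(passed, total):.1f}%)")
```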

https://genai-showdown.specr.net/image-editing

If you just want to see how NB and NB Pro compare against each other:

https://genai-showdown.specr.net/image-editing?models=nb,nbp
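
The link above appears to select models through a comma-separated `models` query parameter. A minimal sketch of building such a link follows; only the `nb` and `nbp` IDs appear in this thread, so any other ID, and the `comparison_url` helper itself, are assumptions rather than a documented API.

```python
from urllib.parse import urlencode

# Base URL taken from the comment above.
BASE_URL = "https://genai-showdown.specr.net/image-editing"


def comparison_url(model_ids: list[str], base_url: str = BASE_URL) -> str:
    """Build a comparison link with a comma-separated `models` query parameter."""
    # safe="," keeps the commas literal instead of percent-encoding them as %2C.
    query = urlencode({"models": ",".join(model_ids)}, safe=",")
    return f"{base_url}?{query}"


# "nb" and "nbp" are the only model IDs shown in the thread.
print(comparison_url(["nb", "nbp"]))
```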



handsclean 3 months ago

Please consider changing pass/fail to an integer score out of maybe 5. This test is becoming more and more misleading as your apparent desire to give due credit conflicts with quality improvements over already ok-ish models. For example, on the great wave Gemini 3’s excellent rendition gets no additional credit over Qwen technically not failing if one is generous, and on cards, there’s actually no score distinction between results that one could or could not use.
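
To illustrate the suggestion, here is a minimal sketch of how a binary verdict hides differences that a 0-5 score would keep. The model names, the scores, and the pass threshold are all hypothetical and are not taken from the site.

```python
# Hypothetical 0-5 quality scores for one prompt; these numbers are made up
# for illustration only.
scores = {"model_a": 5, "model_b": 2, "model_c": 1}

PASS_THRESHOLD = 2  # assumed cutoff: any score at or above this counts as a pass


def to_pass_fail(score: int, threshold: int = PASS_THRESHOLD) -> bool:
    """Collapse a 0-5 score into a binary pass/fail verdict."""
    return score >= threshold


for model, score in scores.items():
    verdict = "pass" if to_pass_fail(score) else "fail"
    print(f"{model}: {score}/5 -> {verdict}")
```

Under these assumed numbers, `model_a` and `model_b` both end up as a pass even though their scores differ by three points, which is the loss of information the comment describes.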

tylervigen 3 months ago

I think Nano Banana Pro’s answer to the giraffe edit is far superior to the Seedream response, but you passed Seedream and failed NB Pro.

Maybe that one is just not a good test?

handsclean 3 months ago

I thought so too at first, but zoom in to where the neck joins the head. What looks like the head’s shadow from a distance is actually a hard seam between thick neck and thin neck, with much of the apparent shadow actually a cutout showing the background.
Looks like the Seedream result here has been changed to fail, which I’d agree with, too. Pose change complaints aside, I think that neck is actually the same length were it held straight.
tziki 3 months ago

I agree, it seems like Seedream has the neck at the same length as Nano Banana but also made the giraffe crouch down, making a major modification to the overall picture.
strbean 3 months ago
If you look closely, the NBP giraffe has a gaping hole in its neck.
- IncreasePosts 3 months ago
  
  maybe that's just how his mom built him
robertwt7 3 months ago

Yeah, I agree: the prompt is to "shorten the giraffe's neck length", not to bend it. I feel like Gemini 3 produces a better result on that one.

sosodev 3 months ago

I think Nano Banana Pro should have passed your giraffe test. It's not a great result but it is exactly what you asked for. It's no worse than Seedream's result imo.

vunderba 3 months ago
Yeah I think that's a fair critique. It kind of looks like a bad cut-and-replace job (if you zoom in you can even see part of the neck is missing). I might give it some more attempts to see if it can do a better job.
I agree that Seedream could definitely be called out as a fail since it might just be a trick of perspective.
- sefrost 3 months ago
  
  Have you ever considered a “partial pass”?
  Perhaps it would be an easy cop out of making a decision if you had to choose something outside of pass/fail.
  
aqme28 3 months ago
I don’t understand at all why Seedream gets a pass there. The neck appears the same length but now it’s at a different angle.
- vunderba 3 months ago
  
  Alright I think it's time to concede defeat! Seedream has been summarily demoted to a failure and I've added in the following minimum passing criteria to that particular test:
  - The giraffe's neck should be noticeably shorter than in the original image, while still maintaining a natural appearance.
  - The final image cannot be accomplished by simply cropping out the neck or using perspective changes.
kevlened 3 months ago

I agree. From where I'm sitting, Seedream just bent the neck while Nano Banana Pro actually shortened the neck.
jonplackett 3 months ago

Yeah, it’s better than the weirdness of Seedream for sure.

humamf 3 months ago

The Pisa tower test is really interesting. Many of these prompts have stricter criteria that require implicit knowledge, and some models impressively pass them. Yet something as obvious as straightening a slanted object is hard even for the latest models.

kridsdale3 3 months ago
I suspect there'd be no problem rotating a different object. But this tower is EXTREMELY represented in the training data. It's almost an immutable law of physics that Towers in Pisa are Leaning.
- gridspy 3 months ago
  
  It's also a tower that has famously been deliberately un-straightened just enough to remain a tourist attraction while remaining stable.
  

dyauspitr 3 months ago

Seedream generally looks like low quality outputs and it doesn’t seem like you’re assigning points for quality. This is only marginally helpful.

vunderba 3 months ago
That's because, for the most part, I'm not:
"A comparison of various SOTA generative image models on specific prompts and challenges with a strong emphasis placed on adherence."
Adherence is the more interesting problem, in my opinion, because quality issues can be ameliorated through the use of upscalers, refiner models, LoRAs, and similar tools. Furthermore, there are already a thousand existing benchmarks obsessed with visual fidelity.
- dyauspitr 3 months ago
  
  I mean there’s a huge difference between a model that throws a black spot on someone’s head and another one that fills it with hair indistinguishable from the real thing. Which is why I’m saying this methodology is only marginally useful.

Nifty3929 3 months ago

Would you leave one of the originals in each test visible at all times (a control) so that I can see the final image(s) that I'm considering and the original image at the same time?

I guess if you do that then maybe you don't need the cool sliders anymore?

Anyway - thanks so much for all your hard work on this. A very interesting study!

rl3 3 months ago

"Remove all the trash from the street and sidewalk. Replace the sleeping person on the ground with a green street bench. Change the parking meter into a planted tree."

Three sentences that do a great job summing up modern big tech. The new model even manages to [digitally] remove all trash.

andrepd 3 months ago
Yep, no need for actual urbanism or to worry about the homeless, now governments and realtors can lie to you more conveniently and at an industrial scale! Yay future
- jamiek88 3 months ago
  
  And one day we’ll wear glasses that do the same! Then we can solve (ignore) all problems!
  
noduerme 3 months ago

The better to sell you real estate...

tiagod 3 months ago

Cool site, thanks! By the way, the "Before" and "After" buttons are swapped.

noduerme 3 months ago

I had to look up what a "skifter" is. An AI answer showed that it's Norwegian for a switch.

I'm curious, does the word have a further meaning in the context of cheating at cards?

vunderba 3 months ago
It's an admittedly obscure reference to a cheating technique used in the Star Wars card game sabacc, which allows a player to surreptitiously switch out a card. I’m pretty sure I picked it up from one of Timothy Zahn's Thrawn books when I was a kid.
But I didn't know it had a meaning in Norwegian, so I guess TIL!
- noduerme 3 months ago
  
  Hah. I loved those Timothy Zahn books. Don't remember that one, though!

Wyverald 3 months ago

thanks, I love your website. Are you planning to do NB Pro for the text-to-image benchmark too?

vunderba 3 months ago
Outside the time frame of being able to edit my original reply, but I've finally re-run the Text-to-Image portion of the site through NB Pro.
Results:
  1. gpt-image-1: 10 / 12
  2. Nano Banana Pro: 9 / 12
  3. Nano Banana: 8 / 12
It's worth mentioning that even though it only scored slightly better than the original NB, many of the images are significantly better looking.
https://genai-showdown.specr.net?models=nb,nbp
- happyopossum 3 months ago
  
  Awesome test suite. For the maze though, not sure it’s fair to knock it for extra dashed lines as the prompt didn’t specify that only the correct path should have one…
- Wyverald 3 months ago
  
  thanks for the update. One small note: for the d20 test, NB Pro had duplications of 13 and 17 too, not just 19.
  
vunderba 3 months ago

Definitely! Even though NB's predominant use case seems to be editing, it's still producing surprisingly decent text-to-image results. Imagen4 currently still comes out ahead in terms of image fidelity, but I think NB Pro will close the gap even further.
I'll try to have the generative comparisons for NB Pro up later this afternoon once I catch my breath.