Comment by vunderba

19 hours ago

Alright, results are in! I've re-run all my editing-based, adherence-related prompts through Nano Banana Pro. NB Pro passed SHRDLU, the M&M Van Halen test (as verified independently by Simon), and the Scorpio street test, all of which the original NB failed.

  Model results
  1. Nano Banana Pro: 10 / 12
  2. Seedream4: 9 / 12
  3. Nano Banana: 7 / 12
  4. Qwen Image Edit: 6 / 12

https://genai-showdown.specr.net/image-editing

If you just want to see how NB and NB Pro compare against each other:

https://genai-showdown.specr.net/image-editing?models=nb,nbp

Please consider changing pass/fail to an integer score out of maybe 5. This test is becoming more and more misleading as your apparent desire to give due credit conflicts with quality improvements over already ok-ish models. For example, on the Great Wave test, Gemini 3’s excellent rendition gets no more credit than Qwen’s, which technically doesn’t fail if one is generous; and on the cards test, there’s no score distinction between results one could actually use and results one could not.
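To make the suggestion concrete, here is a minimal sketch of how a 0-5 rating could sit alongside the existing pass/fail tally. Everything in it is an assumption for illustration: the `ModelRatings` class, the `PASS_THRESHOLD` of 3, the model names, and all the ratings are hypothetical and are not taken from the site's actual results.

```python
from dataclasses import dataclass

# Assumed cutoff: a rating of 3 or more (out of 5) would count as a pass.
PASS_THRESHOLD = 3


@dataclass
class ModelRatings:
    name: str
    ratings: list[int]  # one hypothetical 0-5 rating per test

    def passes(self) -> int:
        """Number of tests that clear the pass threshold (the current metric)."""
        return sum(r >= PASS_THRESHOLD for r in self.ratings)

    def total(self) -> int:
        """Sum of the 0-5 ratings (the proposed finer-grained metric)."""
        return sum(self.ratings)


def summarize(models: list[ModelRatings]) -> dict[str, tuple[int, int]]:
    """Map each model name to (pass count, total rating)."""
    return {m.name: (m.passes(), m.total()) for m in models}


# Hypothetical ratings: both models pass the same three tests,
# but one does so with much stronger results.
model_a = ModelRatings("model_a", [5, 5, 4, 2])
model_b = ModelRatings("model_b", [3, 3, 3, 0])
summary = summarize([model_a, model_b])
```

With these made-up numbers both models have the same pass count, so the pass/fail table would show a tie, while the totals separate them.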

I think Nano banana pro’s answer to the giraffe edit is far superior to the Seedream response, but you passed Seedream and failed NB pro.

Maybe that one is just not a good test?

  • I thought so too at first, but zoom in to where the neck joins the head. What looks like the head’s shadow from a distance is actually a hard seam between thick neck and thin neck, with much of the apparent shadow actually a cutout showing the background.

    Looks like the Seedream result here has been changed to fail, which I’d agree with, too. Pose change complaints aside, I think that neck is actually the same length were it held straight.

  • I agree, it seems like Seedream has the neck at the same length as Nano Banana but also made the giraffe crouch down, making a major modification to the overall picture.

  • Yeah, I agree. The prompt is to "shorten the giraffe's neck length", not to bend it. I feel like Gemini 3 produces a better result on that one.

I think Nano Banana Pro should have passed your giraffe test. It's not a great result but it is exactly what you asked for. It's no worse than Seedream's result imo.

  • Yeah I think that's a fair critique. It kind of looks like a bad cut-and-replace job (if you zoom in you can even see part of the neck is missing). I might give it some more attempts to see if it can do a better job.

    I agree that Seedream could definitely be called out as a fail since it might just be a trick of perspective.

    • Have you ever considered a “partial pass”?

      Perhaps it would be an easy cop-out from making a decision, if you had to choose something outside of pass/fail.


  • I don’t understand at all why Seedream gets a pass there. The neck appears the same length but now it’s at a different angle.

    • Alright I think it's time to concede defeat! Seedream has been summarily demoted to a failure and I've added in the following minimum passing criteria to that particular test:

      - The giraffe's neck should be noticeably shorter than in the original image, while still maintaining a natural appearance.

      - The final image cannot be accomplished by simply cropping out the neck or using perspective changes.

  • I agree. From where I'm sitting, Seedream just bent the neck while Nano Banana Pro actually shortened the neck.

The Pisa tower test is really interesting. Many of these prompts have stricter criteria that rely on implicit knowledge, and some models impressively pass them. Yet something as obvious as straightening a slanted object is hard even for the latest models.

  • I suspect there'd be no problem rotating a different object. But this tower is EXTREMELY represented in the training data. It's almost an immutable law of physics that Towers in Pisa are Leaning.

    • It's also a tower that has famously been deliberately un-straightened just enough to remain a tourist attraction while remaining stable.


I had to look up what a "skifter" is. An AI answer showed that it's Norwegian for a switch.

I'm curious, does the word have a further meaning in the context of cheating at cards?

  • It's an admittedly obscure reference to a cheating technique used in the Star Wars card game sabacc, which allows a player to surreptitiously switch out a card. I’m pretty sure I picked it up from one of Timothy Zahn's Thrawn books when I was a kid.

    But I didn't know it had a meaning in Norwegian, so I guess TIL!

Would you leave one of the originals in each test visible at all times (a control) so that I can see the final image(s) that I'm considering and the original image at the same time?

I guess if you do that then maybe you don't need the cool sliders anymore?

Anyway - thanks so much for all your hard work on this. A very interesting study!

"Remove all the trash from the street and sidewalk. Replace the sleeping person on the ground with a green street bench. Change the parking meter into a planted tree."

Three sentences that do a great job summing up modern big tech. The new model even manages to [digitally] remove all trash.

  • Yep, no need for actual urbanism or to worry about the homeless, now governments and realtors can lie to you more conveniently and at an industrial scale! Yay future

Thanks, I love your website. Are you planning to do NB Pro for the text-to-image benchmark too?

  • Outside the time frame of being able to edit my original reply, but I've finally re-run the Text-to-Image portion of the site through NB Pro.

    Results
    1. gpt-image-1: 10 / 12
    2. Nano Banana Pro: 9 / 12
    3. Nano Banana: 8 / 12

    It's worth mentioning that even though it only scored slightly better than the original NB, many of the images are significantly better looking.

    https://genai-showdown.specr.net?models=nb,nbp

    • Awesome test suite. For the maze though, not sure it’s fair to knock it for extra dashed lines as the prompt didn’t specify that only the correct path should have one…

  • Definitely! Even though NB's predominant use case seems to be editing, it's still producing surprisingly decent text-to-image results. Imagen4 currently still comes out ahead in terms of image fidelity, but I think NB Pro will close the gap even further.

    I'll try to have the generative comparisons for NB Pro up later this afternoon once I catch my breath.

Seedream’s outputs generally look low quality, and it doesn’t seem like you’re assigning points for quality. This is only marginally helpful.

  • That's because, for the most part, I'm not:

    "A comparison of various SOTA generative image models on specific prompts and challenges with a strong emphasis placed on adherence."

    Adherence is the more interesting problem, in my opinion, because quality issues can be ameliorated through the use of upscalers, refiner models, LoRAs, and similar tools. Furthermore, there are already a thousand existing benchmarks obsessed with visual fidelity.

    • I mean there’s a huge difference between a model that throws a black spot on someone’s head and another one that fills it with hair indistinguishable from the real thing. Which is why I’m saying this methodology is only marginally useful.