Comment by fariszr

8 months ago

This is the GPT-4 moment for image editing models. Nano Banana, a.k.a. Gemini 2.5 Flash Image, is insanely good. It made a 171-point Elo jump on LMArena!

Just search "nano banana" on Twitter to see the crazy results. Here's an example: https://x.com/D_studioproject/status/1958019251178267111

I've been testing it for several weeks. It can produce results that are truly epic, but it's still a case of rerolling the prompt a dozen times to get an image you can use. It's not God. It's definitely an enormous step though, and totally SOTA.

  • If you compare to the amount of effort required in Photoshop to achieve the same results, still a vast improvement

  • The model seems good, but it still has huge issues and produces garbage most of the time, lol.

    Still needs more RLHF tuning, I guess? The previous version was even worse.

  • Is it because the model is not good enough at following the prompt, or because the prompt is unclear?

    Something similar has been the case with text models: people write vague instructions and are dissatisfied when the model doesn't correctly guess their intentions. With image models it's even harder for the model to guess right without enough detail.

    • Remember in image editing, the source image itself is a huge part of the prompt, and that's often the source of the ambiguity. The model may clearly understand your prompt to change the color of a shirt, but struggle to understand the boundaries of the shirt. I was just struggling to use AI to edit an image where the model really wanted the hat in the image to be the hair of the person wearing it. My guess for that bias is that it had just been trained on more faces without hats than with them on.

    • No, my prompts are very, very clear. It just won't follow them sometimes. Also this model seems to prefer shorter prompts, in my experience.

Before AI, people complained that Google was taking world class engineering talent and using it for little more than selling people ads.

But look at that example. With this new frontier of AI, that world class engineering talent can finally be put to use…for product placement. We’ve come so far.

  • > finally be put to use…for product placement.

    Did you think that Google would just casually allow their business to be disrupted without using the technology to improve the business and also protecting their revenue?

    Both Meta and Google have indicated that they see generative AI as a way to vertically integrate within the ad space, disrupting marketing teams, copywriters, and other roles that monitor or improve ad performance.

    Also FWIW, I would suspect that the majority of Google engineers don't work on an ad system, and probably don't even work on a profitable product line.

  • Oh come on - you have this incredible technology at your disposal and all you can think to use it for is product placement?

  • I am pretty sure a lot of said engineering talent isn't actually contributing to AI but doing other stuff

Another nitpick: the pink puffer jacket that got edited into the picture is not the same as the one in the reference image. It's very similar, but if I were using this model for product placement, or cared about this sort of detail, I'd definitely have issues with it.

  • Even in the just-photoshop-not-ai days product photos had become pretty unreliable as a means of understanding what you're buying. Of course it's much worse now.

    • Note: Please understand that monitor may color different. If image does not match product received then kindly your monitor calibration. Seller not responsible. /ebay&amazon


Alarming hands on the third one: it can't decide which way they're facing. But Gemini didn't introduce that, it's there in the base image.

  • Yes, the base image's hands are creepy.

    • I noticed the AI pattern on the sunglasses first. I guess all of the source images are AI-generated? In a sense, that makes the result slightly less impressive -- is it going to be as faithful to the original image when the input isn't already a highly likely output for an AI model? Were the input images generated with the same model that's being used to manipulate them?


It seems like every combination of "nano banana" is registered as a domain with its own unique UI for image generation... are these all middlemen playing credit arbitrage on a popular model name?

  • I'd assume they're just fake: they take your money and use a different model under the hood. They already existed before the public release, and I doubt their backends rolled the dice on LMArena until nano-banana popped up. That was the only way to use it until today.

    • Agreed, I didn't mean to imply that they were even attempting to run the actual nano banana, even through LMarena.

      There is a whole spectrum of potential sketchiness to explore with these, since I see a few "sign in with Google" buttons that remind me of phishing landing pages.

  • They're almost all scams. Nano banana AI image generator sites were showing up when this model was still only available in LM Arena.

Completely agree. I make logos for my GitHub projects for fun, and the last time I tried SOTA image generation for logos, it consistently ignored instructions and did nothing close to what I was asking for. Google's new release today did it near flawlessly, exactly how I wanted it, in a single prompt. A couple more prompts for tweaking (centering it, rotating it slightly) got it perfect. This is awesome.

Regardless, it seems Google is on the frontier of every type of model, plus robotics (cars). It's nutty how we forget what an intellectual juggernaut they are.

I wonder what the creative workflow will look like when these kinds of models are natively integrated into digital image tools. Imagine fine-grained controls on each layer and its composition, with semantic understanding of the full picture.

Why is it called nano banana?

No, it's not really that much of an improvement. Once you start coming up with specific tasks, it fails just like the others.

> This is the GPT-4 moment for image editing models.

No it's not.

We've had rich editing capabilities since gpt-image-1, this is just faster and looks better than the (endearingly? called) "piss filter".

Flux Kontext, SeedEdit, and Qwen Edit are all also image editing models that are robustly capable. Qwen Edit especially.

Flux Kontext and Qwen are also possible to fine tune and run locally.

Qwen (and its video gen sister Wan) are also Apache licensed. It's hard not to cheer Alibaba on given how open they are compared to their competitors.

We've left the days of Dall-E, Stable Diffusion, and Midjourney of "prompt-only" text to image generation.

It's also looking like tools like ComfyUI are less and less necessary as those capabilities are moving into the model layer itself.

  • In other words, this is the GPT-4 moment for image editing models.

    GPT-4 isn't "fundamentally different" from GPT-3.5; it's just better. That's exactly the point the parent commenter was making.

  • I'm confused as well. I thought gpt-image could already do most of these things, but I guess the key difference is that gpt-image is not good for single-point edits. In terms of "wow" factor it doesn't feel as big as GPT-3 to GPT-4, though, since it sure _felt_ like models could already do this.

    • People really slept on gpt-image-1 and were too busy making Miyazaki/Ghibli images.

      I feel like most of the people on HN are paying attention to LLMs and missing out on all the crazy stuff happening with images and videos.

      LLMs might be a bubble, but images and video are not. We're going to have entire world simulation in a few years.