Comment by minimaxir

1 day ago

I...worked on the detailed Nano Banana prompt engineering analysis for months (https://github.com/minimaxir/gemimg) without pushing a new version by passing:

    from gemimg import GemImg
    g = GemImg(model="gemini-3-pro-image-preview")

I'll add the new output resolutions and other features ASAP. However, looking at the pricing (https://ai.google.dev/gemini-api/docs/pricing#standard_1), I'm definitely not changing the default model to Pro as $0.13 per 1k/2k output will make it a tougher sell.

EDIT: Something interesting in the docs: https://ai.google.dev/gemini-api/docs/image-generation#think...

> The model generates up to two interim images to test composition and logic. The last image within Thinking is also the final rendered image.

Maybe that's partially why the cost is higher: it's hard to tell if intermediate images are billed in addition to the output. However, this could cause an issue with the base gemimg and have it return an intermediate image instead of the final image depending on how the output is constructed, so will need to double-check.
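A minimal sketch of the defensive approach, based only on the docs line quoted above (the last image is the final render). The `Part` class and its field names are hypothetical stand-ins, not the real SDK types and not how gemimg actually builds its output:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Part:
    """Hypothetical stand-in for one response part (not the real SDK type)."""
    text: Optional[str] = None
    image_bytes: Optional[bytes] = None
    is_thought: bool = False  # assumed flag marking interim "thinking" output


def pick_final_image(parts: Sequence[Part]) -> Optional[bytes]:
    """Return the last image found in the parts, or None if there is none."""
    images = [p.image_bytes for p in parts if p.image_bytes is not None]
    return images[-1] if images else None


# Two interim images and a text thought, followed by the final render.
parts = [
    Part(image_bytes=b"interim-1", is_thought=True),
    Part(text="adjusting composition", is_thought=True),
    Part(image_bytes=b"interim-2", is_thought=True),
    Part(image_bytes=b"final"),
]
final_image = pick_final_image(parts)
```

Taking the last image rather than the first keeps the result correct whether or not the interim images show up in the response.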

>> - Put a strawberry in the left eye socket.
>> - Put a blackberry in the right eye socket.

>> All five of the edits are implemented correctly

This is a GREAT example of the (not so) subtle mistakes AI will make in image generation, or code creation, or your future knee surgery. The model placed the specified items in the eye sockets based on the viewer's left/right; when we speak in relative terms in this scenario, we usually (always?) mean from the perspective of the target or "owner". Doctors make this mistake too (they typically mark the correct side with a Sharpie while the patient is still alert), but I'd be more concerned if we're "outsourcing" decision-making without adequate oversight.

https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nan...

  • There's a classic well-illustrated book, _How to Keep Your Volkswagen Alive_, which spends a whole illustrated page at the beginning building up a reference frame for working on the vehicle. Up is sky, down is ground, front is always vehicle's front, left is always vehicle's left.

    Sounds a bit silly to write it out, but the diagram did a great job removing ambiguity when you expect someone to be lying on the ground in a tight place looking backwards, upside down.

    Also feels important to note that in the theatre, there is stage-right and stage-left, jargon to disambiguate even though the jargon expects you to know the meaning to understand it.

    • Port and starboard

      I guess car people use “driver side” and “passenger side”, but the same car might be sold in mirror-image versions.

  • > when we talk relative in this scenario we usually (always?) mean from the perspective of the target or "owner".

    I dunno... I feel pretty confident 99 percent of people would do the same thing, and put the strawberry in the eye socket to our left, the viewer's.

    You really have to be trained explicitly to put yourself in the subject's shoes, and very few people are. To me, the model is correctly following the instructions most people will mean.

    And it's not even incorrect. "The left x" is linguistically ambiguous. If you say "the left flower", it's obviously the flower to our left. So when you say "the left eye socket", the eye socket to our left is a valid interpretation. If they had said "their" or "its" left eye socket, then it's more arguable that it must be from the subject's side. But that's not the case in this example.

  • >This is a GREAT example of the (not so) subtle mistakes AI will make in image generation, or code creation, or your future knee surgery.

    The mistake is in the prompting (not enough information). The AI did the best it could.

    "What's the biggest known planet" "Jupiter" "NO I MEANT IN THE UNIVERSE!"

  • That was a big problem when I was toying around with the original Nano Banana. I always prompted from the perspective of the (imaginary) camera, and yet NB often interpreted that as the perspective of the target, giving no way to select the opposite side. Since the selected side is generally closer to the camera, my usual workaround is to force the side far from the camera. And yet that was not perfect.

  • I don't know if that's so much a mistake as it is ambiguity though? To me, using the viewer's perspective in this case seems totally reasonable.

    Does it still use the viewer's perspective if the prompt specifies "Put a strawberry in the _patient's left eye_"? If it does, then you're onto something. Otherwise I completely disagree with this.

  • I meant to add a clarification to that point (because the ambiguity is a valid counterpoint), thanks for the reminder.

In case anyone missed Max's Nano Banana prompting guide, it's absolutely the definitive manual for prompting the original Nano Banana... and I tried some of the prompts in there against Nano Banana Pro and found it to be very applicable to the new model as well.

https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nan...

My recreations of those pancake batter skulls using Nano Banana Pro: https://simonwillison.net/2025/Nov/20/nano-banana-pro/#tryin...

Minor clarification, the cost for every input image is $0.0011, not $0.06.

  • I was going off the footnote of "Image input is set at 560 tokens or $0.067 per image" but 560 * 2 / 1_000_000 is indeed $0.0011 so I have no idea where the $0.067 came from. Fixed, and this is why I typically don't read docs without coffee.
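The arithmetic in the reply above, written out as a quick check. The $2 per million input tokens rate is inferred from the "560 * 2 / 1_000_000" expression rather than taken from the pricing page, so treat it as an assumption; the function name is made up for illustration:

```python
TOKENS_PER_INPUT_IMAGE = 560          # from the footnote quoted above
USD_PER_MILLION_INPUT_TOKENS = 2.00   # assumed rate, inferred from the reply


def input_image_cost(n_images: int = 1) -> float:
    """Estimated input cost in USD for n_images under the assumed rate."""
    tokens = n_images * TOKENS_PER_INPUT_IMAGE
    return tokens * USD_PER_MILLION_INPUT_TOKENS / 1_000_000


print(f"${input_image_cost():.4f} per input image")
```

Under those assumptions a single input image comes to about a tenth of a cent, nowhere near $0.067.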

I just pushed gemimg 0.3.2 which adds image_size support for Nano Banana Pro, and I ran a few tests on some of the images in the blog. In my testing, Nano Banana Pro correctly handled most of the image generation errors noted in my blog post: https://x.com/minimaxir/status/1991580127587921971

- Fibonacci magnets: code is correctly indented and the syntax highlighting at least tries giving variables, numbers, and keywords different colors.

- Make me a Studio Ghibli: actually does style transfer correctly, and does it better than ChatGPT ever did.

- Rendering a webpage from HTML: near-perfect recreation of the HTML, including text layout and element sizing.

That said, there may be regressions where, even with prompt engineering, the more photorealistic generated images look too good and land back in the uncanny valley. I haven't decided if I'm going to write a follow-up blog post yet.

The system prompt hacking trick doesn't work with Nano Banana Pro unfortunately.

Your wrapper is awesome and still relevant.

> "I...worked on the detailed Nano Banana prompt engineering analysis for months"

Early in four decades of tech innovation I wasted time layering on fixes for clear deficiencies in a snowballing trend's tech offerings. If it's a big enough trend to have well funded competitors, just wait. The concern is likely not unique, and will likely be solved tomorrow.

I realized it's better to learn adaptive/defensive techniques, giving your product resilience to change. Your goal is that when surfing the change waves you can pick a point you like between rock solid and cutting edge and surf there safely.

Invest that "remediate their thing" time in "change resilience" instead – pays dividends from then on. It can be argued your tool is in this camp!

// Getting better at this also helps you with zero days.

btw you should get on their Trusted Testers program, they do give early heads up

GDM folks, get Max on!

Yes, they are pricey, but the price will go down over time and then you can switch. vlm.run got access as early customers and are releasing it for free with unlimited generations (until they are bottlenecked by Google). Some results here combining image gen (Nano Banana Pro) with video gen (Veo 3.1) in a single chat: https://chat.vlm.run/c/1c726fab-04ef-47cc-923d-cb3b005d6262 It combined the synthetic generation of a person and made the puppet dance. Quite impressive.

> The model generates up to two interim images to test composition and logic. The last image within Thinking is also the final rendered image.

I've been using a bespoke Generative Model -> VLM Validator -> LLM Prompt Modifier REPL as part of my benchmarks for a while now so I'd be curious to see how this stacks up. From some preliminary testing (9 pointed star, 5 leaf clover, etc) - NB Pro seems slightly better than NB though it still seems to get them wrong. It's hard to tell what's happening under the covers.
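For anyone unfamiliar with the pattern, here is a rough sketch of what a generate -> validate -> modify loop can look like. This is not the commenter's actual setup: all function names are hypothetical, and the three stages are passed in as plain callables so that no specific model API is assumed.

```python
from typing import Callable, Optional, Tuple


def refine(
    prompt: str,
    generate: Callable[[str], bytes],                # generative model stage
    validate: Callable[[bytes], Tuple[bool, str]],   # VLM validator stage
    modify: Callable[[str, str], str],               # LLM prompt modifier stage
    max_rounds: int = 3,
) -> Tuple[Optional[bytes], str, int]:
    """Loop until the validator accepts an image or max_rounds is reached.

    Returns (image or None, last prompt, rounds used).
    """
    for round_no in range(1, max_rounds + 1):
        image = generate(prompt)
        ok, feedback = validate(image)
        if ok:
            return image, prompt, round_no
        prompt = modify(prompt, feedback)
    return None, prompt, max_rounds


# Toy stand-ins so the loop can be run without any model.
def fake_generate(prompt: str) -> bytes:
    return prompt.encode()


def fake_validate(image: bytes) -> Tuple[bool, str]:
    if b"exactly 9 points" in image:
        return True, ""
    return False, "wrong number of points"


def fake_modify(prompt: str, feedback: str) -> str:
    return f"{prompt}, exactly 9 points"


result = refine("a 9 pointed star", fake_generate, fake_validate, fake_modify)
```

With the toy stand-ins the first round fails validation, the prompt is amended once, and the second round passes.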

This reminds me of the journalist working for months on uncovering Trump's dirty business just for Trump himself to admit the entire thing in a tweet.

  • It's written to mimic that style, but without meaning that the work has been done for them, just that there is new work to be done, making it an odd, perhaps unconscious, reference.

This is pretty cool! Have you found success with image editing in Nano Banana? I mean Photoshop-like stuff. Your article left me wondering whether Nano Banana is good for editing versus generating new images.