Comment by vunderba

8 days ago

Good read, minimaxir! From the article:

> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.

In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by inserting a Mistral 7B LLM in between, which takes a given prompt as input and creates four variations of it before sending them all out.
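
Roughly, that intermediate step looks like this (a minimal sketch, assuming a local Mistral 7B served through Ollama; the model tag and the final image-generation call are stand-ins, not my actual pipeline code):

```python
# Expand one prompt into N variations with a local Mistral 7B via Ollama.
# Any chat-completion endpoint works the same way.
import ollama  # assumes `pip install ollama` and a running Ollama server

def expand_prompt(prompt: str, n: int = 4) -> list[str]:
    """Ask the LLM for n distinct elaborations of the input prompt."""
    response = ollama.chat(
        model="mistral:7b",  # placeholder tag; use whatever build you run locally
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite the following image prompt {n} ways, one per line, "
                f"varying composition and detail but keeping the subject:\n{prompt}"
            ),
        }],
    )
    lines = [l.strip() for l in response["message"]["content"].splitlines() if l.strip()]
    return lines[:n]

# Each variation then goes to the image model independently.
for variant in expand_prompt("a two-panel comic about Zeno's paradox"):
    print(variant)  # stand-in for the actual image-generation call
```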

> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.

This is true, though I find it works better with a minimum of two images: the first image is the one to be transformed, and the second serves as a "stylistic aesthetic reference". This doesn't always work, since you're still bound by the original training data, but it is sometimes more effective than trying to type out a long flavor-text description of the style.
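
For anyone who wants to try the two-image trick, here's a rough sketch of the call through the Gemini API (Nano Banana being Gemini 2.5 Flash Image); the model name and file paths are placeholders, so check the current docs:

```python
# Two-image style transfer: first image is the content, second is the style ref.
from google import genai
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment

content_img = Image.open("photo_to_transform.png")   # image to be transformed
style_img = Image.open("aesthetic_reference.png")    # stylistic reference

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # verify the current model name
    contents=[
        "Redraw the first image in the visual style of the second image. "
        "Treat the second image purely as a stylistic aesthetic reference.",
        content_img,
        style_img,
    ],
)

# Save any returned image parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("styled_output.png", "wb") as f:
            f.write(part.inline_data.data)
```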

[1] - https://mordenstar.com/portfolio/zeno-paradox

It might also be an explicit guard against Studio Ghibli specifically, added after the "make me Ghibli" trend a while back, which understandably upset the studio.