Comment by ashleyn

21 hours ago

Does anyone know if this is predicting the entire image at once, or if it's breaking it into constituent steps i.e. "draw text in this font at this location" and then composing it from those "tools"? It would be really interesting if they've solved the garbled text problem within the constraint of predicting the entire image at once.
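To illustrate what I mean by "constituent steps": the model would emit a structured plan of draw commands rather than raw pixels, and a renderer would execute them. A toy sketch of that idea (the tool names, op format, and character-grid renderer are all made up for illustration; nothing here is the actual pipeline):

```python
# Toy sketch of the "compose from tools" hypothesis: the model outputs
# structured draw ops, and a deterministic renderer composites them.
# Everything here is hypothetical illustration, not the real system.

def render(ops, width=20, height=5):
    """Execute a list of draw ops onto a character grid and return it as text."""
    canvas = [[" "] * width for _ in range(height)]
    for op in ops:
        if op["tool"] == "draw_text":
            x, y = op["pos"]
            for i, ch in enumerate(op["text"]):
                if 0 <= x + i < width and 0 <= y < height:
                    canvas[y][x + i] = ch
    return "\n".join("".join(row) for row in canvas)

# The model's "plan" would be a sequence of such tool calls:
plan = [
    {"tool": "draw_text", "text": "OHIO", "pos": (3, 1)},
    {"tool": "draw_text", "text": "IOWA", "pos": (10, 3)},
]
print(render(plan))
```

Because the renderer is deterministic, text drawn this way can never come out garbled, which is why garbled output is evidence against a pure tool-composition approach.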

I strongly suspect it's the latter, though someone please chime in if I'm wrong.

Even so, this is a real advancement. It's impressive to see existing techniques combined to meaningfully improve on SOTA image generation.

The previous nano banana composed images from tools, which was really obvious from some of the janky outputs it produced. Not sure about this one, but presumably they built on it.

There is still some garbled text sometimes, so it can't be purely the latter. Try getting it to generate a map of the 48 contiguous US states with labels: in my one attempt, the states too small to write on, which need arrows, came out garbled.

I'm pretty sure, though I'm no expert on the matter, that correct text rendering was solved by feeding bitmaps of rasterized fonts into the image generation models as supplemental context.
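If that's right, the intuition is that the model copies pixel shapes from a reference bitmap instead of hallucinating letterforms. A rough sketch of what "rasterize a font and package it as conditioning" could look like (the 5x3 glyphs and the context dict are invented for illustration; I have no knowledge of the actual conditioning format):

```python
# Hypothetical sketch: rasterize text into a bitmap and attach it to the
# prompt as supplemental context. The glyph set and packaging are made up.

GLYPHS = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
}

def rasterize(text):
    """Render text into a single bitmap: a list of rows of 0/1 pixels."""
    rows = [[] for _ in range(5)]
    for ch in text:
        glyph = GLYPHS[ch]
        for r in range(5):
            rows[r].extend(1 if c == "#" else 0 for c in glyph[r])
            rows[r].append(0)  # 1-pixel gap between glyphs
    return rows

# Supplemental context fed alongside the text prompt:
context = {"prompt": "a sign that says HI", "font_bitmap": rasterize("HI")}
```

A real system would rasterize an actual font (e.g. via FreeType) at the target size, but the principle is the same: give the generator ground-truth pixels for the letterforms it needs to reproduce.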