Comment by Filligree

1 day ago

It’s not necessarily harder than other aspects. However:

- It requires an AI that actually understands English, i.e. an LLM. Older, diffusion-only models were naturally terrible at that, because they weren’t trained on it.

- It requires the AI to make no mistakes in image rendering, and that’s a high bar. Mistakes in image generation are so common we have memes about them, and even though hands generally work fine now, the rest of the picture is full of mistakes you can’t tell are mistakes. Getting away with that is entirely impossible with text, where any mistake is immediately obvious.

Nano Banana Pro seems to somewhat reliably produce entire pictures without any mistakes at all.