Comment by doctorpangloss

8 days ago

lots of words

Okay, look at Imagen 4 Ultra:

https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

In this link, when Imagen is instructed to render the prompt “the result of 4+5” verbatim, it shows that text; when it is not so instructed, it renders “4+5=9”.

Is Imagen thinking?

Let's compare to Gemini 2.5 Flash Image (Nano Banana):

Look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

Gemini is instructed to reply in images first, and if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the text “the result of 4+5” verbatim without also showing the answer, 4+5=9. Of course it can show whatever exact text you want; the question is, does it rewrite the prompt (no) or do something else (yes)?

Compare to Ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0

Without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2

We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.

Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.