← Back to context

Comment by mFixman

8 days ago

The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!

This looks like it's caused by 99% of the relative directions in image descriptions describing them from the looker's point of view, and that 99% of the ones that aren't it they refer to a human and not to a skull-shaped pancake.

I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."

  • Exactly what I was thinking too. I'm a designer, and I'm used to receiving feedback and instructions. "The left eye socket" would to me refer to what I currently see in front of me, while "its left eye socket" instantly shift the perspective from me to the subject.

    • I find this interesting. I've always described things from the users point of view. Like the left side of a car, regardless of who is looking at it from what direction, is the driver side. To me, this would include a body.

      1 reply →

I picked up on that also. I feel that a lot of humans would also get confused about whether you mean the eye on the left, or the subject's left eye.

  • To be honest this is the sort of thing Nano Bannana is weak at in my experience. It's absolutely amazing - but doesn't understand left/right/up/down/shrink this/move this/rotate this etc.

    See below to demonstrate this weakness with the same prompts as the article see the link below, which demonstrates that it is a model weakness and not just a language ambiguity:

    https://gemini.google.com/share/a024d11786fc

    • Mmh, ime you need to discard the session/rewrite the failing prompt instead of continuing and correcting on failures. Once errors occur you've basically introduced a poison pill which will continuously make things to haywire. Spelling out what it did wrong is the most destructive thing you can do - at least in my experience

    • to the point where you can say, raise the left arm and then raise the right arm and get the same image with the same arm raised.

I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later in the post.

For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.

  • The lack of proper indentation (which you noted) in the Python fib() examples was even more apparent. The fact that both AIs you tested failed in the same way is interesting. I've not played with image generation, is this type of failure endemic?

    • My hunch in that case is that the composition of the image implied left-justified text which overwrote the indentation rule.

Came to make exactly the same comment. It was funny that the author specifically said that Nano Banana got all five edit prompts correct, rather than noting this discrepancy, which could be argued either way (although I think the "right eye" of a skull should be interpreted with respect to the skull's POV.)

Extroverts tend to expect directions from the perspective of the skull. Introverts tend to expect their own perspective for directions. It's a psychology thing, not an error.