Comment by mFixman

8 days ago

The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!

This looks like it's caused by 99% of the relative directions in image descriptions being given from the viewer's point of view, and 99% of the ones that aren't referring to a human rather than a skull-shaped pancake.

14 comments


jonas21 8 days ago

I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."

kjeksfjes 8 days ago
Exactly what I was thinking too. I'm a designer, and I'm used to receiving feedback and instructions. "The left eye socket" would to me refer to what I currently see in front of me, while "its left eye socket" instantly shifts the perspective from me to the subject.
- bear141 7 days ago
  
  I find this interesting. I've always described things from the user's point of view. Like the left side of a car, regardless of who is looking at it from what direction, is the driver side. To me, this would include a body.
  

martin-adams 8 days ago

I picked up on that also. I feel that a lot of humans would also get confused about whether you mean the eye on the left, or the subject's left eye.

Closi 8 days ago
To be honest, this is the sort of thing Nano Banana is weak at in my experience. It's absolutely amazing, but it doesn't understand left/right/up/down/shrink this/move this/rotate this etc.
The link below uses the same prompts as the article and demonstrates that this is a model weakness and not just a language ambiguity:
https://gemini.google.com/share/a024d11786fc
- ffsm8 8 days ago
  
  Mmh, ime you need to discard the session/rewrite the failing prompt instead of continuing and correcting on failures. Once errors occur you've basically introduced a poison pill which will continuously make things go haywire. Spelling out what it did wrong is the most destructive thing you can do, at least in my experience.
- astrange 8 days ago
  
  Almost no image/video models can do "upside-down" either.
- basch 8 days ago
  
  to the point where you can say, raise the left arm and then raise the right arm and get the same image with the same arm raised.

minimaxir 8 days ago

I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later in the post.

For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.
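
Roughly, that tweak amounts to something like the sketch below. This is only an illustration of appending the clause to a prompt string; the helper name `with_perspective_note` and the sample prompt are hypothetical, and it doesn't call any image API.

```python
# Clause quoted from the comment above; everything else here is a
# hypothetical illustration, not code from the original post.
PERSPECTIVE_NOTE = (
    "Any mentions of left and right are from the character's perspective, "
    "NOT the camera's perspective."
)


def with_perspective_note(prompt: str, note: str = PERSPECTIVE_NOTE) -> str:
    """Append the perspective note to a prompt, unless it is already there."""
    base = prompt.rstrip()
    if note in base:
        # Avoid stacking the same clause when the helper is applied twice.
        return base
    return f"{base}\n\n{note}"


if __name__ == "__main__":
    # Sample prompt is made up for demonstration purposes.
    print(with_perspective_note("Put a strawberry in the left eye socket."))
```

Whether the extra clause actually helps will depend on the model; the comment above only reports that it "did seem to improve success".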

frumiousirc 7 days ago
The lack of proper indentation (which you noted) in the Python fib() examples was even more apparent. The fact that both AIs you tested failed in the same way is interesting. I've not played with image generation; is this type of failure endemic?
- minimaxir 7 days ago
  
  My hunch in that case is that the composition of the image implied left-justified text which overwrote the indentation rule.

sib 8 days ago

Came to make exactly the same comment. It was funny that the author specifically said that Nano Banana got all five edit prompts correct, rather than noting this discrepancy, which could be argued either way (although I think the "right eye" of a skull should be interpreted with respect to the skull's POV.)

zulban 7 days ago

Extroverts tend to expect directions from the perspective of the skull. Introverts tend to expect their own perspective for directions. It's a psychology thing, not an error.