Comment by whoaoweird

10 days ago

It was interesting to see how often the OpenAI model changed the face of the child. Often the other two models wouldn't, but OpenAI would alter the shape of the child's head (making it rounder), the eyes (making them rounder), or the position and orientation of the children in the background.

It's like OpenAI regresses a little toward some sort of median face on all of these, whereas the other two models seemed to reproduce the original face.

For some things, exactly reproducing the face is a problem. For example, when making them into a glass etching, Gemini seemed unwilling to give up the specific details of the child's face, even though losing them would make sense in that context.

It looks to me like OpenAI's image pipeline takes an image as input, derives the semantic details, and then essentially regenerates an entirely new image based on the "description" obtained from the input image.
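
If that's right, the flow is roughly the sketch below. To be clear, caption_image and generate_image are stand-ins I invented for illustration, not OpenAI's actual internals or API; the point is only where the pixel detail gets lost:

    # Hypothetical "describe, then regenerate" edit pipeline (pure speculation,
    # not OpenAI's real architecture). caption_image() and generate_image()
    # are made-up stubs so this runs as-is.

    def caption_image(image: bytes) -> str:
        # A real system would run a vision model here.
        # Pixel-level identity (the exact face) is already gone after this step.
        return "a smiling child with short brown hair and a round face"

    def generate_image(prompt: str) -> str:
        # A real system would run an image-generation model here.
        return f"<new image synthesized from: {prompt!r}>"

    def edit_by_regeneration(image: bytes, edit_prompt: str) -> str:
        description = caption_image(image)                       # pixels -> semantics
        return generate_image(f"{description}, {edit_prompt}")   # semantics -> fresh pixels

    print(edit_by_regeneration(b"...", "as a glass etching"))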

Even Sam Altman's "Ghiblified" Twitter avatar looks nothing like him (at least to me).

Other models seem much more able to operate directly on the input image.

  • You can see this in the images of the Newton: in GPT's versions, the text and icons are corrupted.

  • Isn't this from the model working on really low-res images, and then being upscaled afterwards?

This is inherent in the architecture of ChatGPT. It's a unified model: text, images, etc. all become tokenized input. It's similar to re-encoding your image in a lossy format, except the format is the black box of ChatGPT's latent space.
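
To make the lossy-format analogy concrete, here's a toy round trip through a tiny random codebook. Real image tokenizers (VQ-VAE style) are learned rather than random, so this is only a crude illustration of why quantizing into a finite token vocabulary can never be pixel-exact:

    import numpy as np

    rng = np.random.default_rng(0)

    # Crude stand-in for a learned image tokenizer: a fixed random codebook.
    # 256 "tokens", each standing for a 16-dimensional pixel patch.
    codebook = rng.uniform(0, 255, size=(256, 16))

    image = rng.uniform(0, 255, size=(64, 16))   # 64 patches of 16 pixels each

    # Encode: each patch becomes the id of its nearest codebook entry,
    # turning the image into a short token sequence, like text.
    dists = np.linalg.norm(image[:, None, :] - codebook[None, :, :], axis=-1)
    tokens = dists.argmin(axis=1)

    # Decode: look the tokens back up. The result is close, never exact.
    reconstruction = codebook[tokens]
    print("mean per-pixel error:", np.abs(image - reconstruction).mean())  # > 0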

This leads to incredibly efficient, dense semantic consistency, because every object in an image is essentially recreated from (intuitively) an entire chapter of a book dedicated to describing that object's features.

However, it loses direct pixel reference. For some things that doesn't matter much, but humans are very discerning regarding faces.

ChatGPT is architecturally unable to exactly reproduce the input pixels: they're always encoded into tokens, then decoded. This matters most for subjects where we're sensitive to detail loss, like faces.

  • Encoding/decoding tokens doesn't automatically mean lossy. Images, at least in terms of raw pixels, can be a very inefficient way of storing information from an information-theoretic perspective.

    Now, the difficulty is in achieving an encoding/decoding scheme that is both information-efficient AND semantically coherent in latent space. There seems to be a tradeoff here, as the quick check below suggests.
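
    Both halves of that are easy to demonstrate: lossless coding exists precisely because raw pixels are redundant, and it only pays off when the data is far from maximal entropy. A quick illustration, with zlib standing in for any lossless coder:

      import zlib
      import numpy as np

      rng = np.random.default_rng(0)

      # A smooth gradient "image": highly redundant, like most natural images.
      smooth = np.tile(np.arange(256, dtype=np.uint8), (256, 1)).tobytes()

      # Pure noise: near-maximal entropy, essentially incompressible.
      noise = rng.integers(0, 256, size=(256, 256), dtype=np.uint8).tobytes()

      # Lossless: decompression recovers the input exactly, so encoding != lossy.
      print(len(smooth), "->", len(zlib.compress(smooth)))  # shrinks dramatically
      print(len(noise), "->", len(zlib.compress(noise)))    # barely shrinks
      assert zlib.decompress(zlib.compress(smooth)) == smooth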

I've noticed that OpenAI modifies faces on a regular basis. I was using it to try to create examples of different haircuts, and the face would randomly turn into a different face: similar, but noticeably changed. Even when I prompted it not to modify the face, it would do so regardless. Perhaps this is part of their "safety" for modifying pictures of people?

I had thought it was a deliberate choice to avoid potential abuse; however, Sora put an end to that line of thinking.