
Comment by strix_varius

10 days ago

This is inherent in the architecture of ChatGPT. It's a unified model: text, images, etc. all become tokenized input. It's similar to re-encoding your image in a lossy format; the format is just the black box of ChatGPT's latent space.

This yields incredibly dense semantic consistency, because every object in an image is essentially recreated from what is, intuitively, an entire chapter of a book dedicated to describing that object's features.

However, it loses direct pixel reference. For some things that doesn't matter much, but humans are very discerning regarding faces.

ChatGPT is architecturally unable to reproduce the input pixels exactly: they're always encoded into tokens, then decoded. This matters most for subjects where we are sensitive to detail loss, like faces.
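A toy sketch of why the round trip is lossy, assuming a VQ-style tokenizer (ChatGPT's actual encoder is unknown; the codebook, patch size, and image here are all made up for illustration): each image patch snaps to its nearest codebook entry, so decoding returns an approximation, not the original pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image tokenizer": map each 2x2 patch to the nearest entry in a
# small codebook (random here, learned in a real model), then decode by
# looking the token back up.
codebook = rng.random((16, 4))   # 16 tokens, each a flattened 2x2 patch
image = rng.random((4, 4))       # a tiny 4x4 "image"

# Encode: split into 2x2 patches, pick the nearest codebook entry per patch.
patches = image.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(4, 4)
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)    # 4 discrete tokens replace 16 pixels

# Decode: look each token up again.
decoded = codebook[tokens]

# The round trip is lossy: decoded patches only approximate the input.
print(np.allclose(patches, decoded))  # → False
```

The quantization step is the whole story: once a patch is replaced by a token ID, the exact pixel values are gone, and no decoder can recover them.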

Encoding/decoding tokens doesn't automatically mean lossy. Images, at least in terms of raw pixels, can be a very inefficient way of storing information from an information-theoretic perspective.
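A quick illustration of that information-theoretic point, using generic lossless compression as a stand-in for any encoding scheme (the repetitive "image" here is a made-up example): a redundant pixel buffer can be re-encoded far more compactly while remaining perfectly recoverable, so a smaller representation is not inherently lossy.

```python
import zlib

# A 256x256 8-bit "image" where every row is the same gradient: raw
# pixels take 65536 bytes, but the actual information content is tiny.
raw = bytes(range(256)) * 256

# Lossless re-encoding: far fewer bytes, same information.
packed = zlib.compress(raw, level=9)
print(len(raw), len(packed))  # compressed size is a small fraction of 65536

# Crucially, the round trip is exact — encoding does not imply loss.
assert zlib.decompress(packed) == raw
```

Whether a token-based encoder loses information depends on its capacity and objective, not on the mere fact that it encodes and decodes.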

Now, the difficulty is in achieving an encoding/decoding scheme that is both information-efficient and semantically coherent in latent space. There seems to be a tradeoff here.