Comment by rafaelero

9 months ago

They are very likely using VQVAE to create a dictionary of tokens and then just converting images into them with an encoder.
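(For illustration, here's a minimal sketch of that idea: encode the image down to a latent grid, snap each latent to its nearest codebook entry, and use the entry indices as the tokens. The toy encoder, grid size, and codebook size below are all made up for the example, not anything OpenAI has confirmed.)

```python
import torch
import torch.nn as nn

class ToyVQTokenizer(nn.Module):
    """Sketch: encode an image to a grid of latents, then snap each latent
    to its nearest codebook entry; the entry indices are the 'tokens'."""
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        # Toy encoder: downsample a 64x64 RGB image to an 8x8 grid of latents.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images):                        # (B, 3, 64, 64)
        z = self.encoder(images)                      # (B, dim, 8, 8)
        z = z.permute(0, 2, 3, 1)                     # (B, 8, 8, dim)
        # Nearest codebook entry per spatial position (L2 distance).
        d = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
        return d.argmin(dim=-1).view(z.shape[:-1])    # (B, 8, 8) token ids

tok = ToyVQTokenizer()
ids = tok(torch.randn(2, 3, 64, 64))
print(ids.shape)  # torch.Size([2, 8, 8]) -> 64 image tokens per image
```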

Why is this not the top comment? FAIR published their CM3Leon paper about decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For the audio tokens, they probably trained an RVQ-VAE model like EnCodec or SoundStream.
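(If it helps, here's a minimal sketch of the residual vector quantization idea behind those codecs: each stage quantizes the residual the previous stages left behind, so every audio frame becomes one token per codebook level. The codebooks and sizes below are random placeholders, not EnCodec's or SoundStream's actual parameters.)

```python
import torch

def rvq_encode(x, codebooks):
    """Sketch of residual vector quantization (the 'RVQ' in RVQ-VAE):
    stage k quantizes what stages 0..k-1 failed to capture."""
    tokens, residual = [], x
    for cb in codebooks:                        # cb: (codebook_size, dim)
        d = torch.cdist(residual, cb)           # distance to every entry
        idx = d.argmin(dim=-1)                  # nearest entry per frame
        tokens.append(idx)
        residual = residual - cb[idx]           # pass the leftover onward
    return torch.stack(tokens, dim=-1)          # (frames, levels)

# Toy usage: 10 audio frames, 3 RVQ levels with 256 entries each (made up).
frames = torch.randn(10, 64)
codebooks = [torch.randn(256, 64) for _ in range(3)]
print(rvq_encode(frames, codebooks).shape)  # torch.Size([10, 3])
```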

Wouldn't that be more applicable to image generation, or at least to encoding the image as a whole?

If you need to be able to reason about multiple objects in the image and their relative positions, then don't you need to use a tiled approach?

  • VQVAE is trained to reconstruct the image, so in theory it should contain all the information (both content and location) inside its embeddings.
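(A tiny sketch of why a separate tiling step shouldn't be necessary: the tokens already live on a fixed spatial grid, so a token's position in the raster-ordered sequence tells you which image region it describes. The 8x8 grid is just an example size.)

```python
GRID = 8  # e.g. an 8x8 latent grid, so 64 tokens per image

def token_position(seq_index, grid=GRID):
    """Recover the (row, col) grid cell of a raster-ordered image token."""
    return divmod(seq_index, grid)

# Token 0 came from the top-left patch, token 63 from the bottom-right,
# so relative object positions survive in the token sequence itself.
print(token_position(0))   # (0, 0)
print(token_position(63))  # (7, 7)
```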