Comment by HarHarVeryFunny

9 months ago

Wouldn't that be more applicable to image generation, or at least to cases where you want to encode the image as a whole?

If you need to be able to reason about multiple objects in the image and their relative positions, then don't you need to use a tiled approach?

A VQ-VAE is trained to reconstruct the image, so in theory its embeddings should contain all the information (both content and location).
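To make the "location" part concrete, here is a minimal NumPy sketch of the quantization step only, not a full VQ-VAE. It assumes the encoder outputs a spatial grid of feature vectors, and the `quantize` helper, the two-entry codebook, and the feature values are all made up for illustration. Each grid cell is mapped to its nearest codebook entry, so the resulting index grid still records where each code occurs.

```python
import numpy as np

def quantize(features, codebook):
    """Map each spatial feature vector to the index of its nearest codebook entry.

    features: (H, W, D) array, codebook: (K, D) array. Returns an (H, W) int array.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d)  # (H*W, D)
    # Squared Euclidean distance between every feature and every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(h, w)

# Toy setup (all values hypothetical): 2 codes, 4x4 latent grid.
codebook = np.array([[0.0, 0.0],   # code 0: "background"
                     [1.0, 1.0]])  # code 1: "object"
features = np.zeros((4, 4, 2))
features[0, 3] = [0.9, 1.1]  # an object-like feature in the top-right cell
features[2, 1] = [1.1, 0.8]  # another one in row 2, column 1

codes = quantize(features, codebook)
# Grid positions (row, col) where the "object" code was selected.
object_cells = [tuple(int(v) for v in rc) for rc in np.argwhere(codes == 1)]
print(codes)
print(object_cells)
```

The index grid keeps one code per spatial cell, so relative position is at least available to a downstream model. Whether that model actually uses it well for reasoning about multiple objects is a separate question.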