Comment by HarHarVeryFunny
9 months ago
Wouldn't that be more applicable to image generation, or at least to cases where you want to encode the image as a whole?
If you need to be able to reason about multiple objects in the image and their relative positions, then don't you need to use a tiled approach?
A VQ-VAE is trained to reconstruct the image, so in theory its embeddings should contain all the information (both content and location).
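
To make the "content and location" point concrete, here is a rough sketch of the quantization step only, assuming the encoder outputs an H x W grid of feature vectors. The `quantize` helper, the toy codebook, and the hand-built feature grid are all made up for illustration and are not taken from any particular VQ-VAE implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each spatial feature vector to the index of its nearest codebook entry.

    features: (H, W, D) encoder output (assumed layout)
    codebook: (K, D) learned embedding table
    returns:  (H, W) grid of integer code indices
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d)
    # Squared Euclidean distance from every grid cell to every code: (H*W, K)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(h, w)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # toy codebook: 8 codes of dimension 4

# Toy "encoder output": a 3x3 grid that is code 0 everywhere,
# except for a different "object" (code 5) in the top-right cell.
features = np.tile(codebook[0], (3, 3, 1))
features[0, 2] = codebook[5]

indices = quantize(features, codebook)
print(indices)
```

The index grid keeps the same H x W layout as the feature map, so which code is present (content) and which cell it sits in (location) are both recoverable from the tokens. Whether a downstream model can actually reason over that layout is a separate question.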