Comment by in-silico
17 hours ago
Sufficiently informative latents can be decoded into video.
So when you simulate a stream of those latents, the whole rollout can be decoded into a video.
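For concreteness, a minimal sketch of that rollout-then-decode loop might look like the following (PyTorch; `world_model`, `decoder`, and the latent shapes are hypothetical stand-ins, not anyone's actual API):

```python
import torch

def rollout_to_video(world_model, decoder, z0, num_frames):
    """Autoregressively simulate latents, then map each one to pixel space.

    world_model: callable that predicts the next latent from the current one (assumed)
    decoder:     callable that maps a latent to an image tensor (assumed)
    z0:          initial latent state
    """
    frames = []
    z = z0
    for _ in range(num_frames):
        z = world_model(z)        # predict the next latent state
        frame = decoder(z)        # reconstruct a (C, H, W) image from that latent
        frames.append(frame)
    return torch.stack(frames)    # (T, C, H, W) video tensor
```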
If you were trying to make an impressive demo for the public, you probably would decode them into video, even if the real applications don't require it.
Converting the latents to pixel space also makes them compatible with existing image/video models and multimodal LLMs, which (without specialized training) can't interpret the latents directly.
But once you're doing that specialized training, you're training another model on top of the first, and it becomes clear you might as well have made one model from the start!