Comment by in-silico
17 hours ago
Sufficiently informative latents can be decoded into video.
So when you simulate a stream of those latents, the whole rollout can be decoded into a video.
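For concreteness, a minimal sketch of that rollout-then-decode loop might look like the following (PyTorch; `world_model`, `decoder`, and the latent shapes are hypothetical stand-ins, not anyone's actual API):

```python
import torch

def rollout_to_video(world_model, decoder, z0, num_frames):
    """Autoregressively simulate latents, then map each one to pixel space.

    world_model: callable that predicts the next latent from the current one (assumed)
    decoder:     callable that maps a latent to an image tensor (assumed)
    z0:          initial latent state
    """
    frames = []
    z = z0
    for _ in range(num_frames):
        z = world_model(z)        # predict the next latent state
        frame = decoder(z)        # reconstruct a (C, H, W) image from that latent
        frames.append(frame)
    return torch.stack(frames)    # (T, C, H, W) video tensor
```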
If you were trying to make an impressive demo for the public, you probably would decode them into video, even if the real applications don't require it.
Converting the latents to pixel space also makes them compatible with existing image/video models and multimodal LLMs, which (without specialized training) can't interpret the latents directly.
But once you're doing that specialized training, you're training another model on top of the first, and it becomes clear you might as well have made one model from the start!