Comment by andy12_
3 hours ago
Because the general idea here is that image and video models, when scaled way up, can generalize like text models did[1], and eventually be treated as "world models"[2]: models that can accurately simulate real-world processes. These "world models" could then be used to train embodied agents with RL in a scalable way[3] (a toy sketch of that loop follows the links below). The video-slop and image-slop generators are just a way to take advantage of the current research on world models and get more out of it.
[1] https://arxiv.org/pdf/2509.20328
[2] https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...
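To make the "train embodied agents with RL inside a world model" point concrete, here is a minimal, purely illustrative sketch. Everything in it is an assumption for illustration: the `LearnedWorldModel` class, its toy linear dynamics, and the crude random-search policy update are hypothetical stand-ins, not the method from the linked papers. The only point it demonstrates is that once a model can predict next-state and reward, the RL loop can run entirely on imagined rollouts instead of real-world interaction.

```python
# Hypothetical sketch: RL against a learned world model instead of the real
# environment. The model, dynamics, and policy search here are toy assumptions.
import random

class LearnedWorldModel:
    """Stand-in for a large video/world model: given a state and an action,
    it predicts the next state and a reward, so rollouts never touch the
    real world."""
    def step(self, state, action):
        # Toy dynamics: the "world" nudges the state toward the action.
        next_state = 0.9 * state + 0.1 * action + random.gauss(0, 0.01)
        reward = -abs(next_state)          # reward for keeping the state near 0
        return next_state, reward

def imagined_return(model, policy_gain, horizon=20):
    """Roll the policy out inside the world model and sum predicted rewards."""
    state, total = random.uniform(-1, 1), 0.0
    for _ in range(horizon):
        action = -policy_gain * state      # linear policy: push state back toward 0
        state, reward = model.step(state, action)
        total += reward
    return total

# Crude policy search driven entirely by imagined rollouts: this is the
# "scalable" part of the argument, since no real-world samples are needed
# per policy update.
model, gain = LearnedWorldModel(), 0.0
for _ in range(200):
    candidate = gain + random.gauss(0, 0.1)
    if imagined_return(model, candidate) > imagined_return(model, gain):
        gain = candidate
print(f"learned policy gain ≈ {gain:.2f}")
```

In a real setup the world model would be the scaled video model itself and the policy update would be an actual RL algorithm, but the division of labor is the same: the world model supplies cheap rollouts, and the agent is optimized against them.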