Comment by andy12_
3 hours ago
Because the general idea here is that image and video models, when scaled way up, can generalize like text models did[1], and eventually be treated as "world models"[2]: models that can accurately simulate real-world processes. These "world models" could then be used to train embodied agents with RL in a scalable way[3] (a toy sketch of that loop follows the links below). The video-slop and image-slop generators are just a way to take advantage of the current research on world models and get more out of it.
[1] https://arxiv.org/pdf/2509.20328
[2] https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...
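To make the "train embodied agents with RL inside a world model" point concrete, here is a minimal, purely illustrative sketch. Everything in it is an assumption for illustration: the `LearnedWorldModel` class, its toy linear dynamics, and the crude random-search policy update are hypothetical stand-ins, not the method from the linked papers. The only point it demonstrates is that once a model can predict next-state and reward, the RL loop can run entirely on imagined rollouts instead of real-world interaction.

```python
# Hypothetical sketch: RL against a learned world model instead of the real
# environment. The model, dynamics, and policy search here are toy assumptions.
import random

class LearnedWorldModel:
    """Stand-in for a large video/world model: given a state and an action,
    it predicts the next state and a reward, so rollouts never touch the
    real world."""
    def step(self, state, action):
        # Toy dynamics: the "world" nudges the state toward the action.
        next_state = 0.9 * state + 0.1 * action + random.gauss(0, 0.01)
        reward = -abs(next_state)          # reward for keeping the state near 0
        return next_state, reward

def imagined_return(model, policy_gain, horizon=20):
    """Roll the policy out inside the world model and sum predicted rewards."""
    state, total = random.uniform(-1, 1), 0.0
    for _ in range(horizon):
        action = -policy_gain * state      # linear policy: push state back toward 0
        state, reward = model.step(state, action)
        total += reward
    return total

# Crude policy search driven entirely by imagined rollouts: this is the
# "scalable" part of the argument, since no real-world samples are needed
# per policy update.
model, gain = LearnedWorldModel(), 0.0
for _ in range(200):
    candidate = gain + random.gauss(0, 0.1)
    if imagined_return(model, candidate) > imagined_return(model, gain):
        gain = candidate
print(f"learned policy gain ≈ {gain:.2f}")
```

In a real setup the world model would be the scaled video model itself and the policy update would be an actual RL algorithm, but the division of labor is the same: the world model supplies cheap rollouts, and the agent is optimized against them.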