← Back to context

Comment by N_Lens

2 days ago

According to the paper the image models can 'recognize' and track objects in videos. There are a lot of emergent properties in both diffusion models and LLMs that don't align with simplistic descriptions such as 'next token predictor'. It's not surprising to me that 'diffusing' mass amounts of image data leads to semantic developments and the emergence of recognition.