Comment by johnb231

9 months ago

The latest models are natively multimodal. Audio, video, images, and text are all tokenised and interpreted by the same model.
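
To make "tokenised into the same model" a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the vocabulary sizes, the offsets, and the function names (`tokenise_text`, `tokenise_image`, `tokenise_audio`, `build_sequence`) are hypothetical, and real models use learned tokenisers rather than raw bytes or pixel values. The only point is that each modality ends up as integer IDs in one shared vocabulary, so a single sequence model can read them all.

```python
# Toy illustration only. Real multimodal models use learned tokenisers;
# all names, sizes, and offsets below are hypothetical.

TEXT_VOCAB = 256   # one token ID per UTF-8 byte
IMAGE_VOCAB = 256  # one token ID per 8-bit pixel intensity
AUDIO_VOCAB = 256  # one token ID per quantised sample level

# Each modality gets its own slice of one shared ID space.
TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB


def tokenise_text(text: str) -> list[int]:
    """Map each UTF-8 byte to an ID in the text slice."""
    return [TEXT_OFFSET + b for b in text.encode("utf-8")]


def tokenise_image(pixels: list[int]) -> list[int]:
    """Map each 8-bit pixel value (0..255) to an ID in the image slice."""
    return [IMAGE_OFFSET + p for p in pixels]


def tokenise_audio(samples: list[float]) -> list[int]:
    """Quantise samples in [-1.0, 1.0] to 256 levels in the audio slice."""
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        level = int((clipped + 1.0) / 2.0 * (AUDIO_VOCAB - 1))
        tokens.append(AUDIO_OFFSET + level)
    return tokens


def build_sequence(text: str, pixels: list[int], samples: list[float]) -> list[int]:
    """Concatenate all modalities into one token sequence for one model."""
    return tokenise_text(text) + tokenise_image(pixels) + tokenise_audio(samples)


if __name__ == "__main__":
    print(build_sequence("hi", [0, 255], [-1.0, 1.0]))
```

The resulting list is what a single transformer would consume. Nothing downstream needs to know which modality a token came from beyond its ID range.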
