Comment by viraptor
1 year ago
> That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer.
This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.
They are almost certainly tokenized in most LLM multi-modal models. https://en.wikipedia.org/wiki/Large_language_model#Multimoda...
Ah, an overloaded meaning of "tokenizer": "split into discrete tokens" vs "turned into embeddings that sit alongside the text tokens". I've never heard it used the second way before, but it kinda makes sense.
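To make the distinction concrete, here's a toy sketch (not any specific model's code, weights are random) of the ViT-style approach many multimodal LLMs use: the image is cut into fixed-size patches and each patch is linearly projected into the same embedding space the text tokens occupy. The patches act as "tokens" in the sequence, but there's no discrete vocabulary lookup the way there is for text.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    return (image
            .reshape(h // patch, patch, w // patch, patch, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, patch * patch * c))  # (num_patches, patch*patch*c)

def embed_patches(patches, d_model=64):
    """Linearly project each patch into the shared latent space.

    Random projection here; a real model learns these weights."""
    w = rng.normal(size=(patches.shape[1], d_model))
    return patches @ w  # (num_patches, d_model)

image = rng.normal(size=(32, 32, 3))      # toy 32x32 "RGB" image
visual_tokens = embed_patches(patchify(image))
print(visual_tokens.shape)                # 4 patch embeddings of dim 64
```

Note there's no `vocab[token_id]` step anywhere: each "visual token" is a continuous vector, which is why calling this a "tokenizer" reads oddly if you're used to the text sense of the word.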