Comment by nl
9 hours ago
Isn't this just an awkward way of adding an extra layer to the NN, except without end-to-end training?
Models like Stable Diffusion do something similar using CLIP embeddings. It works, and it's an easy way to benefit from CLIP's pre-training. But for a language model it would seemingly make more sense to just add the extra layer.
I mean, this is exactly what it is: just a wrapper to replace the tokenizer. That is exactly how LLMs can read images.
I'm just focusing on different parts of the problem.
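To make the "wrapper that replaces the tokenizer" point concrete, here is a minimal sketch in the spirit of LLaVA-style multimodal models: image features get projected into the same space the tokenizer's embeddings live in and are simply prepended to the text embeddings. The module names, dimensions, and placeholder encoders below are hypothetical, not anyone's actual implementation.

```python
# Minimal sketch: image features projected into the LLM's token-embedding space
# (LLaVA-style). All dimensions and modules here are illustrative placeholders.
import torch
import torch.nn as nn

VISION_DIM = 768    # output width of a frozen image encoder (e.g. a CLIP ViT)
LLM_DIM = 4096      # token-embedding width of the language model
VOCAB_SIZE = 32000

# Frozen pieces assumed to already exist; stand-ins with random weights here.
image_encoder = nn.Linear(3 * 224 * 224, VISION_DIM)   # placeholder for a real ViT
token_embedding = nn.Embedding(VOCAB_SIZE, LLM_DIM)    # the LLM's input embeddings

# The only new trainable part: the "extra layer" that maps image features into
# the embedding space the tokenizer normally feeds.
projector = nn.Linear(VISION_DIM, LLM_DIM)

def build_input_embeddings(image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Return a sequence the LLM can consume directly:
    projected image features prepended to ordinary token embeddings."""
    img_feats = image_encoder(image.flatten(start_dim=1))   # (B, VISION_DIM)
    img_embeds = projector(img_feats).unsqueeze(1)          # (B, 1, LLM_DIM)
    txt_embeds = token_embedding(token_ids)                 # (B, T, LLM_DIM)
    return torch.cat([img_embeds, txt_embeds], dim=1)       # (B, 1 + T, LLM_DIM)

# Usage: one fake image plus a short token sequence.
embeds = build_input_embeddings(torch.randn(1, 3, 224, 224), torch.tensor([[1, 42, 7]]))
print(embeds.shape)  # torch.Size([1, 4, 4096])
```

In real systems the encoder emits many patch embeddings rather than one vector, and typically only the projector (and sometimes the LLM) is trained while the vision encoder stays frozen, which is where the "no end-to-end training" observation comes from.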