
Comment by nl

9 hours ago

Isn't this just an awkward way of adding an extra layer to the NN, except without end-to-end training?

Models like Stable Diffusion do something similar with CLIP embeddings. It works, and it's an easy way to benefit from CLIP's pre-training. But for a language model it would seemingly make more sense to just add the extra layer.

I mean, this is exactly what it is: just a wrapper that replaces the tokenizer. That's also exactly how LLMs read images: a vision encoder's outputs get projected into the LM's token-embedding space by a small adapter, bypassing the text tokenizer for the image part.
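
A minimal sketch of that idea in PyTorch, roughly the LLaVA-style adapter recipe (all dimensions, names, and shapes here are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, not tied to any specific model.
ENC_DIM = 768   # output width of a frozen encoder (e.g. a CLIP-style image encoder)
LM_DIM = 4096   # hidden width of the language model's token embeddings

class EmbeddingAdapter(nn.Module):
    """Projects frozen-encoder features into the LM's token-embedding space,
    so they can be spliced into the sequence in place of tokenizer output."""
    def __init__(self, enc_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, lm_dim)

    def forward(self, enc_features: torch.Tensor) -> torch.Tensor:
        # enc_features: (batch, num_patches, enc_dim)
        return self.proj(enc_features)  # -> (batch, num_patches, lm_dim)

adapter = EmbeddingAdapter(ENC_DIM, LM_DIM)
image_feats = torch.randn(1, 256, ENC_DIM)  # stand-in for frozen encoder output
text_embeds = torch.randn(1, 32, LM_DIM)    # stand-in for tokenizer -> embedding output

# Prepend the projected "image tokens" to the ordinary text embeddings.
# This concatenated tensor goes straight into the transformer; for the image
# part the tokenizer never runs, which is the "wrapper" described above.
inputs_embeds = torch.cat([adapter(image_feats), text_embeds], dim=1)
```

This works because most transformer implementations already accept embeddings directly rather than token ids (e.g. the `inputs_embeds` argument in Hugging Face's transformers), so nothing inside the model itself has to change.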

I'm just focusing on different parts of it.