Comment by iknownothow

9 months ago

I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you look up its embedding in the table.
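
As a minimal sketch of what I mean by a lookup table (PyTorch, with made-up sizes), the tokenizer's ids just index rows of a trainable matrix:

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50_000, 768          # illustrative sizes only
    embedding = nn.Embedding(vocab_size, embed_dim)

    token_ids = torch.tensor([[15496, 995, 0]])  # ids a tokenizer might emit
    input_embeddings = embedding(token_ids)      # shape (1, 3, 768): one row per token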

The author is speculating about an embedding model, but in reality they're speculating about the image tokenizer.

If I'm not wrong, the text tokenizer Tiktoken has a dictionary size of 50k. The image tokenizer could have a very large or a very small dictionary. The 170 tokens this image tokenizer generates might actually contain repeats!

EDIT: PS. What I meant to say was that input embeddings do not come from another trained model. Tokens come from other trained models. The input embedding matrix undergoes backpropagation (learning). This is very important: it allows the model to move the embeddings of the tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
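
To make that concrete, here's a tiny sketch (PyTorch, with a toy loss standing in for the real model) showing that the lookup table itself is a parameter and the looked-up rows receive gradients:

    import torch
    import torch.nn as nn

    embedding = nn.Embedding(50_000, 768)
    print(embedding.weight.requires_grad)        # True: the table itself is learned

    token_ids = torch.tensor([[1, 2, 3]])
    loss = embedding(token_ids).sum()            # stand-in for the real model/loss
    loss.backward()
    print(embedding.weight.grad[1:4].abs().sum() > 0)  # the looked-up rows now have nonzero gradients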

I've pondered it a bit more, and I was the one who was mistaken. I think the author made great observations. It's just that I don't want to go back to non-token thinking. I don't want there to be a 13x13xE final output from the CNN; I really want there to be a visual vocabulary from which tokens are chosen. And I want that visual vocabulary to be fixed/untrainable/deterministic. That'd be very cool.

But why only choose 13x13 + 1? :(

I'm willing to bet that the author's conclusion that the embeddings come from a CNN is wrong. I cannot get the 13x13 + 1 observation out of my head, though. They've definitely hit on something there. I'm with them that there is very likely a CNN involved, and I'm going to put my bet on the final filters and kernels being the visual vocabulary.

And how do you go from 50k convolutional kernels (think of them as tokens) to always exactly 170 chosen tokens for any image? I don't know...
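
Purely speculative sketch of one way it could work (none of these sizes are known): a CNN emits a 13x13 grid of feature vectors, and each one gets snapped to its nearest entry in a fixed codebook, so every image yields 169 ids (repeats allowed) plus one special token:

    import torch

    codebook_size, embed_dim = 50_000, 768            # assumed sizes
    codebook = torch.randn(codebook_size, embed_dim)  # the hypothetical visual vocabulary

    feature_map = torch.randn(13, 13, embed_dim)      # pretend CNN output
    patches = feature_map.reshape(-1, embed_dim)      # 169 patch vectors

    distances = torch.cdist(patches, codebook)        # (169, 50_000)
    image_token_ids = distances.argmin(dim=1)         # nearest codebook entry per patch
    tokens = torch.cat([torch.tensor([0]), image_token_ids])  # +1 special token = 170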

Input embeddings are taken from a dictionary in the case of text tokens, but they don't need to be; they can be any vector, really.

  • But don't input embeddings need to undergo backprop during training? Won't the external model's embeddings just be noise, since they don't share an embedding space with the model that is being trained?

    If the external model is also trained along with the main model, then I think that might work.
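
    A hedged sketch of that idea (made-up shapes, with a Linear standing in for the external encoder): project the external features into the model's embedding space, mix them with the text-token embeddings, and let gradients flow into both parts, so the image vectors stop being noise:

      import torch
      import torch.nn as nn

      embed_dim = 768
      text_embedding = nn.Embedding(50_000, embed_dim)   # the LM's lookup table
      image_encoder = nn.Linear(1024, embed_dim)         # stand-in for a CNN + projection

      text_ids = torch.tensor([[10, 20, 30]])
      image_features = torch.randn(1, 169, 1024)         # pretend external-model features

      text_vecs = text_embedding(text_ids)               # (1, 3, 768)
      image_vecs = image_encoder(image_features)         # (1, 169, 768)
      inputs = torch.cat([image_vecs, text_vecs], dim=1) # one mixed input sequence

      loss = inputs.sum()                                # stand-in for the real loss
      loss.backward()                                    # gradients reach both parts
      print(image_encoder.weight.grad is not None)       # True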