Comment by iknownothow

9 months ago

I've pondered a bit more about it and I was the one who was mistaken. I think the author made great observations. It's just that I don't want to go back to non-token thinking. I don't want there to be a 13x13xE final output from the CNN. I really want there to be a visual vocabulary from which tokens are chosen. And I want this visual vocabulary to be fixed/untrainable/deterministic. That'd be very cool.

But why only choose 13x13 + 1? :(

I'm willing to bet that the author's conclusion that the embeddings come straight out of a CNN is wrong. However, I cannot get the 13x13 + 1 observation out of my head. They've definitely hit on something there. I'm with them that there is very likely a CNN involved, and I'm going to put my bet on the final layer's kernels being the visual vocabulary.

And how do you go from 50k convolutional kernels (think tokens) to always 170 (i.e. 13x13 + 1) chosen tokens for any image? I don't know...
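To make the idea concrete, here's a rough sketch of what I'm imagining (all the shapes, the embedding dim E, and the random codebook are made up for illustration): vector-quantize the CNN's 13x13xE feature map against a fixed 50k-entry visual vocabulary, and you land on exactly 169 patch tokens, plus one global token for the full 170.

```python
import numpy as np

# Hypothetical shapes: a CNN's final 13x13 feature map with E-dim
# embeddings, and a fixed, untrainable codebook of 50k "visual words".
E, VOCAB = 64, 50_000
rng = np.random.default_rng(0)

feature_map = rng.standard_normal((13, 13, E))  # pretend CNN output
codebook = rng.standard_normal((VOCAB, E))      # fixed visual vocabulary

# Flatten the grid into 169 patch vectors, then snap each one to its
# nearest codebook entry (plain nearest-neighbor vector quantization).
patches = feature_map.reshape(-1, E)  # (169, E)
# Squared distances via the expansion |p - c|^2 = |p|^2 - 2 p.c + |c|^2,
# which avoids materializing a (169, 50000, E) array.
dists = (
    (patches ** 2).sum(axis=1, keepdims=True)
    - 2 * patches @ codebook.T
    + (codebook ** 2).sum(axis=1)
)                                     # (169, VOCAB)
token_ids = dists.argmin(axis=1)      # (169,) token index per patch

# Prepend one global/[CLS]-style token to get the "13x13 + 1" count.
tokens = np.concatenate(([0], token_ids))
print(tokens.shape)  # (170,)
```

That gets you a deterministic vocabulary lookup, but note it always emits all 169 patch positions; it doesn't explain how a model would *select* a variable subset down to a fixed 170 from 50k candidates, which is the part I can't square.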