Comment by cs702

9 months ago

One possibility is that mapping an image into the model's embedding space consumes ~170x more compute and space than looking up a single token ID.
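
Purely as back-of-the-envelope accounting (the dimensions here are illustrative assumptions, not anything OpenAI has disclosed), the space side of that claim is just:

    d_model = 768                        # illustrative embedding width
    token_space = d_model                # a token ID costs one embedding row
    image_space = 170 * d_model          # an image billed at ~170 tokens
    print(image_space / token_space)     # 170.0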

Another possibility is that OpenAI is mapping each image to ~170 vectors in an embedding space that is shared with token IDs. If that's the case, the architecture of the image-to-fixed-number-of-tokens model has not been disclosed. It could be a standard CNN, a ViT-like model, an autoencoder, a model that routes a variable number of vectors with RGB data to a fixed number of vectors, or something else that has not yet been published. The whole thing is likely trained end-to-end.
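
For concreteness, here's a minimal sketch of what the ViT-like variant could look like: cut the image into a fixed patch grid and linearly project each patch into the shared embedding space. Everything here (224x224 input, 16x16 patches, the 196-patch count, d_model = 768) is an assumption for illustration, not a disclosed detail:

    import numpy as np

    # Illustrative dimensions, not OpenAI's: a 224x224 RGB image cut into
    # 16x16 patches yields 196 patches -- the same order of magnitude as ~170.
    IMG, PATCH, D_MODEL = 224, 16, 768
    N_PATCHES = (IMG // PATCH) ** 2                  # 196

    rng = np.random.default_rng(0)
    # A random stand-in for a projection that would be learned end-to-end.
    W_proj = rng.normal(0, 0.02, (PATCH * PATCH * 3, D_MODEL))

    def image_to_vectors(img):
        """Map an (IMG, IMG, 3) image to a fixed (N_PATCHES, D_MODEL) matrix
        of vectors living in the same space as token embeddings."""
        patches = (img.reshape(IMG // PATCH, PATCH, IMG // PATCH, PATCH, 3)
                      .transpose(0, 2, 1, 3, 4)      # group pixels by patch
                      .reshape(N_PATCHES, -1))
        return patches @ W_proj                      # one vector per patch

    img = rng.random((IMG, IMG, 3))
    print(image_to_vectors(img).shape)               # (196, 768)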

At some point we're going to go from tokens to embeddings for everything. I've seen some research on variable-length embeddings; I wouldn't be surprised if someone generated a huge embedding space, ran some form of PCA on the generated embeddings, threw away the low-eigenvalue components, and then trained a distilled model that generates variable-length embeddings directly.
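
Here's a rough sketch of just the PCA step in that idea, assuming you already have a matrix of generated embeddings; the distilled model that would learn to emit truncated embeddings directly is omitted:

    import numpy as np

    rng = np.random.default_rng(0)
    E = rng.normal(size=(10_000, 768))       # stand-in for a huge set of generated embeddings

    # PCA via eigendecomposition of the covariance: keep only the components
    # whose eigenvalues account for, say, 99% of the variance.
    Ec = E - E.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Ec.T @ Ec / (len(Ec) - 1))
    order = np.argsort(eigvals)[::-1]        # eigh returns ascending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    keep = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.99)) + 1
    E_reduced = Ec @ eigvecs[:, :keep]       # low-eigenvalue directions thrown away
    print(E_reduced.shape)                   # (10000, keep), keep <= 768

With isotropic random data like this, keep stays near 768; on real embeddings, the eigenvalue spectrum usually decays fast enough that most dimensions can be dropped.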

  • > At some point we're going to go from tokens to embeddings for everything.

    Yes, I agree.

    Further down the road, I imagine we will end up finding interesting connections to the symbolic approaches of GOFAI, given that the embedding of a token, object, concept, or other entity in some vector space is basically a kind of symbol that represents that token, object, concept, or entity in that vector space.

    Interestingly, old terms like "representation" and "capsule," which didn't become as widely adopted as "embedding," tried more explicitly to convey this idea of using vectors/matrices of feature activations to stand in for objects, concepts, and other entities.

    For example, see Figure 1 in this paper from 2009-2012: http://www.cs.princeton.edu/courses/archive/spring13/cos598C... -- it's basically what we're talking about!