I think that would have pretty serious implications for the transformer architecture though. If images aren't embedded like text tokens, how would attention and the rest of the architecture work? And what about a conversation with multiple images back and forth? Not to mention that GPT-4o now has audio support as well. I would assume they do become tokens.
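For what it's worth, the common pattern (e.g. ViT-style encoders feeding a language model) is to project image patches into the same embedding space as text tokens, so attention just sees one mixed sequence. Here's a rough, hypothetical sketch of that idea -- sizes and layer names are made up for illustration, not what OpenAI actually does:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration.
d_model, vocab_size, patch_dim = 64, 1000, 3 * 16 * 16

text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> vectors
patch_proj = nn.Linear(patch_dim, d_model)       # image patches -> same vector space
attention  = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# A toy "conversation": 10 text tokens, then 9 image patches, then 5 more text tokens.
text_a = text_embed(torch.randint(0, vocab_size, (1, 10)))
image  = patch_proj(torch.randn(1, 9, patch_dim))   # e.g. a 48x48 image cut into 16x16 patches
text_b = text_embed(torch.randint(0, vocab_size, (1, 5)))

# Once everything lives in the same d_model space it's all just one token sequence;
# attention doesn't care which positions came from text and which from an image.
sequence = torch.cat([text_a, image, text_b], dim=1)  # shape: (1, 24, d_model)
out, _ = attention(sequence, sequence, sequence)
print(out.shape)  # torch.Size([1, 24, d_model])
```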
They address this question in the article.