Comment by dannyw

9 months ago

Perhaps images aren’t tokens at all… and 170 tokens is just an approximation of the compute cost.

I think that would have pretty serious implications for the transformer architecture, though. If images aren't embedded like text tokens, how would attention, etc. work? And how would a conversation with multiple images going back and forth work? Not to mention that GPT-4o now has audio support as well. I would assume images do become tokens.
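
To make "images become tokens" concrete, here is a toy sketch in the style of ViT patch embeddings. It is only an illustration of the idea: nothing here is known about how GPT-4o actually works, and the sizes (`D_MODEL`, `PATCH`), the function names, and the random weights are all made up for the example.

```python
import random

D_MODEL = 8  # hypothetical embedding width shared by text and image tokens
PATCH = 4    # hypothetical patch side length, in pixels


def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def image_to_tokens(image, proj):
    """Cut a 2-D grayscale image into PATCH x PATCH tiles and
    linearly project each flattened tile to a D_MODEL vector."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - PATCH + 1, PATCH):
        for left in range(0, width - PATCH + 1, PATCH):
            flat = [image[top + r][left + c]
                    for r in range(PATCH) for c in range(PATCH)]
            tokens.append([sum(x * w for x, w in zip(flat, row))
                           for row in proj])
    return tokens


def text_to_tokens(token_ids, table):
    """Look up one D_MODEL vector per text token id."""
    return [table[i] for i in token_ids]


rng = random.Random(0)
embedding_table = make_matrix(10, D_MODEL, rng)  # toy vocabulary of 10 ids
patch_proj = make_matrix(D_MODEL, PATCH * PATCH, rng)

image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 -> 4 patches

before = text_to_tokens([1, 2, 3], embedding_table)
picture = image_to_tokens(image, patch_proj)
after = text_to_tokens([4, 5], embedding_table)

# One interleaved sequence: attention would see image patches as ordinary positions.
sequence = before + picture + after
print(len(sequence), "positions, each of width", len(sequence[0]))
```

Under this assumption, a multi-image conversation is just a longer sequence with several runs of patch vectors in it, and a fixed per-image token count would fall out of a fixed number of patches per tile.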