Comment by tantalor

2 years ago

> CLIP embeds the entire image as a single vector, not 170 of them.

Single token?

> GPT-4o must be using a different, more advanced strategy internally

Why?

1 comment

tantalor

The embeddings do not offer the level of fidelity needed to recognize fine details in an image, handwriting for example.