Comment by bryan0
2 days ago
The Matryoshka embeddings seem interesting:
> The Gemini embedding model, gemini-embedding-001, is trained using the Matryoshka Representation Learning (MRL) technique which teaches a model to learn high-dimensional embeddings that have initial segments (or prefixes) which are also useful, simpler versions of the same data. Use the output_dimensionality parameter to control the size of the output embedding vector. Selecting a smaller output dimensionality can save storage space and increase computational efficiency for downstream applications, while sacrificing little in terms of quality. By default, it outputs a 3072-dimensional embedding, but you can truncate it to a smaller size without losing quality to save storage space. We recommend using 768, 1536, or 3072 output dimensions. [0]
Looks like even the 256-dim embeddings perform really well.
[0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-for...
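If you want to try the output_dimensionality parameter mentioned in that quote, the call looks roughly like this — a sketch assuming the current google-genai Python SDK and an API key in the environment, so treat the exact field names as per the docs linked above:

    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the Gemini API key from the environment

    # Ask for a 768-dim Matryoshka prefix instead of the full 3072 dims.
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents="The quick brown fox jumps over the lazy dog.",
        config=types.EmbedContentConfig(output_dimensionality=768),
    )
    vector = result.embeddings[0].values
    print(len(vector))  # 768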
The Matryoshka trick is really neat - there's a good explanation here: https://huggingface.co/blog/matryoshka
I've seen it in a few models now - Nomic Embed 1.5 was the first https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
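The practical upshot of the trick is that you can also just slice a prefix off a full embedding yourself and re-normalize it. A minimal numpy sketch (nothing model-specific, the random vectors are stand-ins for real embeddings):

    import numpy as np

    def truncate_embedding(vec, dims):
        """Keep the first `dims` values of a Matryoshka embedding, re-normalized to unit length."""
        prefix = np.asarray(vec, dtype=np.float32)[:dims]
        return prefix / np.linalg.norm(prefix)

    # Cosine similarity of two truncated, unit-length vectors is then just a dot product.
    full_a, full_b = np.random.randn(3072), np.random.randn(3072)
    a256, b256 = truncate_embedding(full_a, 256), truncate_embedding(full_b, 256)
    print(float(a256 @ b256))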
OpenAI did it a few weeks earlier when they released text-embedding-3-large, right?
Huh, yeah you're right: that was January 25th 2024 https://openai.com/index/new-embedding-models-and-api-update...
Nomic 1.5 was February 14th 2024: https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
Does OpenAI's text-embedding-3-large or text-embedding-3-small embedding model have the Matryoshka property?
They do; they just don't advertise it well (and only confirmed it in a footnote after criticism of the omission): https://openai.com/index/new-embedding-models-and-api-update...
> Both of our new embedding models were trained with a technique that allows developers to trade-off performance and cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter. For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.
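For anyone who wants to use it, that dimensions parameter maps straight onto the embeddings endpoint. Roughly, assuming the current openai Python SDK:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Request a 256-dim embedding directly instead of the model's full size.
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input="The quick brown fox jumps over the lazy dog.",
        dimensions=256,
    )
    print(len(response.data[0].embedding))  # 256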
It's interesting, but the improvement they're claiming isn't that groundbreaking.
Google teams seem to be in love with that Matryoshka tech. I wonder how far that scales.
It's a practical feature. Scaling isn't really the question here: the technique only applies up to the full embedding length, and the useful truncation points come at a handful of nested prefix sizes rather than arbitrary dimensions.