Hi HN, I'm Jack, the last author of this paper. It feels good to release this, the fruit of a two-year quest to "align" two vector spaces without any paired data. It's fun to look back a bit and note that at least two people told me this wasn't possible:
1. An MIT professor who works on similar geometry alignment problems didn't want to work on this with me because he was certain we would need at least a little bit of paired data
2. A vector database startup founder who told me about his plan to randomly rotate embeddings to guarantee user security (and ignored me when I said it might not be a good idea)
The practical takeaway is something that many people already understood, which is that embeddings are not encrypted, even if you don't have access to the model that produced them.
As one example, in the Cursor security policy (https://www.cursor.com/security#codebase-indexing) they state:
> Embedding reversal: academic work has shown that reversing embeddings is possible in some cases. Current attacks rely on having access to the model [...]
This is no longer the case. Since all embedding models are learning ~the same thing, we can decode any embedding vectors, given we have at least a few thousand of them.
I hate to be "reviewer 2", but:
I used to work on what your paper calls "unsupervised transport", that is, machine translation between two languages without alignment data. You note that this field has existed since ~2016 and you provide a number of references, but you dedicate only ~4 lines of text to this branch of research. There's no discussion of why your technique differs from this prior work, or why the prior algorithms can't be applied to the output of modern LLMs.
Naively, I would expect off-the-shelf embedding alignment algorithms (like <https://github.com/artetxem/vecmap> and <https://github.com/facebookresearch/fastText/tree/main/align...>, neither of which are cited or compared against) to work quite well on this problem. So I'm curious if they don't or why they don't.
I can imagine there is a lot of room for improvement around implicit regularization in the algorithms. Specifically, these algorithms were designed with word2vec output in mind (typically 300-dimensional vectors with ~200,000 observations), but your problem has higher-dimensional vectors with fewer observations, so it would likely require different hyperparameter tuning. IIRC, there's no explicit regularization in these methods, but hyperparameters like step size/step count can implicitly add L2 regularization, which you probably need for your application.
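To illustrate the implicit-regularization point, here is a toy numpy sketch (mine, not from the paper or any of the alignment codebases): gradient descent on a least-squares fit of a linear map, where stopping early shrinks the solution much like an explicit ridge penalty would.

    import numpy as np

    # Toy illustration: early stopping of gradient descent on a least-squares
    # fit acts like L2 (ridge) regularization on the linear map W.
    rng = np.random.default_rng(0)
    n, d = 200, 256                               # few observations, high dimension
    X = rng.normal(size=(n, d))
    W_true = rng.normal(size=(d, d)) / np.sqrt(d)
    Y = X @ W_true + 0.1 * rng.normal(size=(n, d))

    def gd_fit(X, Y, lr=1e-3, steps=100):
        """Plain gradient descent on ||XW - Y||^2; fewer steps = more shrinkage."""
        W = np.zeros((X.shape[1], Y.shape[1]))
        for _ in range(steps):
            W -= lr * (X.T @ (X @ W - Y)) / len(X)
        return W

    W_early = gd_fit(X, Y, steps=50)     # strongly shrunk, like a large ridge penalty
    W_long = gd_fit(X, Y, steps=5000)    # approaches the unregularized solution
    print(np.linalg.norm(W_early), np.linalg.norm(W_long))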
---
PS.
I *strongly dislike* your name of vec2vec. You aren't the first/only algorithm for taking vectors as input and getting vectors as output, and you have no right to claim such a general title.
---
PPS.
I believe there is a minor typo with footnote 1. The note is "Our code is available on GitHub." but it is attached to the sentence "In practice, it is unrealistic to expect that such a database be available."
Hey, I appreciate the perspective. We definitely should cite both those papers, and will do so in the next version of our draft. There are a lot of papers in this area, and they're all a few years old now, so you might understand how we missed two of them.
We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the best result in most of our tables, so some of this is covered. A lot of these methods also require a seed dictionary, which we don't have in our case. That said, you're welcome to take any of these tools and plug them into our codebase; the results would definitely be interesting, although we expect the adversarial methods will still work best, as they do in the problem settings you mention.
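For anyone who wants to try this themselves, here is a rough sketch of the kind of POT baseline in question (the exact configuration used in the paper's experiments may differ; treat the function choices as illustrative). Gromov-Wasserstein compares the two sets purely through their intra-set distance geometry, so it needs no seed dictionary, but it produces a matching between the two sets rather than a translator for new embeddings.

    import numpy as np
    import ot  # Python Optimal Transport: https://pythonot.github.io/

    def gw_match(X, Y):
        """Match two unpaired embedding sets using only their intra-set distances.

        X: (n, d1) embeddings from model A; Y: (m, d2) embeddings from model B.
        Returns a soft coupling matrix of shape (n, m).
        """
        C1 = ot.dist(X, X)                       # pairwise distances within A's space
        C2 = ot.dist(Y, Y)                       # pairwise distances within B's space
        C1 /= C1.max()
        C2 /= C2.max()
        p, q = ot.unif(len(X)), ot.unif(len(Y))  # uniform weights over the two sets
        return ot.gromov.gromov_wasserstein(C1, C2, p, q, 'square_loss')

    # coupling = gw_match(emb_a, emb_b)          # emb_a, emb_b: hypothetical arrays
    # matches = coupling.argmax(axis=1)          # hard assignment from A items to B items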
As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
Naming things is hard. Note that the two alternative approaches you referenced are called "vecmap" and "alignment"; the complaint that you "aren't the first/only algorithm for ..." and "have no right to claim such a general title" could easily apply to them as well.
> I strongly dislike your name of vec2vec.
Imagine having more than a passing understanding of philosophy, and then reading almost any major computer science paper. By this "no right to claim" logic, I'd have you all on trial.
The problem solved in this paper is strictly harder than alignment. Alignment works with multiple, unmatched representations of the same inputs (e.g, different embeddings of the same words). The goal is to match them up.
The goal here is harder: given an embedding of an unknown text in one space, generate a vector in another space that's close to the embedding of the same text -- but, unlike in the word alignment problem, the texts are not known in advance.
Neither unsupervised transport, nor optimal alignment can solve this problem. Their input sets must be embeddings of the same texts. The input sets here are embeddings of different texts.
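In symbols (my paraphrase, not notation from the paper):

    % Alignment / matching: given embeddings of the SAME texts under two models,
    %   X = {f(t_1), ..., f(t_n)}  and  Y = {g(t_{\sigma(1)}), ..., g(t_{\sigma(n)})},
    % with an unknown correspondence \sigma, the goal is to recover \sigma.
    %
    % Translation (this paper): the two training sets contain DIFFERENT texts, and the
    % goal is a map T such that, for a new text t seen in neither set,
    %   T(f(t)) \approx g(t).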
FWIW, this is all explained in the paper, including in the abstract. The comparisons with optimal assignment explicitly note that it is an idealized pseudo-baseline, and in reality OA cannot be used for embedding translation (as opposed to matching, alignment, correspondence, etc.).
Hooray, finally we are getting the geometric analysis of embedding spaces we need. Information geometry and differential geometry are finally getting their moment in the sun!
I must admit reading the abstract made me think to myself that I should read the paper in skeptical mode.
Does this extend to being able to analytically determine which concepts are encodable in one embedding but not another? An embedding from a deft tiny stories LLM presumably cannot encode concepts about RNA replication.
Assuming that is true. If you can detect when you are trying to put a square peg into a round hole, does this mean you have the ability to remove square holes from a system?
Very fair!
> Does this extend to being able to analytically determine which concepts are encodable in one embedding but not another? An embedding from a deft tiny stories LLM presumably cannot encode concepts about RNA replication.
Yeah, this is a great point. We're mostly building off of prior work on the Platonic Representation Hypothesis (https://arxiv.org/abs/2405.07987). I think our findings go so far as to apply to large-enough models that are well-enough trained on The Internet. So, text and images. Maybe audio, too, if the audio is scraped from the Internet.
So I don't think your tinystories example qualifies for the PRH, since it's not enough data and it's not representative of the whole Internet. And RNA data is (I would guess) something very different altogether.
> Assuming that is true. If you can detect when you are trying to put a square peg into a round hole, does this mean you have the ability to remove square holes from a system?
Not sure I follow this part.
I read a lot of AI papers on arxiv, and it's been a while since I read one where the first line of the abstract had me scoffing and done.
> We introduce the FIRST method for translating text embeddings from one vector space to another without any paired data
(emphasis mine)
Nope. I'm not gonna do a literature search for you right now and find the references, but this is certainly not the first attempt to do unsupervised alignment of embeddings, text or otherwise. People were doing this back in ~2016.
Thank you for sharing! I have a question about embedding versioning/migration. I'm not sure if this research solves it?
Say I want to build an app with embedding/vector search. Currently, my embeddings are generated by model A, which is not open source. Later, I find a better embedding model B, and my new data will use model B. Since A and B are two different vector spaces, how can I migrate A to B, or how can I make vector search work without migrating A to B?
Can your research solve this problem? Also, if all embedding models are the same, is there a point in upgrading the model at all? Some must be better trained than others?
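Concretely, the workaround I can think of is to re-embed a sample of my own documents with both models and fit a simple map between the spaces; a rough sketch (everything here is hypothetical, not from the paper):

    import numpy as np

    # Hypothetical migration sketch: fit a ridge-regularized linear map from model
    # A's space to model B's space using a re-embedded sample of my own documents,
    # then apply it to the stored A vectors. Assumes I can still call both models
    # on the sample; how much quality the mapped vectors lose is an open question.
    def fit_linear_map(A_sample, B_sample, lam=1e-2):
        """Solve W = argmin ||A_sample @ W - B_sample||^2 + lam * ||W||^2."""
        d = A_sample.shape[1]
        return np.linalg.solve(A_sample.T @ A_sample + lam * np.eye(d),
                               A_sample.T @ B_sample)

    # A_sample, B_sample: embeddings of the same sample texts under models A and B
    # W = fit_linear_map(A_sample, B_sample)
    # migrated = stored_A_vectors @ W    # approximate B-space vectors for old data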
Does this result imply that if we had an LLM trained on a very large volume of only English data, and one trained only on a very large volume of data in another language, your technique could be used to translate between the two languages? Pretty cool. If we somehow came across a huge volume of text in an alien language, your technique could potentially translate their language into ours (although maybe the same could be achieved just by training a single LLM on both languages?).
> (although maybe the same could be achieved just by training a single LLM on both languages?).
Intuitively I assume this would work even better.
Doesn't the space of embeddings have some symmetries that, when applied, do not change the output sequence?
For example, a global rotation that leaves embedded-vector x embedded-vector dot products unchanged and changes query-vector x embedded-vector dot products in an equivariant way.
Yes. So the idea was that an orthogonal rotation will 'encrypt' the embeddings without affecting performance, since orthogonality preserves cosine similarity. It's a good idea, but we can un-rotate the embeddings using our GAN.
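A quick toy check of why such a rotation doesn't hurt retrieval at all (just an illustration, not from the paper): an orthogonal matrix preserves every dot product and cosine similarity.

    import numpy as np

    # Toy check: an orthogonal rotation preserves all pairwise cosine similarities,
    # so rotated embeddings search exactly as well as the originals.
    rng = np.random.default_rng(0)
    d = 64
    X = rng.normal(size=(100, d))                  # pretend these are embeddings
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
    X_rot = X @ Q

    def cosine_sims(M):
        Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
        return Mn @ Mn.T

    print(np.allclose(cosine_sims(X), cosine_sims(X_rot)))   # True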
> we can decode any embedding vectors, given we have at least a few thousand of them.
Do you think that could be sufficient to translate the Voynich manuscript?
Don't you mean "John" instead of "Jack"? :)
I am a curious amateur, so I may say something dumb, but: suppose you take a number of smaller embedding models, and one more advanced embedding model. Suppose, for a document, you convert each model's embeddings to their universal embedding representation and examine the universal embedding spaces.
On a per document basis, would the universal embeddings of the smaller models (less performant) cluster around the better model's universal embedding space, in a way suggestive that they are each targeting the 'true' embedding space, but with additional error/noise?
If so, can averaging the universal embeddings from a collection of smaller models effectively approximate the universal embedding space of the stronger model? Could you then use your "averaged universal embeddings" as a target to train a new embedding model?
Hey, I read the paper in detail and presented it to colleagues during our reading group.
I still do not understand exactly where D1L comes from in LGan(D1L, T(A1(u))). Is D1L simply A1(u)?
I also find that the mixed notation in figures 2 and 3 makes it tricky.
Would have loved more insights from the results in the tables.
And more results from inversion, on more than the Enron dataset, since that is one of the end goals, even if it reuses another method.
Thank you for the paper, very interesting!
The fact that embeddings from different models can be translated into a shared latent space (and back) supports the notion that semantic anchors or guides are not just model-specific hacks, but potentially universal tools. Fantastic read, thank you.
Given the demonstrated risk of information leakage from embeddings, have you explored any methods for hardening, obfuscating, or 'watermarking' embedding spaces to resist universal translation and inversion?
> Given the demonstrated risk of information leakage from embeddings, have you explored any methods for hardening, obfuscating, or 'watermarking' embedding spaces to resist universal translation and inversion?
No, we haven't tried anything like that. There's definitely a need for it. People are using embeddings all over the place, not to mention all of the other representations people pass around (kv caches, model weights, etc.).
One consideration is that there's likely going to be a tradeoff between embedding usefulness and invertibility. So if we watermark our embedding space somehow, or apply some other 'defense' to make inversion difficult, we will probably sacrifice some quality. It's not clear yet how much that would be.
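As a crude illustration of that kind of tradeoff (a toy sketch, not an experiment from the paper): treat additive noise as a stand-in 'defense' and watch nearest-neighbor retrieval degrade as the noise grows.

    import numpy as np

    # Toy sketch of the utility side of the tradeoff: a noise-based stand-in
    # "defense" hurts nearest-neighbor retrieval more as the noise level grows.
    rng = np.random.default_rng(0)
    n, d = 1000, 256
    docs = rng.normal(size=(n, d))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)    # pretend doc embeddings
    queries = docs + 0.05 * rng.normal(size=(n, d))        # slightly perturbed queries

    def top1_accuracy(index_vecs, qs):
        """Fraction of queries whose nearest indexed vector (by cosine) is their true doc."""
        iv = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
        qn = qs / np.linalg.norm(qs, axis=1, keepdims=True)
        return float(np.mean((qn @ iv.T).argmax(axis=1) == np.arange(len(qs))))

    for sigma in [0.0, 0.1, 0.3, 1.0]:
        defended = docs + sigma * rng.normal(size=docs.shape)   # "defended" index
        print(sigma, top1_accuracy(defended, queries))          # accuracy drops with sigma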
Are you continuing research? Is there somewhere we can follow along?
I don't see how the "different data" aspect is evidenced. If the "modality" of the data is the same, we're choosing a highly specific subset of all possible data -- and, in practice, radically more narrow than just that. Any sufficiently capable LLM is going to have to be trained on a corpus not-so-dissimilar to all the electronic texts that exist in the standard corpora used for LLM training.
The idea that a data set is "different" merely because it's some subset of this maximal corpus is a distinction without a difference. What isn't being proposed is, say, that training just on all the works of sci-fi leads to a zero-info translatable embedding space projectable onto all the works of horror, and the like (or, say, that English sci-fi can be bridged to Japanese sci-fi by way of an English-Japanese horror corpus).
The very objective of creating LLMs with useful capabilities entails an extremely similar dataset starting point. We do not have so many petabytes of training data here that there is any meaningful sense in which OpenAI uses "only this discrete subspace" and Perplexity "yet another". All useful LLMs sample roughly randomly across the maximal corpus we have to hand.
Thus the hype around there being a platonic form of how word tokens ought to be arranged seems wholly unevidenced. Reality has a "natural arrangement" -- this does not show that our highly lossy encoding of it in English has anything like a unique or natural correspondence. It has a circumstantial correspondence in "all recorded electronic texts", which are the basis for training all generally useful LLMs.
This arguably means that we could translate any unknown alien message if it is sufficiently long and not encrypted: 1) Create embeddings from the Alienese message. 2) Convert them to English.
Could we also use this to read the Voynich manuscript, by converting Voynichese into embeddings and embeddings into English text? Perhaps, though I worry the manuscript is too short for that.
Very cool! I've been looking for something like this for a while and couldn't find anyone doing it. I've been investigating a way to translate LoRAs between models and this seems like it could be a first step towards that.
Huh. So Plato was right. This has many implications for philosophy. Interestingly, the 12th-century Platonic-influenced Arab philosopher Ibn Arabi described methods of converting text to numbers (embeddings) and then performing operations on those numbers to yield new meanings (inference). A 12th-century LLM? His books are full of these kinds of operations (called Abjad math), which are a core part of his textual hermeneutics.
Can this be used to allow different embedding models to communicate with each other in embedding space?
Yes, you can definitely convert the outputs from one model to the space of another, and then use them.
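As a sketch of what that could look like downstream (the translator below is a hypothetical stand-in, not the paper's actual API): embed a query with model A, translate it into model B's space, and run nearest-neighbor search against an index built from B's embeddings.

    import numpy as np

    # Hypothetical cross-model retrieval sketch. `translate_a_to_b` stands in for
    # a trained vec2vec-style translator; it is not the paper's actual API.
    def search_b_index(query_emb_a, b_index_vectors, translate_a_to_b, k=5):
        """Embed with model A, translate into B's space, then cosine-search B's index."""
        q = translate_a_to_b(query_emb_a)
        q = q / np.linalg.norm(q)
        index = b_index_vectors / np.linalg.norm(b_index_vectors, axis=1, keepdims=True)
        scores = index @ q
        return np.argsort(-scores)[:k]   # indices of the top-k nearest B vectors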
Very cool! Do you think if an alien civilization created an embedding model for their alien corpus of text it would satisfy this?
This seems like a catastrophe in the wings for legal-services-RAG companies.