Text embeddings reveal almost as much as text

2 years ago (arxiv.org)

One classic estimate of the information content of English, from Claude Shannon, is 11.82 bits per word. So, very roughly, a 32-token text – the length they highlight to demonstrate their technique – should be representable in about 379 bits.

One of the embeddings they demonstrate their technique against is OpenAI's `text-embedding-ada-002`, which returns a 1,536-dimension representation where every dimension is a floating-point number.

So as a matter of theory, just in the sign-bits of those vectors, there's enough raw state to potentially encode 32-word texts, and more.

If those float dimensions are 4-byte floats, as is common, a single `text-embedding-ada-002` text vector takes (1536 * 4 bytes =) 6,144 bytes of storage (49,152 bits). While the dense/continuous nature of these values, and all the desirable constraints/uses packed into them, means you won't be getting that much precise/lossless text representation from the values, it's plenty spacious for capturing a pretty good compression of some pretty long texts.
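A quick back-of-the-envelope script for those numbers (this assumes the 11.82 bits/word figure above and float32 storage per dimension; both are assumptions carried over from the comment, not facts from the paper):

```python
import math

# Back-of-the-envelope capacity check (assumes ~11.82 bits/word and float32 storage).
BITS_PER_WORD = 11.82     # Shannon-style estimate for English
TEXT_WORDS = 32           # length highlighted in the paper
DIMS = 1536               # text-embedding-ada-002 output dimensions
BYTES_PER_FLOAT = 4       # float32

text_bits = TEXT_WORDS * BITS_PER_WORD        # ~378.2, i.e. ~379 whole bits
sign_bits = DIMS                              # one sign bit per dimension
vector_bits = DIMS * BYTES_PER_FLOAT * 8      # raw storage in the full vector

print(f"text:        ~{math.ceil(text_bits)} bits")
print(f"sign bits:    {sign_bits} bits")
print(f"full vector:  {vector_bits} bits ({vector_bits / text_bits:.0f}x the text)")
```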

The interesting thing here is how often short texts can be perfectly or nearly perfectly recovered via the authors' iterative method – even without that being an intended, designed-in capability of the text embedding.

You could also see the result as an example of the old saw, "compression is intelligence" (or vice-versa) – an insight which has motivated, among other things, the Hutter Prize.

Some interesting questions for future work on the same codebase could be:

• at what text length does accuracy peak, and at what length does it precipitously decline?

• are even the failures still reasonable summaries of the original text?

• if for some applications such recovery is undesirable – in many it's no problem – are there ways to change the models, with different/extra training/ablation/whatever, that retain other aspects of the embeddings' usefulness but impair verbatim text recovery?

  • Really insightful comment. You make a good point that if the embeddings were optimized for compression, we could probably store much longer texts; the interesting thing is that these embeddings are designed & trained to do something else (search / semantic similarity) yet we discover that they're still encoding the input pretty well.

    I really like your follow-up research questions, here are my thoughts in order:

    - I'm also curious at what point text becomes difficult to reconstruct. Ada claims to embed text up to 8192 tokens, which seems quite long to me. However, training a decoder that can decode text that long would take a lot of compute, so we haven't tried it yet.

    - This one we actually tested, and it's buried in the analysis section of our paper. The TLDR is that the minimum BLEU score for reconstruction is, I think, around 20, which is actually pretty high. So for all reconstructed inputs we get pretty close, but we miss some nouns, swap word order, and get some syntax slightly wrong, which drops BLEU from 100 to 20 on that specific input and kills our exact match for some long out-of-domain texts.

    - The question of whether designing privacy-preserving embeddings is even possible is also very compelling to me, and I don't know the answer. My instinct (and what some folks have indicated on Twitter in the last 24hr) is that it's really hard to preserve privacy and search performance at the same time, and there has to be some kind of tradeoff depending on how fuzzy you're willing to make the embeddings. But I'd love to be proved wrong on this.

    • With regard to privacy-preservation, there's also potential from the oblivious-retrieval/homomorphic-encryption/zk-proof/etc worlds that one might be able to offer a frozen model/embedding-set, and prove to clients that…

      (1) their text was faithfully converted to an embedding according to a disclosed technique & stable (but private) set of model-weights; and

      (2) the returned results are the top-N neighbor doc-IDs/docs by similarity

      without the client receiving the full model weights or even their text's precise embedding, and without the server seeing a cleartext version of the client's text or any usable info about what it contains.

    • > we could probably store much longer texts

      Or significantly reduce the # of dimensions used for an embedding.

First author here! I thought it was interesting that so many people thought this was obvious after-the-fact. What we're trying to convey is that although inversion turns out to be possible, it's not trivial.

Here are the three levels of inversion model described in the paper:

(1) Training a decoder to invert embeddings gives you some similar text, but only with some overlapping words.

(2) Training an encoder-decoder with the embedding projected up to a 'sequence' for the encoder works better, and gets more words back, but almost never gets the right text back exactly, especially when sequences are long.

(3) Training this fancy architecture as a 'correction' model that recursively corrects and re-embeds text works really well. It turns out that search is necessary for this kind of thing, and can get many texts back exactly.

A fourth option would be to simply scale the inversion model up by 10x or 100x, which would give us predictably better performance. We didn't try this because it's just not that interesting.
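To make level (3) a bit more concrete, here is a minimal sketch of the correct-and-re-embed loop, not the paper's released code: `embed`, `initial_inverter`, and `corrector` are hypothetical stand-ins for the embedding API and the trained models, and the real method searches over multiple candidate corrections rather than taking a single greedy step.

```python
import numpy as np

def invert_embedding(target_emb, embed, initial_inverter, corrector, num_steps=10):
    """Iteratively refine a text hypothesis so its embedding approaches target_emb.

    embed(text) -> vector, initial_inverter(vector) -> text, and
    corrector(target_vector, text, text_vector) -> text are hypothetical
    stand-ins for the embedding model and the trained inversion models.
    """
    hypothesis = initial_inverter(target_emb)   # a level-(1)/(2) style first guess
    best_text, best_sim = hypothesis, -1.0

    for _ in range(num_steps):
        hyp_emb = embed(hypothesis)             # re-embed the current guess
        sim = float(np.dot(target_emb, hyp_emb) /
                    (np.linalg.norm(target_emb) * np.linalg.norm(hyp_emb)))
        if sim > best_sim:
            best_text, best_sim = hypothesis, sim
        # The correction model conditions on the target embedding, the current
        # guess, and the guess's embedding, and proposes text that should
        # embed closer to the target.
        hypothesis = corrector(target_emb, hypothesis, hyp_emb)

    return best_text, best_sim
```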

  • Hey man, this is going to sound really weird, but when I just saw your name, it struck me as super familiar. Like I know I've seen your name on GitHub somewhere. I started going through your public repos and I figured it out!

    The first commit I ever made was on a fork of one of your projects over 5 years ago. I wasn't a software engineer then, I was just some guy learning to code in his free time.

    Anyways I just wanted to say thank you. I know it may seem silly. But your code gave me the scaffolding I needed to take some important first steps towards where I am today. I'm now a senior software engineer/team lead!

    If you were curious, here's the fork: https://github.com/brendenriggs/GottaSlackEmAll

    I've got to say: I don't play any more, but I definitely miss those early days when Pokémon Go first came out. Good times.

    • Wow, that’s really wild! Yeah, I remember this — 8 years ago Pokémon Go came out and I was working on some projects to reverse engineer the API. Now I’m reverse engineering neural networks. Go figure.


  • > A fourth option would be to simply scale the inversion model up by 10x or 100x, which would give us predictably better performance. We didn't try this because it's just not that interesting.

    Scaling is always both important and interesting.

    • To train a larger inversion model, I think we'll just need 16 A100s for a month or two. We can circle back in December, once the bigger model finishes training, to see that we've gotten better reconstruction performance. Fascinating!

  • Great paper! I have a question about how much this can be attributed to portions of text which GPT has already seen. Like, if the same approach was followed on a new sliver of Wikipedia which GPT had not seen, would the results be the same?

Why wouldn't they? If you think of embeddings as a learned hash, and your hash space is wide enough, the embedding would simply be another, lossless representation of the input. The challenge, of course, is that inverting hashes is difficult. Except in machine learning, the hashes aren't typically intended to be hard to invert; they're intended to preserve semantic and syntactic relationships, as word2vec famously demonstrated. And there are even text embeddings that use sub-word information like character n-grams, which can trivially represent rare words, like the kind embodied by personal information.

edit: Given the author agrees, I suppose the research question is how well and cheaply you can do it, across different embedding algorithms. For practitioners, the lesson is to treat embeddings like personally identifiable information, as the authors state in the conclusion.
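To illustrate the sub-word point: libraries like gensim's FastText build word vectors out of character n-grams, so even a token never seen in training still gets a vector. A toy sketch, where the two-sentence corpus is obviously a placeholder for real training data:

```python
from gensim.models import FastText

# Tiny placeholder corpus; real usage would train on a large corpus.
sentences = [
    ["the", "patient", "record", "was", "updated"],
    ["send", "the", "invoice", "to", "the", "customer"],
]

model = FastText(sentences, vector_size=32, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=50)

# "jsmith1984" never appears in the training data, but its character n-grams
# still yield a vector, which is why rare or personal tokens can leave a
# recoverable trace in sub-word embeddings.
vec = model.wv["jsmith1984"]
print(vec.shape)   # (32,)
```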

Diffusion-based image generation is a kind of ‘reversing the embedding’, right?

The iterative error correction applied here almost feels like repeated noise reduction, too.

So does this mean you could use this sort of text recovery to do the same kinds of things image diffusers do, traversing the latent space and recovering images along a path - but instead recovering text that morphs?

Could we finally use this to discover the text that lies exactly halfway between the Declaration of Independence and the lyrics of ‘Baby Got Back’?

I guess not until it scales to more than 32 tokens. But we can dream.

If you want to play with a demo of this kind of thing I suggest this Colab notebook: https://linus.zone/contra-colab - via https://twitter.com/thesephist/status/1711597804974739530

It demonstrates a neat model that's specifically designed to let you embed text, manipulate the embeddings (combine them, average them or whatever) and then turn them back into text again. It's fascinating.
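The "morphing" question upthread is, mechanically, just linear interpolation between two embeddings followed by inversion back to text. A sketch, where `embed` and `invert` are hypothetical stand-ins for whatever embedding model and inversion/decoder model you pair up (the demo model above, or the paper's inverter):

```python
import numpy as np

def morph(text_a, text_b, embed, invert, steps=5):
    """Walk a straight line between two text embeddings and decode each point.

    embed(text) -> np.ndarray and invert(vector) -> text are hypothetical
    stand-ins for an embedding model and an inversion model.
    """
    emb_a, emb_b = embed(text_a), embed(text_b)
    results = []
    for alpha in np.linspace(0.0, 1.0, steps):
        point = (1 - alpha) * emb_a + alpha * emb_b   # linear interpolation
        results.append(invert(point))                 # decode back to text
    return results

# The middle element of morph(declaration_text, baby_got_back_lyrics, embed, invert)
# would be the "halfway" text, assuming the inverter copes with points that sit
# off the manifold of real-text embeddings.
```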

During the Word2Vec era, I used to average a few word embeddings to get centroid embeddings. My observation was that the average embed was still close to all the original embeddings up to 5 concepts. I tested with similarity search. Can't pack too many distinct meanings into a single embed, but you can pack a few.

My only gripe with word embeds was that they were mixing synonymy and relatedness. Even worse, mixing up synonymy with antonymy - hot and cold are similar in a way, but also completely opposite.
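The centroid trick is just an element-wise mean plus cosine similarity; a minimal NumPy sketch, where the random vectors are placeholders for real Word2Vec vectors (e.g. gensim's `model.wv[word]`):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in practice these come from a trained Word2Vec model.
rng = np.random.default_rng(0)
words = {w: rng.standard_normal(300)
         for w in ["king", "queen", "castle", "throne", "crown"]}

centroid = np.mean(list(words.values()), axis=0)

# Check how close the averaged "centroid" stays to each constituent vector,
# the effect the comment describes holding for up to roughly 5 concepts.
for w, v in words.items():
    print(w, round(cosine(centroid, v), 3))
```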

What is a text embedding?

  • A vector representation of a text, based on its meaning, generated by an ML model. It's closely related to the vector representations LLMs work with internally, and it can be used to compare the meaning of texts by comparing the vectors with each other.

    • Is it as simple as: the Euclidean distance between the vectors in N-dimensional space is intended to approximate the difference in meaning between two texts?


I’m sorry but the abstract doesn’t say it: What is a text embedding?

  • It's a machine learning method for turning a sample of text into a fixed-length vector. We usually try to find embeddings with the property that samples of text with similar meanings are mapped to vectors that are close to one another.
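    If a concrete example helps, this is roughly what it looks like with the sentence-transformers library (the model name is just one common choice, not anything from the paper):

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")   # maps text to a 384-dim vector

    texts = [
        "The cat sat on the mat.",
        "A feline was resting on the rug.",
        "Quarterly revenue grew by 12 percent.",
    ]
    embeddings = model.encode(texts)                  # one fixed-length vector per text

    # Similar meanings land close together; unrelated meanings land far apart.
    print(util.cos_sim(embeddings[0], embeddings[1]))   # relatively high
    print(util.cos_sim(embeddings[0], embeddings[2]))   # relatively low
    ```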

Does this mean you can store somewhat lossy text/data only as embeddings? Or is that generally not such a good idea...

  • Yes, you can do this.

    This is how you would implement a "medium term" memory. Folks in the "sentence-transformer" world have known this forever, yet the wider NLP world ignores it in the context of "chatbots" despite how powerful that and related concepts like "soft-prompts/textual inversion" are.

    It's a wonderful technique and the fact that it's not used in ChatGPT and other tools like it is a shocking shame.
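    A minimal sketch of what that embedding-based memory might look like: store each snippet with its vector, recall the nearest ones per query, and feed only the recalled text back into the prompt. `embed_fn` here is a stand-in for whatever embedding model you use:

    ```python
    import numpy as np

    class EmbeddingMemory:
        """Toy medium-term memory: texts are stored with their embeddings and
        recalled by cosine similarity to a query."""

        def __init__(self, embed_fn):
            self.embed_fn = embed_fn     # any function mapping a string to a vector
            self.texts, self.vectors = [], []

        def remember(self, text):
            self.texts.append(text)
            self.vectors.append(self.embed_fn(text))

        def recall(self, query, k=3):
            q = self.embed_fn(query)
            sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                    for v in self.vectors]
            top = np.argsort(sims)[::-1][:k]
            return [self.texts[i] for i in top]

    # The recalled snippets would be prepended to the chatbot prompt: a lossy but
    # compact stand-in for replaying the full conversation history.
    ```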