Comment by esafak
2 years ago
Why wouldn't they? If you think of an embedding as a learned hash, and your hash space is wide enough, the embedding is simply another, lossless representation of the input. The challenge, of course, is that inverting hashes is usually difficult. But in machine learning the hashes are not designed to be hard to invert; they are designed to preserve semantic and syntactic relationships, as word2vec famously demonstrated. There are even text embeddings that use sub-word information such as character n-grams, which can trivially represent rare words, like the kind that carry personal information (a small sketch of that sub-word idea follows below).
edit: Given that the author agrees, I suppose the research question is how well and how cheaply you can do it across different embedding algorithms. For practitioners, the lesson is to treat embeddings as personally identifiable information, as the authors state in the conclusion.
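
Here is a minimal, hypothetical sketch of the sub-word idea in Python: a word vector is built by summing vectors for its character n-grams, so even a rare name that never appeared in training still gets a representation. The dimensions, the hashing scheme, and the random table standing in for learned n-gram vectors are all assumptions for illustration; this is not the paper's inversion method.

    # fastText-style sub-word embedding, illustrative only.
    import numpy as np

    DIM = 64            # toy embedding dimension
    BUCKETS = 2 ** 16   # hash buckets for character n-grams, as in fastText
    rng = np.random.default_rng(0)
    ngram_table = rng.normal(size=(BUCKETS, DIM))  # stand-in for learned vectors

    def char_ngrams(word, n_min=3, n_max=5):
        """Character n-grams of a word, with boundary markers."""
        w = f"<{word}>"
        return [w[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(w) - n + 1)]

    def embed(word):
        """Word vector = sum of its (hashed) character n-gram vectors."""
        idx = [hash(g) % BUCKETS for g in char_ngrams(word)]
        return ngram_table[idx].sum(axis=0)

    # Even a rare, out-of-vocabulary name gets a usable representation:
    print(embed("Kowalczyk").shape)        # (64,)
    print(char_ngrams("Kowalczyk")[:5])    # first few n-grams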
Agreed. Embeddings are pretty big: on the order of 1024 floats, i.e. roughly 32,000 bits. And language is really small: about 1 bit per character. So it's not at all crazy that embeddings can be lossless. The paper gives a practical method to recover the text and shows that it applies broadly. Here's a demo of how it works: https://twitter.com/srush_nlp/status/1712559472491811221
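
As a rough back-of-envelope (the 1024-dimensional float32 embedding and the ~1 bit/character estimate for English are illustrative assumptions, not figures from the paper):

    # Capacity of one embedding vs. entropy of a short text.
    embedding_bits = 1024 * 32   # assume 1024 float32 dims -> 32,768 bits
    text_bits = 500 * 1          # assume a 500-character passage at ~1 bit/char

    print(f"embedding capacity: {embedding_bits} bits")
    print(f"text entropy:       ~{text_bits} bits")
    print(f"headroom:           ~{embedding_bits // text_bits}x")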
(paper author)