
Comment by esafak

2 years ago

Why wouldn't they? If you think of an embedding as a learned hash, and your hash space is wide enough, the embedding is simply another, lossless representation of the input. The challenge, of course, is that inverting hashes is usually difficult. In machine learning, though, the hashes are not designed to resist inversion; they are designed to preserve semantic and syntactic relationships, as word2vec famously demonstrated. And there are even text embeddings that use sub-word information such as character n-grams, which can trivially represent rare words, like the kind found in personal information.
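
To make the point concrete, here is a minimal sketch (not any particular paper's method, and all names and parameters are illustrative assumptions): a toy fastText-style embedding built from hashed character n-grams, plus a naive nearest-neighbour "inversion" over a candidate list. Even a never-before-seen string like an email address gets a deterministic, information-bearing vector, and a simple similarity search leaks it back out.

```python
import zlib
import numpy as np

DIM = 64          # embedding width (arbitrary for the sketch)
BUCKETS = 10_000  # hash buckets for character n-grams
rng = np.random.default_rng(0)
BUCKET_VECS = rng.normal(size=(BUCKETS, DIM))  # one fixed random vector per bucket

def char_ngrams(text, n_min=3, n_max=5):
    # fastText-style sub-word features: character n-grams of the padded token.
    padded = f"<{text}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def embed(text):
    # Sum the hashed n-gram vectors, then L2-normalize.
    idxs = [zlib.crc32(g.encode()) % BUCKETS for g in char_ngrams(text)]
    v = BUCKET_VECS[idxs].sum(axis=0)
    return v / (np.linalg.norm(v) + 1e-9)

def invert(query_vec, candidates):
    # Naive inversion: return the candidate whose embedding has the highest
    # cosine similarity to the stored vector.
    embs = np.stack([embed(c) for c in candidates])
    return candidates[int(np.argmax(embs @ query_vec))]

if __name__ == "__main__":
    secret = "jane.doe@example.com"        # hypothetical PII-bearing input
    stored = embed(secret)                  # what a vector store would keep
    guesses = ["john.doe@example.com", "jane.doe@example.com", "hello world"]
    print(invert(stored, guesses))          # recovers "jane.doe@example.com"
```

Real embedding models are far less transparent than this toy, but the same candidate-ranking attack is the starting point for inversion work, which is why treating the vectors themselves as sensitive is the safe default.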

edit: Given that the author agrees, I suppose the research question is how well and how cheaply you can do it across different embedding algorithms. For practitioners, the lesson is to treat embeddings like personally identifiable information, as the authors state in the conclusion.