Comment by rkagerer

3 years ago

Not able to read the full paper at the moment, and confused about something:

If the server needs to go pull the article from Wikipedia, how is it blind to which one is being requested?

If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

The only way I understand this works is if the client itself seeded the encrypted data in the first place (or at least an encrypted index if all the server pushes back is article numbers).

Maybe I'm ignorant of something; if so thanks for ELI5.

> If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.

There is no usable information to retain.

So the question becomes: what can you calculate using arithmetic on secret numbers?

Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.

The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."

Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.
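The multiply-and-add arithmetic above can be sketched in plaintext (the real system does the same operations on ciphertexts the server cannot read; the numbers here are made up):

```python
articles = [1111, 2222, 3333]   # each article, treated as a number
query = [0, 1, 0]               # client wants the article at index 1

# Server-side arithmetic: multiply every article by the corresponding
# query bit, then add everything together. An all-zero query yields 0;
# exactly one bit set yields exactly the selected article.
result = sum(bit * article for bit, article in zip(query, articles))
print(result)  # → 2222
```

In the actual protocol the bits in `query` are each encrypted, so the server performs this same computation without ever learning which slot holds the 1.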

  • Thank you. I really wish the paper started with exactly what you wrote as the abstract.

  • Is revealing that you accessed something from the index, versus nothing, a leak of information?
  (i.e., is the fact that you retrieved something, rather than nothing, itself observable?)

    • The server can't tell if or which articles you accessed in a query.

      If the server has removed an article, then it's possible it would send back a broken response to you, but the server would not be able to tell if that happened. There would only be an information leak if you reacted to the broken response.

      If you mean observing that you made a query at all, versus not making a query, sure that's observable.

The server has a snapshot of Wikipedia, sitting in memory. The server is blind to which article is requested because it computes a homomorphic dot product between an encrypted one-hot vector (encrypted under a key that only the client knows) and the total set of articles in plaintext. The encrypted query reveals nothing about the requested index (i.e., it is semantically secure).

The 'magic' of homomorphic encryption is that the server is able to take an encrypted one-hot vector of bits, and a vector of plaintext articles, compute a homomorphic dot product, and produce a single ciphertext encoding the single desired plaintext article. This single output ciphertext is crucially also encrypted under the client's secret key, so it reveals nothing to the server.
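The encrypted-one-hot dot product can be demonstrated end to end with the additively homomorphic Paillier scheme. This is a toy sketch, not the scheme the demo uses (that is lattice-based FHE), and the tiny primes here are for illustration only, not security:

```python
import math, random

def next_prime(n):
    # naive trial division; fine for the small primes used here
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, math.isqrt(k) + 1))
    while not is_prime(n):
        n += 1
    return n

# --- toy Paillier keypair (tiny primes: demo only, NOT secure) ---
p, q = next_prime(10**6), next_prime(10**6 + 100)
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # with g = n + 1, g^m mod n^2 simplifies to 1 + m*n
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Client: wants article 1, sends an *encrypted* one-hot vector.
query = [encrypt(b) for b in (0, 1, 0)]

# Server: plaintext "articles" (just small numbers in this sketch).
articles = [111, 222, 333]

# Homomorphic dot product: multiplying ciphertexts adds the underlying
# plaintexts, and raising a ciphertext to a plaintext power multiplies
# them. The server never learns which slot of the query holds the 1.
result = 1
for c, a in zip(query, articles):
    result = result * pow(c, a, n2) % n2

print(decrypt(result))  # → 222
```

The server's loop touches every article identically; only the client, holding the private key, can see that the output decrypts to article 1.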

  • Thanks. I knew homomorphic encryption lets you do math on encrypted data without knowing its contents. But it hadn't occurred to me you can keep one side of the arithmetic in plaintext. It also helped when gojomo pointed out that every query touches every byte of the full dataset.

    So basically (if I've got this right and at the risk of over-simplifying): Client computes a bit mask and encrypts it, sends that to the server, and says "do a big multiplication of this against Wikipedia content". That leaves only the relevant portion in the encrypted result, which only the client can decrypt.

    How does the client know which bits in the initial "mask" (aka one-hot vector) to set? Does it have to download a full index of article titles? (So in a sense this is just a big retrieval system that pulls an array element based on its index, not a freeform search tool).

    • Yes, all of the mapping of article titles to index happens locally.

      If the index got too large, we could use a hash to bucket things, but this incurs some waste and would only support exact article title matches.

      Yes, it is definitely not a freeform search tool.
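The local title-to-index step with hash bucketing might look like this sketch (the bucket count and function names are made up for illustration):

```python
import hashlib

NUM_BUCKETS = 4096  # hypothetical bucket count

def bucket_index(title: str) -> int:
    # Hash the exact article title locally; the server never sees it.
    # Hashing only supports exact-title matches, not freeform search,
    # and distinct titles can collide into the same bucket (the "waste"
    # mentioned above: a bucket holds everything that hashes to it).
    h = hashlib.sha256(title.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % NUM_BUCKETS

idx = bucket_index("Alan Turing")  # this index seeds the one-hot query
```

The client then sets bit `idx` in its (encrypted) one-hot query vector.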

  • Don't you have to pad the output ciphertext size to match the largest article you could possibly request from the set of articles? Or is a fixed-size output an inherent property of homomorphic encryption schemes? Otherwise it seems to reveal something to the server just by the size of the ciphertext (since WP articles vary in size).

    • The output is always going to be fixed size.

      The nicer way to look at homomorphic encryption is like math, but the more literal way is like a bunch of logic gates. There are N output wires. You control the bits on the input wires. Nothing you do will change the number of wires.

    • Yes, every item needs to be the same size. We batch the articles into 100KB chunks, and we split articles larger than this into multiple chunks.
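The fixed-size batching described above might be sketched like this (100KB chunk size from the comment; the padding byte and function name are assumptions):

```python
CHUNK = 100 * 1024  # 100KB items, as described above

def to_chunks(article: bytes) -> list[bytes]:
    # Split an article into fixed-size chunks, zero-padding the last
    # one so every database item is exactly the same size. This is what
    # keeps the output ciphertext size independent of article length.
    chunks = [article[i:i + CHUNK] for i in range(0, len(article), CHUNK)] or [b""]
    chunks[-1] = chunks[-1].ljust(CHUNK, b"\0")
    return chunks
```

A 250KB article would become three 100KB items, with the third mostly padding; the client reassembles and strips the padding after decryption.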


The server already has a full copy of (its subset of) Wikipedia.

Every query touches every byte of that snapshot, in the same way... but the homomorphic-encryption math distills that down to just the fragment the client requested.

And it does so without giving even the CPU executing that math any hint of what bytes/ranges/topics survive the math. The much-smaller – but still, at every server step, encrypted – result is then sent to the client, which performs the decryption.

It's moon-math magic.