Comment by Labo333

3 years ago

I understand that you do some kind of dot product (with two steps, Regev and GSW). However, it looks to me like those steps involve fixed-dimension vectors.

- How do you handle variable-length data? Do you need to pad it?

- What is the memory overhead of the storage of encrypted data?

I think that, at least for video data, the streaming scheme "leaks" the size of the encrypted data via the number of streaming packets.

Yeah, every record needs to be the same size. For the demo, we batch the articles into 100KB chunks. We do a good job of packing: we put many small articles into a single 100KB chunk, and we split large articles into many 100KB chunks. The packing is pretty efficient; roughly 90% of the space is usefully filled rather than wasted.
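A minimal sketch (Python, purely illustrative) of the packing idea described above: small articles are greedily packed into fixed-size records, large articles are split across several records, and an index remembers where each piece landed. The 100KB record size matches the demo, but the function and variable names are assumptions, not the demo's actual code.

```python
RECORD_SIZE = 100 * 1024  # 100KB records, as in the demo


def pack(articles):
    """articles: iterable of (article_id, data) pairs, data being bytes.

    Returns (records, index): records is a list of RECORD_SIZE-byte blobs,
    index maps article_id -> list of (record_no, offset, length) pieces.
    """
    records, index = [], {}
    current = bytearray()

    def flush():
        # Pad the open record out to the fixed size and start a new one.
        records.append(bytes(current.ljust(RECORD_SIZE, b"\0")))
        current.clear()

    for article_id, data in articles:
        pieces = index.setdefault(article_id, [])
        # Large articles: close the open record, then peel off whole 100KB chunks.
        if len(data) >= RECORD_SIZE and current:
            flush()
        while len(data) >= RECORD_SIZE:
            pieces.append((len(records), 0, RECORD_SIZE))
            records.append(bytes(data[:RECORD_SIZE]))
            data = data[RECORD_SIZE:]
        # The remainder shares a record with other small articles.
        if len(current) + len(data) > RECORD_SIZE:
            flush()
        if data:
            pieces.append((len(records), len(current), len(data)))
            current += data

    if current:
        flush()
    return records, index
```

One consequence of this layout: the client needs some separate way to map an article to its record number(s) (e.g. a small downloadable index), because the PIR query itself only selects a record by position.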

The memory overhead is significant but not prohibitive… we have to keep an encoded database that is something like 5-8x larger in memory, but this overhead comes from encoding the plaintext in a special format that allows a fast dot product, not from inefficiency in the packing.

  • Is it not possible to determine which article(s) the user downloaded based on the memory locations read? Of course, multiple small articles within the same 100KB chunk can't be told apart, but for any medium-to-large article you'd be able to make a good guess (if there are only a handful of articles in that chunk) or an exact match (if there is <=1 article in that chunk), no?

    Or does the server go through a large chunk of its memory (say, at least a quarter of all of Wikipedia) and perform some oblivious computation on all of that data, reducing the result into this 100KB return buffer (roughly the shape sketched at the end of this comment)? That sounds very resource-intensive, at least for something as large as Wikipedia (a doctor's office with some information pages of a few KB each could do such a thing more easily).

    In the latter case, is each request unique (does it involve some sort of IV that the client can XOR out of the data again), or could an index be built, similar to a list of hashed PIN codes mapped back to plaintext numbers?

    Edit: I had already read some comments, but just two comments further down would have been my answer... :) https://news.ycombinator.com/item?id=31669924

    > One query for one item in the database is indistinguishable (without the client’s key) from another query for the same item later; in other words, it’s similar to something like the guarantee of CBC or GCM modes, where as long as you use it correctly, it is secure even if the attacker can see many encryptions of its choosing.

    That is some cool stuff indeed. I'm going to have to up my game when building or reviewing privacy-aware applications. Sure, a file sharing service is not going to practically allow this, but I'm sure that with this knowledge, I will come across places where it makes sense from both a usefulness (e.g. medical info) and practicality (data set size) perspective.
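    A toy sketch (Python) of the full-scan shape being discussed, purely to illustrate why the access pattern reveals nothing: the database is a list of fixed-size records, the query is a selection vector with a single 1, and the server combines every record into one record-sized answer. In the real scheme the selection vector is encrypted under Regev/GSW and the arithmetic is homomorphic, so the server never sees which entry is set; everything below is plaintext and only mimics the structure.

    ```python
    RECORD_SIZE = 100 * 1024  # one fixed-size record comes back per query


    def answer_query(records, selection):
        """records: equal-length byte strings; selection: 0/1 list with a single 1.

        In the real protocol `selection` would be a vector of ciphertexts, and the
        adds and multiplies below would be homomorphic operations on ciphertexts.
        """
        assert len(records) == len(selection)
        out = [0] * len(records[0])
        for rec, sel in zip(records, selection):   # every record is touched
            for i, byte in enumerate(rec):
                out[i] += sel * byte
        return bytes(out)


    # Fetch record 2 of 4; the server's work and memory accesses look the same
    # no matter which index the selection vector picks out.
    records = [bytes([i]) * RECORD_SIZE for i in range(4)]
    assert answer_query(records, [0, 0, 1, 0]) == records[2]
    ```

    And per the quoted comment, the query side is freshly encrypted each time, so two queries for the same record are indistinguishable and no "query → record" lookup table (the hashed-PIN-code idea) can be precomputed.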

    • > Sure, a file sharing service is not going to practically allow this

      Well, as the author points out here [0], it doesn't actually translate to exorbitant cost increases when almost all the cost is in bandwidth rather than compute. A file sharing service rather seems like an ideal example (assuming you can derive additional revenue from the privacy promises).

      [0]: https://news.ycombinator.com/item?id=31673122
