Show HN: Read Wikipedia privately using homomorphic encryption

3 years ago (spiralwiki.com)

Hi, creator here.

This is a demo of our recent work presented at Oakland (IEEE S&P): https://eprint.iacr.org/2022/368. The server and client code are written in Rust and available here: https://github.com/menonsamir/spiral-rs. The general aim of our work is to show that homomorphic encryption is practical today for real-world applications. The server we use to serve this costs $35/month!

A quick overview: the client uses homomorphic encryption to encrypt the article number that they would like to retrieve. The server processes the query and produces an encrypted result containing the desired article, and sends this back to the client, who can decrypt and obtain the article. A malicious server is unable to determine which article the client retrieved. All search and autocomplete is done locally. The technical details are in the paper, but the high-level summary is that the client creates a large one-hot vector of encrypted bits (0’s except for the index of the desired article, where they place a 1) and then the server computes something like a ‘homomorphic dot product’ between the query and the plaintext articles.
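
To make that concrete, here is a toy sketch in Rust. Plain integers stand in for ciphertexts, so it shows only the selection arithmetic; in the real scheme every query entry is encrypted, so the server cannot see where the 1 is:

    // Toy sketch: retrieval as a dot product with a one-hot query vector.
    // NOT encryption -- plain integers stand in for Regev/GSW ciphertexts.
    fn main() {
        let articles: [u64; 4] = [111, 222, 333, 444]; // each article as a number
        let desired = 2;
        // client: one-hot query (encrypted entry-by-entry in the real scheme)
        let query: Vec<u64> = (0..articles.len())
            .map(|i| (i == desired) as u64)
            .collect();
        // server: multiply each article by its query entry and sum everything
        let result: u64 = query.iter().zip(&articles).map(|(q, a)| q * a).sum();
        assert_eq!(result, 333); // only the selected article survives
    }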

I’d like to caveat that this is an in-browser demo to show it is practical to use homomorphic encryption at this scale. As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.

Happy to answer any questions!

This is the first thing out of homomorphic encryption I personally have seen that seems to be in the ballpark of practical usefulness, which is impressive. Have I missed out on any other such things of interest?

(And this is not a criticism; this is a compliment. You start so far behind the eight-ball with homomorphic encryption with regard to the resources it consumes that I wasn't convinced it was ever going to be even remotely useful for much of anything. Precisely because I was so skeptical, I am all the more impressed to see something work this well. It's not the fastest Wikipedia mirror, but... honestly... I've been on slower websites! Websites with far less excuse.)

  • There has recently been a lot of great work on making homomorphic encryption more practical for particular applications. We definitely stand on the shoulders of all that work!

    One reason homomorphic encryption has a reputation as absurdly slow is that people typically talk about “fully” homomorphic encryption, which essentially means that you can compute an arbitrarily large function on the encrypted data. This involves a very expensive process called bootstrapping, which incurs a cost of roughly 15 ms per binary operation evaluated. As you can imagine, that adds up: a function with a million binary gates would take over four hours.

    This work uses “leveled” homomorphic encryption, where we only perform a function of ‘bounded’ size (that is, the homomorphic dot product). So, we do not have to perform bootstrapping, and thus avoid some of the more extreme costs.

    The other reason this work is practical is that we ‘tailored’ our construction to the particular problem of private information retrieval. When people try to apply homomorphic encryption generically to their problem, they typically end up with disappointingly slow and expensive results. Cool work on better ‘FHE compilers’ is in progress, so hopefully that will also help!

    • > can compute an arbitrarily large function on the encrypted data

      What is the meaning of the word "large" here? How are we defining the size of a function?

      2 replies →

  • You missed the widely panned Apple iCloud child sexual abuse imagery detection feature. The private set intersection is basically doing homomorphic encryption. In raising some very valid policy critiques, people forget that it's actually a nifty piece of engineering. (This is not an endorsement of that feature.) https://www.apple.com/child-safety/pdf/Apple_PSI_System_Secu...

    I'm also working closely with a team at $WORK that's using a protocol very similar to Apple's, but not doing CSAM detection. We are seeing some severe pushback on this technology. I wouldn't be surprised if there are multiple homomorphic encryption based products at Big Tech that have yet to see the light of day.

Interesting! But it'd be helpful to further clarify the strength of the following claim:

> This demo allows private access to 6 GB (~30%) of English Wikipedia. In theory, even if the server is malicious, it will be unable to learn which articles you request. All article title searches are performed locally, and no images are available.

In this demo, the number of article titles is relatively small – a few million – and enumerable.

If the server is truly malicious, and it issues itself requests for every known title, does it remain true that this "Private Information Retrieval" (PIR) scheme still gives it no hints that subsequent requests from others for individual articles retrieve particular data?

(Presumably: every request touches every byte of the same full 6GB of data, and involves every such byte in constant-run-time calculations that vary per request, and thus have the effect of returning only what each request wanted – but in no way correlatable with other requests for the exact same article, from the same or different clients?)

  • Indeed! The encryptions of the client queries are fully semantically secure - under relatively solid lattice-based cryptographic assumptions, a server would need to do more than 2^128 work to recover the plaintext of someone’s query. One query for one item in the database is indistinguishable (without the client’s key) from another query for the same item later; in other words, it’s similar to something like the guarantee of CBC or GCM modes, where as long as you use it correctly, it is secure even if the attacker can see many encryptions of its choosing.

    • > without the client’s key

      Thank you, these 4 words really helped with my understanding, so I'm calling it out in case it helps others. So I was thinking: what prevents you from replaying the query and getting the same page back? But it seems the answer is: that would only produce a gibberish response, because you don't have the key.

    • How indistinguishable is a Not Found result by the server? It seems like user behavior would be to re-request the article a second time, so the client should probably protect the user against this kind of server (which could bisect on article popularity to find the requested article in ~20 tries) by throwing up a warning about an article in the index not being retrievable.

      3 replies →

    • Would this be vulnerable to a side-channel attack as follows?

      1. Record what item was retrieved from disk for a query

      2. Run a dictionary through the query system, and see which item matches the record

      4 replies →

Can this be applied usefully to non-public datasets?

Would it be feasible to add some other zero knowledge proof to this that would confirm a user has paid a subscription for access? For example, if this were a news site, the user would have to prove a valid subscription to read articles, but the site would not be able to know which articles any subscriber decided to read?

If that is possible, what could the site to to prevent a paying subscriber from sharing their access to an unreasonable number of others? Would it be possible to impose a rate limit per subscriber?

  • The simplest approach would be to just charge people per-query (or charge in levels, depending on the number of queries). This could be done in the standard way (you have to log in, pay the site, and then it gives you an API key or just session token, and logs how many queries you make). I think you can avoid having to use a ZKP this way, since that will make things much more complicated and possibly costly.
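
    A minimal sketch of that metering idea (the names and quota logic here are hypothetical, just to illustrate per-key query counting):

        use std::collections::HashMap;

        // Count queries per API key; refuse once the paid quota is used up.
        struct Meter {
            quota: u32,
            used: HashMap<String, u32>,
        }

        impl Meter {
            fn allow(&mut self, api_key: &str) -> bool {
                let n = self.used.entry(api_key.to_string()).or_insert(0);
                if *n < self.quota { *n += 1; true } else { false }
            }
        }

        fn main() {
            let mut m = Meter { quota: 2, used: HashMap::new() };
            assert!(m.allow("alice"));
            assert!(m.allow("alice"));
            assert!(!m.allow("alice")); // quota exhausted; time to charge again
        }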

  • I'm no cryptographer, but it seems to me you could implement this using the same algorithm as cloudfare for tor, which generates anonymous tokens from an adhoc webpage

In another comment you’ve said:

> With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)

Is this technique therefore practical only when the server side dataset is relatively small (or full scans for every query are tolerable)?

(edit: sorry, misattributed the quote)

  • Wasn’t me, but it was accurate!

    Indeed, except in some (exciting!) new theoretical constructions, server work is always linear in the database size.

    However, I’d emphasize that our work shows that the constant factor on this linear operation can be quite low. We process the database at 300MB/s to 1.9GB/s on a single core, which is fast enough for relatively large databases. Remember that the computation is embarrassingly parallel, so you can always throw more compute at larger databases. To summarize, we think the speed is now fast enough that it really can be feasible to scan the whole database to answer a single query.
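
    As a rough sketch of why adding cores helps (plain integer sums stand in for the ciphertext accumulation; the point is only the sharding pattern):

        use std::thread;

        // Shard the linear scan across cores; each shard computes a partial
        // "dot product" and the partials are summed at the end.
        fn main() {
            let db: Vec<u64> = (0..1_000_000).collect();
            let query: Vec<u64> = db.iter().map(|&i| (i == 123_456) as u64).collect();
            let cores = 6;
            let chunk = (db.len() + cores - 1) / cores;
            let result: u64 = thread::scope(|s| {
                db.chunks(chunk)
                    .zip(query.chunks(chunk))
                    .map(|(d, q)| s.spawn(move || d.iter().zip(q).map(|(a, b)| a * b).sum::<u64>()))
                    .collect::<Vec<_>>() // spawn all shards before joining any
                    .into_iter()
                    .map(|h| h.join().unwrap())
                    .sum()
            });
            assert_eq!(result, 123_456);
        }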

  • Clarification: that was my comment, not OP's. I'm not a cryptography expert, just an interested amateur. But my understanding is that O(n) query times are inevitable if you want information-theoretic security. Maybe it's possible to do better with a weaker security property.

    And there are clever ways you can make a system like this "scale" even if the overall dataset size is limited. For instance, the authors cite another interesting paper [1] that uses a similar technique to build a fully-private voice chat system. The basic idea seems to be that you build a "database" consisting of the most recent snippet of audio from every ongoing conversation, and let each client privately query for the segment that's addressed to it. And every fraction of a second when new audio data arrives, you just throw away the old database and build a new one, so the amount of data doesn't depend on the length of the conversation.

    Even if this doesn't scale to arbitrary numbers of users, it could still be used to run a "cell" of a few thousand people, in which it's not possible for an adversary to reconstruct communication patterns within the cell.

    [1]: https://www.usenix.org/conference/osdi21/presentation/ahmad

  • Maybe the I/O pattern could be hideable using confidential computing, like with a Nitro Enclave.

    • Maybe, but if you have a secure enclave that can be trusted not to leak data, then you don't really need PIR. You can just have clients encrypt their queries with a key that can only be decrypted by the code running in the enclave.

Could this be used for DNS?

  • This is a great idea, and we think it would be relatively practical assuming some aggressive caching. However, I couldn’t think of a threat model where this is useful, since presumably your ISP can in the end always see which sites you visit by simply reversing the IPs you connect to.

    Do you think that people would want private DNS? I suppose it would still be an improvement over what we have today, but I’m not sure that it will make it meaningfully harder for ISPs to collect and sell data to advertisers.

  • Yes, but it would be quite expensive infrastructure-wise. Also, I think the number of websites grows faster than processor I/O, meaning one would need more and more processors over time.

Last year, there was a detailed presentation with several speakers from CWI (Centrum Wiskunde & Informatica) in the Netherlands on the state of the art in Secure Multi-Party Computation for practical applications in healthcare, fighting financial crime, and machine learning. The recording is here (2.5 h): https://www.youtube.com/watch?v=gE7-S1sEf6Q

> A malicious server is unable to determine which article the client retrieved.

This sounds like magic :O. How does it behave when new articles (elements) are added, does it need to rebuild the whole database and distribute new parameters?

I wonder how practical it would be for clients to synchronize content without the server being able to deduce which synchronization state the client is at.

  • It does sound like magic! This is what got me into this field; it seems like something that should intuitively be impossible… but it’s not!

    Parameters only need to be changed based on the number of items in the database (not the content of the items). Also, they don’t really need to be modified as long as we are within the same power of two number of items. So, I think client and server agreement on parameters seems feasible. Right now, it’s just hard coded :)
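
    A hypothetical illustration of the power-of-two point (not the actual spiral-rs parameter logic):

        // Parameters only need to change when the item count crosses a
        // power of two, so key them to the next power of two.
        fn param_bucket(num_items: usize) -> usize {
            num_items.next_power_of_two()
        }

        fn main() {
            assert_eq!(param_bucket(300_000), 1 << 19);
            assert_eq!(param_bucket(500_000), 1 << 19); // same bucket: same parameters
            assert_eq!(param_bucket(600_000), 1 << 20); // crossed a power of two: new parameters
        }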

Does homomorphic in this case mean that I can edit the content of an article and have the diff directly applied to the ciphertext?

  • This has nothing to do with editing Wikipedia. The problem this demo is solving is "private information retrieval" -- that is, you send a query for a particular article, and the server sends back a response containing that data, but the server does not learn which article you asked for. Homomorphic encryption is just one of the building blocks used to achieve this.

    A trivial solution to this problem would be for the client to just download a copy of the entire Wikipedia database and pick out the particular article it's interested in. With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information) but the amount of information that needs to be sent to the client is much smaller.

    • Ah, I understand. I thought this was the usual presented use case of applying an operation to a ciphertext directly, and was confused since the title already stated otherwise.

What is the maximum throughput the server can maintain? Or, in other words, how much does it cost per query?

  • The $35/month server uses 6 cores to answer a single query in ~2.5-3 seconds. So it’s 0.33 QPS :-)

    Not high, which is why it might not be working well for folks right now…

    Time scales almost perfectly linearly with cores, since the computation is embarrassingly parallel.

    In terms of cost, we’re still talking only 18 CPU-seconds (6 cores × ~3 s) and ~300KB of outgoing bandwidth, which is not a ton at today’s prices.

    • It's embarrassingly parallel... but you also do N times more work than a non-homomorphic system, so that's not saying much!

      This doesn't seem like a particularly compelling application - can you give some practical problems that homomorphic encryption solves? I've always heard vote counting as the example.

      1 reply →

Extremely cool. Now we can serve content without any ability to observe what people are being served exactly. I was hoping that someday soon such technology could be used to serve search results and give us a truly private search engine experience.

Theoretically, can this scheme be turned into a generic O(N) key-value retrieval for non-static content (in this example — supporting adding, removing and replacing articles without re-encrypting the whole database and re-sending the client setup data)?

  • We never encrypt the database. Only the query is encrypted. The client setup data is only dependent on the client key and the size of the database (not the content). Adding and replacing articles can happen whenever the server wants, and clients do not need to "re-sync" or something like that.

    For arbitrary key-value retrieval, a hashing scheme would work pretty well, modulo some imbalances that will occur.
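
    As a sketch of that hashing idea (illustrative only: client and server would have to agree on a fixed hash, and std's DefaultHasher is not stable across Rust releases):

        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};

        // Map an arbitrary key to one of `num_buckets` fixed-size buckets;
        // the client PIR-fetches that whole bucket, then scans it locally.
        fn bucket_of(key: &str, num_buckets: u64) -> u64 {
            let mut h = DefaultHasher::new();
            key.hash(&mut h);
            h.finish() % num_buckets
        }

        fn main() {
            let bucket = bucket_of("Homomorphic_encryption", 4096);
            println!("PIR-fetch bucket {bucket}, then pick out the entry locally");
        }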

Not able to read the full paper at the moment, and confused about something:

If the server needs to go pull the article from Wikipedia, how is it blind to which one is being requested?

If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

The only way I understand this works is if the client itself seeded the encrypted data in the first place (or at least an encrypted index if all the server pushes back is article numbers).

Maybe I'm ignorant of something; if so thanks for ELI5.

  • > If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

    With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.

    There is no usable information to retain.

    So the question becomes: what can you calculate using arithmetic on secret numbers?

    Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.

    The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."

    Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.
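
    Here is a toy Regev-style (LWE) version of exactly that arithmetic, for the curious. The parameters are laughably insecure and the RNG is not cryptographic; it only demonstrates that "multiply ciphertexts by plaintext weights and add them up" decrypts to the selected item:

        const N: usize = 16;        // secret dimension (toy-sized)
        const DELTA: u64 = 1 << 56; // plaintext scale; modulus q = 2^64 implicitly

        // xorshift64: a toy RNG, NOT cryptographic
        struct Rng(u64);
        impl Rng {
            fn next(&mut self) -> u64 {
                self.0 ^= self.0 << 13;
                self.0 ^= self.0 >> 7;
                self.0 ^= self.0 << 17;
                self.0
            }
        }

        struct Ct { a: [u64; N], b: u64 } // an LWE ciphertext

        fn inner(a: &[u64; N], s: &[u64; N]) -> u64 {
            a.iter().zip(s).fold(0u64, |acc, (x, y)| acc.wrapping_add(x.wrapping_mul(*y)))
        }

        // Enc(m) = (a, <a,s> + e + DELTA*m) with uniform a and small noise e
        fn encrypt(s: &[u64; N], m: u64, rng: &mut Rng) -> Ct {
            let mut a = [0u64; N];
            for x in a.iter_mut() { *x = rng.next(); }
            let e = (rng.next() % 9).wrapping_sub(4); // noise in [-4, 4] mod 2^64
            let b = inner(&a, s).wrapping_add(e).wrapping_add(DELTA.wrapping_mul(m));
            Ct { a, b }
        }

        // Dec: round (b - <a,s>) to the nearest multiple of DELTA
        fn decrypt(s: &[u64; N], ct: &Ct) -> u64 {
            ct.b.wrapping_sub(inner(&ct.a, s)).wrapping_add(DELTA / 2) >> 56
        }

        // Server: sum of (plaintext article) * (encrypted query bit). The
        // server only does wrapping integer arithmetic; it never decrypts.
        fn dot_product(articles: &[u64], query: &[Ct]) -> Ct {
            let mut acc = Ct { a: [0u64; N], b: 0 };
            for (w, ct) in articles.iter().zip(query) {
                for i in 0..N {
                    acc.a[i] = acc.a[i].wrapping_add(ct.a[i].wrapping_mul(*w));
                }
                acc.b = acc.b.wrapping_add(ct.b.wrapping_mul(*w));
            }
            acc
        }

        fn main() {
            let mut rng = Rng(0xdeadbeef);
            let mut s = [0u64; N]; // client's secret key
            for x in s.iter_mut() { *x = rng.next(); }
            let articles = [11u64, 22, 33, 44]; // tiny "articles", one byte each
            let desired = 2;
            // client: encrypt the one-hot vector (1 at index 2, 0 elsewhere)
            let query: Vec<Ct> = (0..articles.len())
                .map(|i| encrypt(&s, (i == desired) as u64, &mut rng))
                .collect();
            // server: homomorphic dot product against the plaintext articles
            let reply = dot_product(&articles, &query);
            assert_eq!(decrypt(&s, &reply), 33); // gibberish to anyone without s
        }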

  • The server has a snapshot of Wikipedia, sitting in memory. The server is blind to which article is requested because it computes a homomorphic dot product between an encrypted one-hot vector (encrypted under a key that only the client knows) and the total set of articles in plaintext. The encrypted query does not reveal anything about the requested index (ie it is semantically secure).

    The 'magic' of homomorphic encryption is that the server is able to take an encrypted one-hot vector of bits, and a vector of plaintext articles, compute a homomorphic dot product, and produce a single ciphertext encoding the single desired plaintext article. This single output ciphertext is crucially also encrypted under the client's secret key, so it reveals nothing to the server.

    • Thanks. I knew homomorphic encryption lets you do math on encrypted data without knowing its contents. But it hadn't occurred to me you can keep one side of the arithmetic in plaintext. It also helped when gojomo pointed out that every query touches every byte of the full dataset.

      So basically (if I've got this right and at the risk of over-simplifying): Client computes a bit mask and encrypts it, sends that to the server, and says "do a big multiplication of this against Wikipedia content". That leaves only the relevant portion in the encrypted result, which only the client can decrypt.

      How does the client know which bits in the initial "mask" (aka one-hot vector) to set? Does it have to download a full index of article titles? (So in a sense this is just a big retrieval system that pulls an array element based on its index, not a freeform search tool).

      1 reply →

    • Don't you have to pad the output ciphertext size to match the largest article you could possibly request from the set of articles? Or is a fixed-size output an inherent property of homomorphic encryption schemes? Otherwise it seems to reveal something to the server just by the size of the ciphertext (since WP articles vary in size).

      4 replies →

  • The server already has a full copy of (its subset of) Wikipedia.

    Every query touches every byte of that snapshot, in the same way... but the homomorphic-encryption math distills that down to just the fragment the client requested.

    And it does so without giving even the CPU executing that math any hint of what bytes/ranges/topics survive the math. The much-smaller – but still, at every server step, encrypted – result is then sent to the client, which performs the decryption.

    It's moon-math magic.

Can this functionality be implemented as a peer-to-peer (or federated) service?

I'm assuming it'll depend on breaking down questions into hierarchical sub-questions that can either be recomposed locally or in another homomorphic context. But can that sort of thing be done without data-leaks, or prohibitively expensive inter-node communication?

Are there any introductory resources (that you know of) on homomorphic encryption and compute that'll turn this into less of a mind-fuck?

  • NuCypher is doing some really fun stuff in this space: https://www.nucypher.com

    Basically, you upload encrypted blobs to a P2P network, and you can issue special proxy-re-encryption keys to people that allows them to download the encrypted content, without revealing any of the content to people running the nodes (who store and replicate the data). You can also do really interesting things like revoking keys to remove access for people who haven't downloaded the blob yet.

  • It would be very cool to federate this! I am particularly interested in applications to DHTs/Kademlia, especially for things like IPFS.

    Since this has only recently become practical, there is a bit of a dearth of resources on it at the moment. I’m going to try to do my part and write a blog post at some point.

I understand that you do some kind of dot product (with two steps, Regev and GSW). However, it looks to me that those steps involve fixed dimension vectors.

- How do you handle variable length data? Do you need to pad it?

- What is the memory overhead of the storage of encrypted data?

I think that at least for video data, the streaming scheme "leaks" the size of the encrypted data with the number of streaming packets.

  • Yeah, every record needs to be the same size. For the demo, we batch the articles into 100KB chunks. We do a good job packing, where we put many small articles into a 100KB chunk, and we split large articles into many 100KB chunks. This packing is pretty efficient: roughly 90% of the space is used.

    The memory overhead is significant but not prohibitive… we have to keep something like a 5-8x larger encoded database in memory, but this overhead is from encoding the plaintext in a special format to allow a fast dot product, not from inefficiency in the packing.
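
    For illustration, a rough sketch of that packing step (next-fit for small articles, splitting for big ones; not the actual spiral-rs code):

        // Pack articles into fixed 100 KB records: small ones share a record,
        // large ones are split across several; everything is padded to 100 KB.
        const CHUNK: usize = 100 * 1024;

        fn pack(articles: &[Vec<u8>]) -> Vec<Vec<u8>> {
            let mut records: Vec<Vec<u8>> = vec![Vec::new()];
            for art in articles {
                if art.len() > CHUNK {
                    for piece in art.chunks(CHUNK) {
                        records.push(piece.to_vec());
                    }
                } else if records.last().unwrap().len() + art.len() <= CHUNK {
                    records.last_mut().unwrap().extend_from_slice(art);
                } else {
                    records.push(art.clone());
                }
            }
            for r in &mut records {
                r.resize(CHUNK, 0); // pad every record to exactly 100 KB
            }
            records
        }

        fn main() {
            let articles = vec![vec![1u8; 40_000], vec![2u8; 70_000], vec![3u8; 250_000]];
            let records = pack(&articles);
            assert!(records.iter().all(|r| r.len() == CHUNK));
            println!("{} articles -> {} fixed-size records", articles.len(), records.len());
        }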

    • Is it not possible to determine which article(s) the user downloaded based on the memory locations read? Of course, you couldn't tell apart multiple small articles within the same 100KB chunk, but for any medium to large article, you'd be able to make a good guess (if there are a handful of articles there) or an exact match (if there is <=1 article in that chunk), no?

      Or does the server go through a large chunk of its memory (say, at least a quarter of all of Wikipedia) and perform some oblivious computation on all of that data (applying the result modulo this 100KB return buffer)? That sounds very resource-intensive, at least for something large like Wikipedia (a doctor's office with some information pages of a few KB each could more easily do such a thing).

      In the latter case, is each request unique (does it involve some sort of IV that the client can xor out of the data again) or could an index be built similar to a list of hashed PIN codes mapped back to plain text numbers?

      Edit: I had already read some comments but just two comments further would have been my answer... :) https://news.ycombinator.com/item?id=31669924

      > One query for one item in the database is indistinguishable (without the client’s key) from another query for the same item later; in other words, it’s similar to something like the guarantee of CBC or GCM modes, where as long as you use it correctly, it is secure even if the attacker can see many encryptions of its choosing.

      That is some cool stuff indeed. I'm going to have to up my game when building or reviewing privacy-aware applications. Sure, a file sharing service is not going to practically allow this, but I'm sure that with this knowledge, I will come across places where it makes sense from both a usefulness (e.g. medical info) and practicality (data set size) perspective.

      2 replies →

If you say a malicious server can't determine which article was retrieved, is that private information retrieval (PIR)? Something must be different here. I thought there was a theorem that for single-server PIR to work, the client has to download the entire DB, which is the right way to read Wikipedia privately anyway.

  • This is PIR. You do have to download the whole database for information-theoretic security, but not for computational security. If you assume the hardness of some problem (in this case, lattices, but it is also possible from RSA, ECC, etc) it is possible to do much better than simply downloading the entire database.

    • Does the server have to scan the whole database on every query? If not, doesn't the disk access pattern tell you what the query was? I had thought you had to download the whole DB even for computational PIR, but hmm, maybe not in some cases where there is only 1 client with a secret.

      1 reply →

Do you have a blog or Twitter? I'd like to keep up with any other cool projects you're working on!

Fantastic project.

Have you considered running (# of cpus) parallel scanners continuously? An inbound query “hops on” the least-loaded scanner; at each article/chunk the scanner runs all the queries; each query “hops off” and returns after it has completed the cycle through the entire DB.

  • I do indeed use all 6 cores of my server simultaneously, and the caching on the database part of memory is quite effective. If I had a little spare cash I might just purchase some more CPUs :-)

Well, but you could get into the security space and license your server db technology for shit like IoT lights. I don't want the company knowing if my lights are on or off, but if they had a homomorphically encrypted backend and app, I might trust it.

  • To get press and earn legitimacy, make a hacking challenge: put up pcaps and a $1 million prize if anyone can break it.

This is wonderful! I've never seen anything like this in practical form.

I hope it doesn't become standard practice for general websites (as I imagine some would like to see), but it's an amazing tool and there will probably be many wonderful uses.

This kind of stuff gives some of the best arguments for open source software (OSS) to date. Otherwise, it has to be taken completely on faith, which then defeats nearly the entire purpose and makes the performance overhead untenable.

  • Homomorphic encryption actually causes some interesting asymmetries here. It is very important to have an open source client, since of course a malicious client could trivially leak your desired index. However, the server’s operation is totally untrusted… so it’s actually not important for the server to be open source. It is still nice for all the regular reasons open source is good, but interesting to note.

> As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.

Arguably, a malicious server could deliver a bad executable too.

  • Yeah, I mean you generally have to go with trust on first use at some point. You can also do code signing, check hashes, build from source, compare multiple sources, etc. All the standard software supply chain security measures.

Idea: Apply this to personalized advertising. Client sends his interests + habits + personal info encrypted to the server. Server finds and sends back to client the best ad based on clients info.

Can anyone recommend an explanation of this concept geared towards people with only a superficial knowledge of encryption?

This seems to be some kind of search applied on an encrypted dataset, is that right?

  • It's like, I send the server an encrypted math problem. The server has no idea what the math problem is, but homomorphic encryption allows it to compute an (encrypted) result and send that back to me. I get the result and decrypt it for the correct answer. It's novel because you don't have to trust the server with your math problems.

  • A fairly laymans explanation was posted elsewhere in the thread: https://news.ycombinator.com/item?id=31671914

    Quoting it here for convenience:

    > With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.

    > There is no usable information to retain.

    > So the question becomes: what can you calculate using arithmetic on secret numbers?

    > Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.

    > The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."

    > Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.

    If you found the explanation useful, you can upvote the original comment linked above

Very nice! It's great against snoopers that lack authority, but when they do have some authority (bosses, government), without plausible deniability it can do more harm than good.

  • Can you explain this comment more? Genuinely asking as I don’t understand the implication/downside?

      The fact that you are evading surveillance will get you in trouble, and on top of that, false conclusions about what you were doing could be drawn. Maybe you were browsing academic content from Iran, for example: you get in trouble for using the tool, but also for supposedly browsing anti-islam content, which you can't disprove; your accusers can make arbitrary false claims by correlating things about you with your evasive actions. Plausible deniability means generating forensic evidence that serves as a cover. In my example, that would be generating traffic that decrypts as harmless content, or using stego to hide the real content inside cover content, and bundling the software as a feature of some other browser or extension that has other purposes, which lets you say "I just installed a pdf converter extension, I didn't know it let me read wikipedia with homomorphic crypto as well".

      1 reply →

This sounds like the ultimate anti-user profiling and targeted advertising solution. I hope google and other advertising giants can’t stop this. Thoughts?

This is very cool OP! I interviewed to be a privacy engineer with Wikimedia a while back.

I suggested that my goal would be to add a v3 onion service. They had actually listed years of "homomorphic encryption" experience as a requirement. I phoned up the recruiter and basically said it's OK if there is a personality conflict, but the role as written was impossible to fill, and it scared me that very good suggestions for privacy, as well as for the health of the Tor network, were discarded.

(If you set up a dot onion, that frees up traffic on exit nodes, whose capacity is limited.)

Big thanks to the OP for being willing to share this work, it's very cool and I'm about to read your eprint.

I'm excited about the potential of homomorphic encryption, though I worry about things like CPU cost -- I recall when folks had to really be nudged not to encrypt huge blocks of data with PGP, but instead use it to encrypt the passphrase to a Truecrypt volume using a symmetric cipher like AES.

(I'd love to know how we got to a point where Twitter added an onion service and then banned me, while Wikipedia continues to not even support MFA for logins -- I recently registered an account intending to eventually upload some art to the commons, but the perpetual refusal to allow folks to make healthy choices disturbs me.)

In fact, after reading articles like these [1][2], I question the integrity of the folks I interacted with during the interview process.

On my end, it was especially disturbing since prior to enrolling in my PhD, the alternative path I discussed was becoming an FBI agent focused on counter intelligence in the "cyber" realm.

The agent I spoke with told me I'd serve "at the needs of the bureau", so that would mean probably not using my computer skills, which would then languish, then after a couple years I might still not get my desired position, and gave me a card, which I eventually lost.

Years later, prior to the insurrection, I had to walk down to Carnegie Mellon and ask if anyone had his contact information, and was shocked that folks refused to even point me at a link to the lecture, which had been listed as open to the public.

I'm someone who reads Wikipedia, not really edits, but the vast majority of users are readers not editors, and this perpetual pattern of refusing to enable privacy enhancing technologies, paired with using privileged access to make hiring decisions against folks who lack the physical ability to make good privacy decisions, offended me on a deep, personal level, and is why I often post in a brash, erratic manner.

Because I see zero incentive to stay silent -- if I'm quiet, people will slowly drain my bank account.

If I post, there is a chance someone will see what I say, notice my skills, and offer full time employment. So I have to continue risking offending folks until I find a full time job, which I have not had since I left the Center for Democracy and Technology under duress following a series of electronic and physical attacks, paired with threats and harassment by staffers in the organization.

TL;DR: Great research, but I hope they also add an onion service rather than jump straight to using this :-)

[1] https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@list...

[2] https://slate.com/technology/2021/10/wikipedia-mainland-chin...

  • I tried, but I simply can't follow your train of thought. You keep going back and forth between criticizing Wikimedia hiring and technology choices, advertising yourself, and deliberating over onion services. And it all seems extremely tangential to the article (which is really not about the future of Wikipedia, or Tor, or your career).

    • My bad!

      I worry they have insider threat issues that remain unsolved.

      I hope they add an onion service.

      I think the tech is cool, but issues about untested code aside, I worry about CPU overhead.

      (However, on the last point, I suspect that much like when we worried about the CPU overhead of encryption in the past, it will matter less and less as hardware and implementations improve.)

      I often feel like I have to speak exhaustively and at length to get my point across, but that may be a side effect of a large chunk of my professional network being K Streeters - they like to misunderstand on purpose then complain you explained things at length.

      Is the above better? I can skip marketing myself if that's the issue - I just notice a persistent issue that folks say they need certain skills, I know I have them, but folks disbelieve me. Short of being arrested for a CFAA violation I'm not sure how to prove it to those types at this point, and I don't intend on doing that, LOL.

      If you actually care, it's painful to see people destroy the things you care about.

How does the server select the article in a way that we can be sure they don't record the article sent back? Are the articles encrypted on the server too?

  • Yes, in fact the article number is not even decrypted by the server, so the server doesn't know which article you asked for!

    • How is this not vulnerable to side-channel attacks like disk-access patterns?

      Could I, as a malicious server, request myself a target article and correlate that with legitimate user requests?

      1 reply →