Comment by amatecha
1 month ago
I read that. It doesn't actually explain how the server can tell me the Eiffel Tower is in my photo without knowing it's telling me that. It glosses over the mechanism by which it can tell me what's in my photo without the service itself knowing. Yeah, cool, never-decrypted. So how do they match up? An ultra-abstract "A[encrypted] + B[encrypted] = A+B[encrypted]" doesn't tell me anything.
As an aside, your unfounded derision about "RTFA" is unwarranted and unnecessary. That was an unusually-hostile response for like, just talking about things.
The server doesn't tell you (and doesn't know) that it's the eiffel tower. Only when the result comes back to your device, is your device able to identify it as such.
I'll take it as if you don't know what homomorphic encryption means.
caveat: this is not rigorous math, just a rough presentation for one to wrap their head around the keystone principles of it
Say you have a space like natural numbers N and a computation function f: N->R. I'm just using R - as in result - so that it's not confusing below:
Take a n in N, f(n) = r, r in R
That's nice, we compute something, but the "computation place" is going to know both n and r.
Now imagine you have an encryption function g that maps every n in N to another integer. I'll just mark encrypted things with ' e.g the encrypted space is N' for easier trackability, so g: N->N' is encryption and its inverse g': N'->N is decryption.
So take a n in N, g(n) = n' in N', and g'(n') = n back again.
That's basic encryption.
Now back to making f useful: the idea is to have an encryption scheme that allows performing a certain set of operations on the encrypted space and get an encrypted result that results in an identical result once decrypted. From these operations we can have a f' function so that:
for all n in N, f(n) = g'(f'(g(n)))
Visually:
The whole point is that cheap f and f' happen locally and intensive g' can happen in untrusted places, because all that is seen is an encrypted input n' and encrypted result r'. Only thing it can see is at best that there's A match but not what THE match is, because that all just looks like random numbers; even the no-match case could very well be encrypted and in that case it doesn't even see that.
That is the principle of "homomorphic encryption", and it carries the intrinsic property of the computation place not knowing a single thing about neither the input or the output.
> It doesn't actually explain how the server can tell me the Eiffel Tower is in my photo without knowing it's telling me that.
So it depends on what you mean by "explain how":
- "homomorphic encryption" is the high level answer, described and provided in the Apple article
- if that is not a sufficiently detailed answer then you gotta roll your sleeves up and dive much deeper: the swift-homomorphic-encryption implementation, the BFV paper, the Wally paper, and the other bits that help prevent accidental side channel leaking are all both superficially discussed in a synthetic way and deep-linked in the Apple article in "would you like to know more" fashion.
Or maybe you wanted to know how they performed the actual match (g') in the encrypted space?
> With PNNS, the client encrypts a vector embedding and sends the resulting ciphertext as a query to the server. The server performs HE computation to conduct a nearest neighbor search and sends the resulting encrypted values back to the requesting device, which decrypts to learn the nearest neighbor to its query embedding
Which is possible because they
> have implemented the Brakerski-Fan-Vercauteren (BFV) HE scheme, which supports homomorphic operations that are well suited for computation (such as dot products or cosine similarity) on embedding vectors that are common to ML workflows.
My apologies for the harsh words, they were not meant as an ad hominem, more like a general ranty reflection on the state of the web society.