← Back to context

Comment by tansan78

1 year ago

I am just thinking there might be other better ways to preserve user's search privacy: using LLM embeddings (https://en.wikipedia.org/wiki/Word_embedding).

The browser creates embeddings of user query, then send the embeddings to the server.

To complete a search, the server is a machine and it does not really need text to understand what a user want. A series of numbers, like LLM embeddings, are totally fine (actually it might even be better, because embeddings map similar words closely, like Duck and Bird have similar embeddings).

On the privacy side, LLM embeddings are a bunch of numbers. Even the embeddings are associated with a user, other people cannot make meaning out of the embeddings. Therefore the user's privacy is preserved.

What do you think?

This is an interesting idea, and indeed such an approach to providing privacy has been formalized to different degrees and varying levels of success (eg. [1][2]).

[1] https://arxiv.org/abs/1204.2136 [2] https://arxiv.org/abs/2210.03458

Unfortunately, as described, such a solution would only satisfy a somewhat meaningless notion of privacy. Specifically, the embeddings by definition contain potentially private information about the user, revealing things like "I'm asking about birds" to use your example. Even though it might "compress" the query in a slightly lossy way, it would still reveal a great deal of information about the query.

A true solution to this problem would require something like differential privacy and adding noise to the embeddings. However, the noise required would (likely) end up destroying too much information from the embedding to preserve accuracy of the LLM.

While this is a neat sounding idea, there's a few issues here:

1. Embeddings are very close to a reversible function. It's not hard to take an embedding and query back the closest query semantically from the source LLM.

2. We already don't log any queries from the users. I'm aware this has to be taken on faith from Kagi users. But you can believe that not having any user query data at all anywhere is a significant speed bump for us in feature development, bug tracking and roadmapping.