Comment by godelski

5 years ago

So I guess there is more to the question that I'm asking.

> Our accuracy will not match that offered by services who index your data on their servers. But there's a trade off between user experience and privacy here,

I think most people here understand that[0]. We are on Hacker News, after all, not Reddit or a more general public forum. The concern isn't that your accuracy is worse. The concern is that your product has to advance and get better over time, and that mechanism is unclear and potentially concerning. The answer to this is also the answer to how you ensure continued privacy.

You talk about the "push files/thumbnails for indexing", and this is what's most concerning to me and at the heart of my original question. How are you collecting photos for _your_ training set? Obviously it isn't just ImageNet (dear god, I hope not). Are you creating your own JFT-300M? Where are those photos sourced from? What's the bias in that dataset? There are questions about the model too (CNNs and Transformers have different kinds of biases and see images differently), but that's a bigger question about training methods, and it gets complicated and nuanced fast. Obviously we know there is going to be some distillation going on.

There are a lot of concerns and questions here that won't really get asked of people who aren't pushing privacy-focused apps. But the biggest question is how you get feedback into your model and improve it. Non-privacy-preserving apps have it easier in this respect because you know which (real-world) examples you're failing on. Privacy-preserving methods don't have that feedback mechanism. We know homomorphic encryption isn't there yet, and we know there are concerns with federated learning (images can be reconstructed from gradients). So the question is: how are you going to improve your model in a privacy-preserving way?
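To make the federated-learning concern concrete: for a fully connected layer, the gradient a client uploads algebraically contains the input. This is a minimal toy sketch (all names and values are hypothetical, not any vendor's actual pipeline) showing that for a single example through `z = W @ x + b`, the gradients satisfy `dL/dW = delta * x^T` and `dL/db = delta`, so dividing a row of the weight gradient by the matching bias gradient recovers `x` exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)            # "private" input (e.g. a flattened image patch)
W = rng.random((4, 8))       # first fully connected layer
b = rng.random(4)

z = W @ x + b
target = rng.random(4)
delta = z - target           # dL/dz for squared-error loss L = 0.5*||z - target||^2

grad_W = np.outer(delta, x)  # dL/dW -- what a federated client would upload
grad_b = delta               # dL/db

# Any row of dL/dW divided by the matching entry of dL/db reconstructs x.
recovered_x = grad_W[0] / grad_b[0]
assert np.allclose(recovered_x, x)
```

Deeper networks need iterative optimization rather than this closed form, but the same leakage principle is what the "images can be reconstructed from gradients" literature exploits.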

[0] I think people also understand that on-device NNs are going to be worse than server-side NNs: there's a huge difference in parameter count and throughput between them, and phone hardware can only do so much.

> how are you going to improve your model in a privacy preserving method

We will not improve our models with the help of user data and will rely only on pre-trained models that are available in the public domain.