
Comment by jsjohnst

4 years ago

> Was it an actual image of a person? Were they clothed?

Some of the false positives were of people, others weren't. It wasn't that the hashing function itself was problematic, but that the database contained hashes that weren't of CP content; the chance of an actual hash collision was way lower than the false positive rate we observed (my guess is it was "data entry" type mistakes by NCMEC, but I have no proof to back up that theory). I made it a point to never personally see any content that matched against NCMEC's database until it was deemed "safe," since I didn't want anything to do with it (both from a disgusted perspective and from a legal-risk perspective), but I had coworkers who had to investigate every match, and I felt so bad for them.

In the case of PhotoDNA, the hash is conceptually similar to an MD5 or SHA-1 hash of a file. The difference between PhotoDNA and your normal hash functions is that it isn't an exact hash of the raw bytes, but rather something more like a hash of the "visual representation" of the image. When we were doing the initial implementation/rollout (I think late 2013ish), I was curious and did a bunch of testing to see how much I could vary a test image and still have the hash match. Resizes and crops (unless drastic) would almost always come back within the fuzziness window we were using. Overlaying some text or a basic shape (like a frame) would also often match. I then used Photoshop to tweak color/contrast/white balance/brightness/etc., and that's where it started getting hit or miss.
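PhotoDNA's actual algorithm is proprietary, so as an illustration only, here's a much simpler perceptual hash (a "difference hash," or dHash) that shows the same idea: hash the visual content rather than the raw bytes, then treat two images as a match when their hashes are within a fuzziness window (a Hamming-distance threshold). The 9x8 grid size, the synthetic image, and the edits below are my own toy example, not anything from PhotoDNA:

```python
# Toy perceptual hash (dHash), NOT PhotoDNA's algorithm. Each bit records
# whether a pixel is brighter than its right-hand neighbor, so the hash
# captures relative brightness structure instead of exact bytes.

def dhash(pixels):
    """pixels: 8 rows of 9 grayscale values -> 64-bit integer hash."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):  # 8 comparisons per row
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# A tiny synthetic 9x8 "image" (brightness values 0-255).
image = [[(r * 29 + c * 31) % 256 for c in range(9)] for r in range(8)]

# Mild edit: a uniform brightness shift mostly preserves which pixel of
# each pair is brighter, so the hash stays close.
brighter = [[min(p + 40, 255) for p in row] for row in image]

# Drastic edit: inverting brightness flips every comparison, so the hash
# ends up far outside any reasonable fuzziness window.
inverted = [[255 - p for p in row] for row in image]

h0, h1, h2 = dhash(image), dhash(brighter), dhash(inverted)
print(hamming(h0, h1))  # small: within a fuzziness window (e.g. <= 10)
print(hamming(h0, h2))  # large: nowhere near a match
```

Real perceptual hashes first downscale and grayscale the full image, which is why moderate resizes and crops still land within the window while aggressive color/contrast edits start breaking the brightness relationships the hash depends on.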