Comment by nneonneo

2 days ago

I mean, even back in 2021 the CLIP model was getting fooled by text overlaid onto images: https://www.theguardian.com/technology/2021/mar/08/typograph...

That article shows a classic example: an apple is classified as 85% Granny Smith, but taping a handwritten label in front saying "iPod" flips the classification to 99.7% iPod.

The handwritten label was by far the dominant aspect of the "iPod" image. The only mildly interesting aspect of that attack is that it's a reminder that these systems are bad at distinguishing a thing (an iPod) from a reference to that thing (the text "iPod").
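For context on why that happens: CLIP does zero-shot classification by embedding the image and each candidate label into a shared space and taking a softmax over the similarities, so a rendered word that pulls the image embedding toward a label's text embedding dominates the result. Here's a toy numpy sketch of that mechanism; the embeddings are made up for illustration and are not real CLIP outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy unit-vector "text embeddings" for the two candidate labels
# (hypothetical stand-ins for CLIP's text encoder output).
text_emb = {
    "Granny Smith": np.array([1.0, 0.0]),
    "iPod":         np.array([0.0, 1.0]),
}

# Toy "image embeddings": a plain apple sits near the "Granny Smith"
# text direction; taping a written "iPod" label onto it drags the
# embedding toward the "iPod" text direction.
apple            = np.array([0.95, 0.05])
apple_with_label = np.array([0.20, 0.90])

def classify(img_emb):
    # CLIP-style scoring: scaled cosine similarity, then softmax.
    logits = np.array([100.0 * img_emb @ t for t in text_emb.values()])
    probs = softmax(logits)
    return dict(zip(text_emb.keys(), probs))

print(classify(apple))             # "Granny Smith" wins
print(classify(apple_with_label))  # "iPod" wins
```

The point of the sketch is that nothing about the *apple* changes between the two cases; only the component aligned with the text "iPod" does, which is exactly what the handwritten label exploits.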

The apple has nothing to do with that, and it's bizarre that the researchers failed to understand it.