Comment by nine_k

8 hours ago

Isn't it the "adversarial image" attack, well-known in (earlier) visual recognition models [1]? That would be a quite obvious vector.

[1]: https://www.science.org/content/article/turtle-or-rifle-hack...

dijksterhuis 8 hours ago

In general, if you zoom all the way out, yes, the high-level optimization problem is very similar: find some `delta` such that `target_y = model_inference(delta + x)`, where `target_y != real_y` and `size_of(delta) < threshold`.
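To make that formulation concrete, here is a minimal sketch on a toy linear classifier. Everything beyond `delta`, `x`, `target_y`, `real_y`, `threshold` and `model_inference` is an assumption made for illustration: the linear model (`W`, `b`), the choice of the max-norm as `size_of`, the signed-gradient update, and the helper name `targeted_attack`. It is not the attack from the thread, just one simple way to search for such a `delta`.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def model_inference(W, b, x):
    """Toy stand-in for a real model: a linear classifier returning a class index."""
    return int(np.argmax(W @ x + b))

def targeted_attack(W, b, x, target_y, threshold, step=0.01, iters=500):
    """Search for delta with max|delta| <= threshold that pushes x to target_y.

    Hypothetical helper: signed-gradient steps on the cross-entropy loss
    toward target_y, projected back into the max-norm ball each iteration.
    """
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target_y] = 1.0
    for _ in range(iters):
        p = softmax(W @ (x + delta) + b)
        grad = W.T @ (p - onehot)                        # d(loss)/d(input) for a linear model
        delta = delta - step * np.sign(grad)             # step toward the target class
        delta = np.clip(delta, -threshold, threshold)    # enforce the size constraint
        if model_inference(W, b, x + delta) == target_y:
            break
    return delta

# Tiny two-class example: the clean input is class 0, the target is class 1.
W = np.eye(2)
b = np.zeros(2)
x = np.array([1.0, 0.8])
real_y = model_inference(W, b, x)
delta = targeted_attack(W, b, x, target_y=1, threshold=0.2)
print(real_y, model_inference(W, b, x + delta), np.abs(delta).max())
```

For real image or audio models the gradient would come from automatic differentiation rather than a closed form, and `size_of` would be chosen to match what people can actually perceive.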

But (1) older audio models typically used different architectures like RNNs (recurrent networks), which came with additional challenges compared to the CNNs (convolutional networks) that image models used, e.g. the exploding gradients problem. During training of RNNs, vanishing gradients are a potential problem. During advex (adversarial example) optimization the problem gets inverted, so gradients explode instead, and you have to do different things to solve it.
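A rough illustration of why recurrence makes gradients fragile in either direction, assuming a purely linear recurrence `h_t = W h_{t-1} + x_t` (real RNNs add nonlinearities and gating, so this is only a toy). The function name `input_gradient_norm` and the scaled-orthogonal choice of `W` are made up for the example. The gradient of the final state with respect to the first input is a product of `steps` copies of `W`, so its norm shrinks or grows geometrically depending on the scale of `W`.

```python
import numpy as np

def input_gradient_norm(scale, steps, size=8, seed=0):
    """Spectral norm of d h_T / d x_0 for h_t = W h_{t-1} + x_t.

    W is a random orthogonal matrix times `scale`, so every singular
    value of W equals `scale` and the result is scale ** steps.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    W = scale * q
    jac = np.eye(size)
    for _ in range(steps):
        jac = W @ jac            # chain rule: one factor of W per time step
    return np.linalg.norm(jac, 2)

print(input_gradient_norm(0.9, 50))   # shrinks toward zero (vanishing)
print(input_gradient_norm(1.1, 50))   # grows large (exploding)
```

This only shows the mechanism. It does not show which regime a given audio model ends up in during training versus during adversarial optimization.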

Also (2) the human stuff related to imperceptibility is very different with audio. Ears vs eyes.

So, they're the same, but different.

source -- this is what my (unfinished) phd was on. i should really write up the attack that i crafted but never got published :(