Comment by lispisok

2 years ago

So they generated training data from one laptop and microphone then generated test data with the exact same laptop and microphone in the same setup, possibly one person pressing the keys too. For the Zoom model they trained a new model with data gathered from Zoom. They call it a practical side channel attack but they didn't do anything to see if this approach could generalize at all.

I believe that is the generalisable version of the attack. You're not looking to learn the sound of arbitrary keyboards with this attack, rather you're looking to learn the sound of specific targets.

For example, a Twitch streamer enters responses into their stream-chat with a live mic. Later, the streamer enters their Twitch password. Someone employing this technique could reasonably be able to learn the audio from the first scenario, and apply the findings in the second scenario.

  • Finally, a real security weakness to cite when making fun of people for their mechanical keyboard. Time to start recording the audio of Zoom calls with some particularly loud typers...

    • I used to work in an office space with an independent contractor whose schtick was that he was a genius. The affectations around his genius-ness included casually bringing up Mensa meetings, dropping magazines like Foreign Affairs and academic journals around the office, and his fucking keyboard.

      The keyboard had custom switches that were very loud. And he typed fast - it was like living on a gun range. Everyone in the office probably would have chipped in for a hitman, but alas, the CTO, whose office had a solid door, was “inspired” that the mechanical feedback helped fuel inspiration in boy wonder.

      Had we thought of the security risks of the keyboard, I would have brought good scotch to the infosec dude while expressing my concerns.

    • Mechanical keyboard user here. Most of us use mechanical keyboards because they're a lot more fun to type on. That's it. Because if you're not having fun, what's the point?

    • It’s so fascinating to watch this play out live. Once again, an ambitious kid can implement software hacks that are very funny when used for a joke, but also have massive real-world implications.

  • I think maybe you wouldn't even need to see the keystrokes. Given enough examples of just audio, I wonder if you could work out the keys using the statistical letter patterns in language.

  • And there are therefore millions of hours of video that could be attack surface area already in the wild

  • for a few years I've used rtx voice to remove keyboard typing and other background noise
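The audio-only idea raised in the replies above (working out keys from the statistical letter patterns in language) is essentially frequency analysis on a substitution cipher. A minimal sketch, assuming some earlier step has already grouped keystroke sounds into clusters: `cluster_ids`, `guess_mapping`, `decode` and the `ENGLISH_BY_FREQ` ordering are all hypothetical names and an approximate ordering chosen for illustration, not anything from the paper.

```python
from collections import Counter

# Assumption: an approximate ordering of the most common English
# characters, with space first. Real attacks would use richer statistics
# (digrams, dictionaries, a language model).
ENGLISH_BY_FREQ = " etaoinshrdlu"


def guess_mapping(cluster_ids):
    """Map each sound cluster to a character by matching frequency rank."""
    counts = Counter(cluster_ids)
    # Most frequent cluster first; ties broken by cluster id for determinism.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return {
        cluster: ENGLISH_BY_FREQ[rank] if rank < len(ENGLISH_BY_FREQ) else "?"
        for rank, cluster in enumerate(ranked)
    }


def decode(cluster_ids, mapping):
    """Turn a sequence of cluster ids into a first-guess string."""
    return "".join(mapping[c] for c in cluster_ids)


if __name__ == "__main__":
    # Hypothetical cluster ids, one per detected keystroke.
    observed = [3, 1, 3, 2, 3, 1]
    mapping = guess_mapping(observed)
    print(mapping)
    print(repr(decode(observed, mapping)))
```

This only produces a first guess. With short recordings the frequency ranks are unreliable, so it would need far more audio than the toy input here.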

I think this limited attack surface can work without having to generalize one model to multiple people or keyboards. One advantage of a Zoom attack is that you get “plaintext” shortly after hearing the “ciphertext” if you can get the target to type into the chat window. And when you hear typing in other contexts it’s likely to be something that matches a handful of grammars that an LLM can recognize already (written languages, programming languages, commands, calculation inputs) - and when it doesn’t, that’s probably a password.
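A rough sketch of the "plaintext shortly after ciphertext" idea: if each keystroke sound is reduced to a feature vector and the chat message supplies the matching characters, you have labeled data for a per-target classifier. This assumes the hard parts (isolating keystrokes, extracting features, aligning them one-to-one with the chat text) are already done. The function names and the nearest-centroid approach are illustrative assumptions, not the method from the paper.

```python
import math
from collections import defaultdict


def fit_centroids(features, typed_text):
    """Average the feature vectors observed for each typed character.

    features:   list of equal-length numeric vectors, one per keystroke
    typed_text: the chat message, one character per keystroke
    """
    if len(features) != len(typed_text):
        raise ValueError("need exactly one feature vector per character")
    groups = defaultdict(list)
    for vec, ch in zip(features, typed_text):
        groups[ch].append(vec)
    return {
        ch: [sum(col) / len(vecs) for col in zip(*vecs)]
        for ch, vecs in groups.items()
    }


def predict(centroids, vec):
    """Guess the character whose centroid is closest to this keystroke."""
    return min(centroids, key=lambda ch: math.dist(centroids[ch], vec))


if __name__ == "__main__":
    # Hypothetical 2-D features for the chat message "abab".
    feats = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]]
    model = fit_centroids(feats, "abab")
    # Later keystrokes heard outside the chat window.
    print(predict(model, [0.2, 0.1]), predict(model, [0.8, 0.7]))
```

A key the target never typed in chat has no centroid, so it can never be predicted. That is one reason the attacker wants as much chat text as possible first.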

Do keystrokes still come through Zoom? The noise filtering has become extremely aggressive lately. I often hear people say “Sorry about that engine / ambulance / city noise” but nobody knows what they’re talking about.

How come keyboard sound suppression is not a standard option in all online communication apps? It’s not that hard, keyboard sounds are pretty distinct.

  • Maybe because it's easier said in an HN comment than in real life

    • VoiceMeter worked reasonably well for me after some tinkering with sliders. Nvidia RTX voice should filter that out too.

Yeah and in fact, I've heard of this attack being done in the past, but it heavily depends on the typist, the keyboard, etc. Cadence, sound, etc. change with the typist and hardware. This isn't new, and has very few, if any, practical applications for widespread replication.

The answer is that likely all the above are used.

Rather than asking “what signal is it detecting”, it might be better to ask “which signal carries the most information”… which would help in averting attacks.

This kind of stuff could be real menacing in all sorts of public places like airports, coffee shops, etc.

Seems simple to defend - use a password manager.

  • until you have to type your password to unlock it

    • High security safe locks have had protection against this for a long time: you press up/down arrows to move from a random starting digit to the correct digit.

      On-screen PIN entry with jumbled number mappings does the same thing. It also makes the inter-stroke delay rather independent of position, because the brain has to search the screen (although repeated digits and previously occurring digits are quicker, which is why some jumble at every keystroke).

      Keyboards with OLED keys (like the Apple Touchbar or the Optimus[1]) might also work.

      [1] https://www.artlebedev.com/optimus/popularis/

    • Biometric unlock or PIN? I have to type my master password on restart, hopefully you can do that off screen.

    • your password manager hopefully uses an additional factor to enable it on a new device, so definitely avoid typing that in on Twitch
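The jumbled on-screen PIN pad mentioned in the replies above can be sketched as follows. The point is that the physical position pressed, which is all an acoustic or timing observer could hope to learn, no longer says which digit was entered unless the observer can also see the layout. The function names are hypothetical; this is a toy model of the idea, not any product's implementation.

```python
import secrets


def jumbled_layout():
    """Return the digits 0-9 in a fresh random order (one per pad position)."""
    digits = list("0123456789")
    # Fisher-Yates shuffle driven by the OS CSPRNG via secrets.randbelow.
    for i in range(len(digits) - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        digits[i], digits[j] = digits[j], digits[i]
    return digits


def positions_for_pin(pin, layout):
    """Pad positions the user must press to enter `pin` on this layout."""
    return [layout.index(d) for d in pin]


def pin_from_positions(positions, layout):
    """What the device recovers, since it knows the layout it displayed."""
    return "".join(layout[p] for p in positions)


if __name__ == "__main__":
    pin = "4071"  # example PIN, chosen only for illustration
    for attempt in range(2):
        layout = jumbled_layout()
        pressed = positions_for_pin(pin, layout)
        # The pressed positions usually differ between attempts.
        print(attempt, pressed, pin_from_positions(pressed, layout))
```

Calling `jumbled_layout()` again after every keystroke, rather than once per entry, would correspond to the "jumble at every keystroke" variant.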

It is definitely possible to generalise this; a couple of years ago I did the same with a pair of microphones.