A battery-free keyboard is already possible, but likely impractical.
There is enough energy in a key press/release to send a radio signal, but not enough to keep transmitting while a key is held down. A combination of a solar panel, piezoelectric keys and a tiny li-ion cell (as backup) may be sufficient for a 'battery-less' keyboard, but it would be too expensive.
At ACM CCS 2005, Zhuang, Zhou and Tygar presented "Keyboard Acoustic Emanations Revisited" [1]:
We examine the problem of keyboard acoustic emanations. We present a novel attack taking as input a 10-minute sound recording of a user typing English text using a keyboard, and then recovering up to 96% of typed characters. There is no need for a labeled training recording. Moreover the recognizer bootstrapped this way can even recognize random text such as passwords: In our experiments, 90% of 5-character random passwords using only letters can be generated in fewer than 20 attempts by an adversary; 80% of 10-character passwords can be generated in fewer than 75 attempts. Our attack uses the statistical constraints of the underlying content, English language, to reconstruct text from sound recordings without any labeled training data. The attack uses a combination of standard machine learning and speech recognition techniques, including cepstrum features, Hidden Markov Models, linear classification, and feedback-based incremental learning.
which builds on the work of Asonov & Agrawal [2], who came up with the idea the previous year (2004):
We show that PC keyboards, notebook keyboards, telephone and ATM pads are vulnerable to attacks based on differentiating the sound emanated by different keys. Our attack employs a neural network to recognize the key being pressed. We also investigate why different keys produce different sounds and provide hints for the design of homophonic keyboards that would be resistant to this type of attack.
That would certainly solve the password issue. And if a sufficiently paranoid person is aware of this attack vector, they could just manually mute the mic at any time they are typing in any sensitive information. I initially was thinking that using a Dvorak or even better custom layout would help, but upon further reflection I think not -- the first-pass output would be equivalent to a substitution cipher, and quickly solved as such.
This topic has me wondering, though, whether it's possible to detect finger positioning, or for that matter screen information, from the reflection off the typist's eyeballs/eyeglasses shown in a webcam. Perhaps even if it is possible in principle, in practice most webcam resolution is simply too poor for that.
Zoom is good at filtering out rather loud background noises. I can't imagine that the sound of background typing during a conversation could be detected by the other party.
What? Zoom (by default, with auto mic adjustment) catches everything. Typing on a laptop is especially bad as the keyboard is closer to the mic than the person speaking (unless there is an external mic), so it's like a stampede of rhinos.
It shouldn't. Auto (the default) is designed to filter out keystrokes along with other noises, precisely because typing on the laptop is horrible for the reason you mention.
Keystrokes should only be a problem when noise suppression is set to low/off, which you want to do for e.g. playing music.
But noise suppression is applied to sending audio, not receiving it. So you might need to tell your coworkers to re-enable their noise suppression.
I think an attacker would find that many streamers with high quality audio have properly set up their mics with noise gate filters to remove their relatively quiet keystrokes.
I wonder how hard this problem is. I bet it’s actually not that bad. If I were to guess, a huge part of the problem is likely the position of the microphone.
Note that the testing data in the confusion matrix appears to have a uniformish distribution of each key being pressed. I suspect this data was not generated by someone actually typing because you would rarely see numbers and rare letters. It is possible these were simply pressed one at a time rather than in a series of rapid presses.
My guess is this approach uses the mic to identify where the sound of the key press was coming from rather than what each key press sounds like. Which does not invalidate the results but may make it seem less magical. Tbh it’s probably much worse this way because such a model could probably generalize very well across all keyboards and typing styles.
This idea could also be used for good at some point. Imagine “connecting” any keyboard to a device just by enabling the microphone.
It would have its own set of problems: two people couldn't use it at once, eavesdropping would be really easy… but it’d have its own set of interesting applications.
When calling my cellular/internet/medical/financial provider, it might be interesting to "see" what they are typing. (Or if they're randomly surfing the internet.)
I can imagine many, many situations where you might do this. But maybe another thing to be worried about is scammers being able to learn the password of the people they are calling.
Timing attacks have been an attack vector for a while? I remember reading about a tool on HN a couple years ago that did this. You don’t even need audio, the rate at which you enter the keys into the password field is enough.
There's a great scene in Le chant du Loup (The Wolf's Call) a French 2019 submarine flick (at one point on Netflix) where the sonar guy hears a password typed and reconstructs it from the sound of each keystroke.
I wonder whether it would be possible, and how much data you'd need, if you only had a long recording but no clear text to pair it with. Maybe you'd hear the space bar, as it often has a distinct sound (maybe backspace and return as well), and could create a script that finds the key associated with each sound by brute-forcing every key against every unique sound and checking which combinations come out as reasonable sentences.
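A rough sketch of the cheapest first pass at that idea, before any brute forcing: if the keystroke sounds have already been grouped into clusters (the integer cluster IDs here are hypothetical, and the clustering step itself is not shown), you can guess a mapping from frequency rank alone, treating the most common sound as the space bar. The frequency ordering is an approximation and the result would only be a starting point for a real search.

```python
from collections import Counter

# Space plus English letters, ordered roughly from most to least frequent.
# The exact ordering is an assumption; published frequency tables differ slightly.
ENGLISH_BY_FREQUENCY = " etaoinshrdlcumwfgypbvkjxqz"


def first_guess_mapping(cluster_ids):
    """Map each sound-cluster ID to a character using frequency rank alone."""
    counts = Counter(cluster_ids)
    # Most frequent cluster first; ties broken by cluster ID so the result is deterministic.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    usable = ranked[:len(ENGLISH_BY_FREQUENCY)]
    return {cluster: ENGLISH_BY_FREQUENCY[i] for i, cluster in enumerate(usable)}


def decode(cluster_ids, mapping):
    """Render a cluster sequence as text, with '?' for clusters that have no guess."""
    return "".join(mapping.get(c, "?") for c in cluster_ids)
```

On real recordings this first guess will be wrong for most letters; the point is only that it shrinks the search a dictionary-based or n-gram-based solver then has to do.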
I wonder how well this would go paired with that attack from a year or so ago that can recover audio from video of a glass window pane. Set up a camera pointed at the outside of your competitor's office? Hear their passwords? Heck, even send them an email, receive a reply, and train on them typing emails sent to you?
Wow that's kinda worrying for streamers on Twitch and Youtube etc. They sometimes enter passwords while buying a game on Steam or purchasing something on Amazon. Now they're going to have to think about muting as they are already targets of doxing.
Similar to the unique heartbeat each of us has, the way people type may be another fingerprinting method. When I type passwords and PINs, I often make motions to keys that I'm not hitting to fool the invisible stalker behind me.
Sounds like a great kickstarter/home diy: “mechanical keyboard noise scrambler”, which is just a portable speaker/mic that upon hearing your keyboard, starts playing fake attenuated noise.
Encrypted keyboards. Each key is randomly remapped at the start of each session. Some high security locks already use this to prevent over-the-shoulder cameras capturing codes.
The locations of the numbers move around to prevent mouseloggers from recording your movements.
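For what it's worth, the shuffled keypad is only a few lines. This is a minimal sketch, not how any particular lock or bank implements it: the function names are made up, and it uses Python's `secrets.SystemRandom` so the shuffle draws from the OS CSPRNG.

```python
import secrets


def shuffled_keypad(symbols="0123456789"):
    """Return the keypad symbols in a fresh random order drawn from the OS CSPRNG."""
    keys = list(symbols)
    secrets.SystemRandom().shuffle(keys)
    return keys


def render(keys, columns=3):
    """Lay the keys out as rows of text, `columns` keys per row."""
    rows = [keys[i:i + columns] for i in range(0, len(keys), columns)]
    return "\n".join(" ".join(row) for row in rows)


if __name__ == "__main__":
    # A new layout would be generated for every session (or every PIN entry).
    print(render(shuffled_keypad()))
```

Since the position of each digit changes every time, the sound or location of a press says nothing about which digit it was, as long as the attacker can't also see the screen.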
It seems like any way of doing it would end up slowing down the typist though. If it is just for the password, I could see it being possible, but if you're dealing with lots of information that needs to be protected, then it seems impossible.
When I type my login or wallet password, I've done it so many times that the sound profile is going to be quite different to normal typing. Does the model handle that?
As someone who teaches Dvorak touch typing, I recommend learning it no later than your twenties, because otherwise you will not be able to type passwords, if that is a goal of your learning. Typing passwords is the final exam for my students.
I find this really hard to believe. If it were really possible then people could do it with their ears, and they would be doing it and showing off that they can do it. The human ear (and brain) are really, really good at finding patterns and getting signal out of noise.
Yes. Humans have fantastic audio and video processing abilities, particularly picking out signal from noise. Even now human operators listen to sonar signals on submarines. There's a reason for that.
So they generated training data from one laptop and microphone then generated test data with the exact same laptop and microphone in the same setup, possibly one person pressing the keys too. For the Zoom model they trained a new model with data gathered from Zoom. They call it a practical side channel attack but they didn't do anything to see if this approach could generalize at all.
I believe that is the generalisable version of the attack. You're not looking to learn the sound of arbitrary keyboards with this attack, rather you're looking to learn the sound of specific targets.
For example, a Twitch streamer enters responses into their stream-chat with a live mic. Later, the streamer enters their Twitch password. Someone employing this technique could reasonably be able to learn the audio from the first scenario, and apply the findings in the second scenario.
Finally, a real security weakness to cite when making fun of people for their mechanical keyboard. Time to start recording the audio of Zoom calls with some particularly loud typers...
I guess more reason to just use a password manager to autofill your password?
I think maybe you wouldn't even need to see the keystrokes. Given enough examples of just audio, I wonder if you could work out the keys using the statistical letter patterns in language.
And there are therefore millions of hours of video that could be attack surface area already in the wild
for a few years I've used rtx voice to remove keyboard typing and other background noise
seems like a very niche case to be warranting the headline and Hackernews front page
I think this limited attack surface can work without having to generalize one model to multiple people or keyboards. One advantage of a Zoom attack is that you get “plaintext” shortly after hearing the “ciphertext” if you can get the target to type into the chat window. And when you hear typing in other contexts it’s likely to be something that matches a handful of grammars that an LLM can recognize already (written languages, programming languages, commands, calculation inputs) - and when it doesn’t, that’s probably a password.
Do keystrokes still come through Zoom? The noise filtering has become extremely aggressive lately, often hear people say “Sorry about that engine / ambulance / city noise” but nobody knows what they’re talking about.
It's for a targeted attack. It doesn't need to be generalized.
How come keyboard sound suppression is not a standard option in all online communication apps? It’s not that hard, keyboard sounds are pretty distinct.
Maybe because it's easier said in an HN comment than in real life
Yeah, and in fact I've heard of this attack being done in the past, but it heavily depends on the typist, the keyboard, etc. Cadence, sound, etc. change with the typist and hardware. This isn't new, and has very few, if any, practical applications for widespread replication.
The answer is that likely all the above are used.
The question of “what signal is it detecting” might be better framed as “which signal carries the most information”… which would help in averting attacks.
This kind of stuff could be a real menace in all sorts of public places like airports, coffee shops, etc.
Seems simple to defend - use a password manager.
until you have to type your password to unlock it
Good enough for PoC.
it is definitely possible to generalise this, a couple of years ago I did the same with a pair of microphones.
I did a similar acoustic side-channel attack as final year project at uni. There's a treasure trove of findings in this area, I'm just waiting for someone to combine methodologies. There are pretty good results using geometric models, trained and untrained statistical models like this and others, and combining these features with assorted language models.
Here's a few random papers I read along the way:
https://doi.org/10.1007/s10207-019-00449-8 - SonarSnoop, which uses a phone's speaker to produce ultrasonic audio that can be used to profile the user's interaction (e.g. entering swipe-based passcodes).
https://people.eecs.berkeley.edu/~daw/papers/ssh-use01.pdf - "Timing Analysis of Keystrokes and Timing Attacks on SSH", a paper from 2001 that uses statistical models of keystroke timings to retrieve passwords from encrypted SSH traffic.
https://doi.org/10.1145/1609956.1609959 - "Keyboard acoustic emanations revisited", which uses hidden Markov models and some other English language features to recover text based on classification via cepstrum features.
https://doi.org/10.1145/2660267.2660296 - "Context-free Attacks Using Keyboard Acoustic Emanations" which uses a geometric approach, using time-difference-of-arrival to estimate physical locations probabilistically.
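To give a feel for the time-difference-of-arrival idea in that last paper (this is only a toy sketch of the general technique, not the paper's method): with two microphones, you can cross-correlate the recordings of a single keystroke and take the lag with the highest score. The function names, the brute-force search and the 343 m/s speed of sound are my own assumptions.

```python
def cross_correlation_lag(mic_a, mic_b, max_lag):
    """Return the lag in samples at which mic_b best lines up with mic_a.

    A positive lag means the sound reached mic_b later than mic_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(mic_a):
            j = i + lag
            if 0 <= j < len(mic_b):
                score += a * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def path_difference_m(lag_samples, sample_rate_hz, speed_of_sound=343.0):
    """Convert a lag in samples to a difference in path length, in metres."""
    return lag_samples / sample_rate_hz * speed_of_sound
```

A lag of 3 samples at 48 kHz works out to about 2 cm of path difference, which is roughly the spacing between adjacent keys, so sample rate and microphone placement matter a lot for this kind of approach.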
I'm not clear why people are pooh-poohing this as if it's not a big deal. From a security and espionage point of view this is pretty significant - the audio learning has got to the point that a sensitive audio bug can basically be a keylogger. There are a ton of contexts where an audio tap would be much easier to get in place than a traditional network attack (and with modern shotgun mics, might not even require being in the building). That is applicable to much more than just password stealing.
I've always been a bit fascinated by this attack vector and wondered if it would get to this point.
Yes it seems like any possible physical side channel (eg Tempest as well) is now amenable to machine learning approaches. Very interesting indeed.
I wonder if playing the typing sound constantly could help. Not an abstract sound, but recording of your actual typing on this particular keyboard, mixed to play some realistic-sounding phrases / sequences. It should pause for a split second to let your actual keystrokes mix in. That would be really hard to decipher, or to correlate your typing with whatever other events (time to enter a password).
Better yet, play some white noise around you. I heard that it's actually done sometimes at really important meetings.
If you're not such a VIP, just type important things only on your phone; touch screens don't produce enough sound, hopefully.
you would need to tie microphone input with the actual keys typed, and enough of it to train a model. nothingburger
Fascinating. I'm really curious what the acoustic properties are that it's recognizing.
Is it more of a physical fingerprint of each key, such that if you swapped keys/springs the model would need to be updated? So it's produced by manufacturing inconsistencies, the way individual typewriters used to be forensically identified?
Or is it more that each key is identical, but produces a different resonance pattern within the keyboard/laptop due to the shape of all of the matter surrounding it? If you move the keyboard in the room, do you have to re-train the model?
I also wonder how much it varies depending on how hard you press each key -- not at all or a great deal? And what about by keyboard -- when you compare thin MacBook keys with an external full-height keyboard, is one easier/harder to recognize each key on than the other?
Building on what you said: (1) just the key's properties; (2) key properties relative to other keys; (3) sound transmission and environment between key and microphone; (4) relationship between key and finger; (5) relationship between key and associated dendrites
I presume typing style matters as well. How quickly you reach each key, rhythm, how hard you tend to hit a specific key.
My sense is that they profile the person more than the keyboard.
By the way, some (most?) videoconferencing software removes keyboard sounds from the audio, because it's particularly a distracting problem with laptops where the microphone is right next to the keys.
I'm pretty sure Zoom does this by default as part of its noise cancellation (it's potentially even easier since you can use keydown events to help identify, not just the audio stream).
So as long as basic default noise cancellation is on, that would at least prevent this over regular videoconferencing. And because of this, I'm having a hard time thinking of when else this would be a realistic threat, where the attacker wouldn't already have enough physical access to either install a regular keylogger or else a hidden camera.
Teams definitely doesn't have this, at least not by default, or not by default in our corp. Anytime somebody on the call starts typing you hear it very clearly.
Meetings between organizations, multi-office cafeterias, or coffee shops, perhaps.
If any random webpage is granted access to the microphone, I would think this could be a problem.
Georgi Gerganov created one a few years ago
https://github.com/ggerganov/kbd-audio
The example figure shows a key hit every half second, which suggests a pecking style of typing at around 24 wpm. This way the model gets very clean waveforms. I wonder how their approach would work with average or fast typists. The sound profiles might be much harder to link to characters.
Even if there was ambiguity, some data is better than none. Given enough training data, I suspect you could find repeatable patterns in standard typists: on a qwerty layout, after typing an "A", "Q" takes 1.2-2.3x as long to type as a "J", that kind of pairwise tempo pattern. Anything to reduce the search space from brute-forcing every candidate character.
Even better if the target uses a passphrase, "hXXXse battXXX stXXXXX cXXXXXX" becomes interpretable given a few landmark letters identified with high probability.
The Soviets successfully listened to typewriters back in the 1970s.
Impressive. To be fair, a lot of typewriters jam if you press more than one key at a time, plus they are very loud.
In response to this post, I just open sourced a starter project to a variation of this idea: https://github.com/secretlessai/audio-mnist. I've been interested in doing image classification techniques like CNN on audio data for a while.
A couple years ago for a weekend project I made a simple "audio-mnist" dataset from handwritten digit audio recordings. I never got past a few days worth of work, but open-sourcing it has been on my mind for a minute. This post kicked me into action. Getting some more data, basic CNN examples, etc. could provide a nice starting point for a lot of research and tools.
There is still separate code I'd have to find and make intelligible to create the recordings and split the audio.
Anyway, in case anyone finds part of this process interesting or useful.
Would love a wireless keyboard that works using this! It wouldn’t need any battery, charging or syncing!
Some old TV remotes used to work this way. They were made by Zenith and are called Space Command remotes. Apparently they are the reason TV remotes are sometimes called clickers.
https://www.theverge.com/23810061/zenith-space-command-remot...
I've never considered how odd clicker is for remote but it feels totally natural to me. Like something my parents or grandparents would say. Never thought about where it came from.
Imagine the UX of 1 in 20 characters typed being incorrectly inferred though. The P_failure*Cost impact would strike me as insufferable even if error rate were to improve by an order of magnitude.
I was thinking it could be a keyboard designed to make special sounds so it can be interpreted very accurately
Time to inject background audio of me typing "fuck you" into my zoom calls.
Text-to-keystroke-audio where the text comes from the LLM Prompt "fanfiction based on HGTV's Love It or List It starring an Ewok realtor and Klingon interior designer in iambic pentameter".
The goal is to cause the eavesdropper to totally reevaluate their life choices, and maybe even get caught up in the story.
Tactical noise!
That might make it even easier to decipher. A nice reference point.
Using an image classifier on spectrograms is pretty funny. Not a bad idea, given image classifiers are a dime a dozen, but still.
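For anyone wondering what the "image" actually is: a spectrogram is just a grid of frequency magnitudes over time, so it can be fed to anything that accepts a 2D array. Here's a minimal standard-library sketch using a naive DFT with a Hann window. The frame size and hop are arbitrary choices of mine, and this is a plain magnitude spectrogram, which may not be the exact representation used in the paper.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of magnitudes per frame, for frequency bins 0..frame_size // 2."""
    # The Hann window tapers each frame to reduce spectral leakage at its edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive O(N^2) DFT; a real pipeline would use an FFT library instead.
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames
```

Each row of the result is one time slice and each column one frequency bin, so a single keystroke becomes a small grayscale picture that an off-the-shelf image classifier can be trained on.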
It's actually quite common. One of the big bird recognition apps does just this.
There are multiple apps for this? Seems like PBS KIDS should own the authoritative one, and the licensing.
I don't use the qwerty layout, I use colemak. Likely this mitigates this for myself.
This is just security through obscurity. For real security, you need a cryptographically rolling keyboard layout.
My sister in law uses voice recognition and dictation software, so she doesn't even use a keyboard! Totally safe!
Whereas for practical security, having some common substring in all your passwords that you don't type but insert through some global hotkey would be just fine as a mitigation against eavesdrop attacks.
Yes, that's also obscurity, but obscurity is actually good. It only got a (deservedly) bad reputation from being used as a substitute for real security. (But I fail to see how using a nonstandard keyboard layout would even count as obscurity in the context of an audio attack, as the clear text reference would surely go through the same layout?)
Brilliant suggestion. Have a TRNG or a CSPRNG (if too poor for a TRNG) choose the next layout at random for you, ideally with every keystroke. Good luck cracking that!
...wait, are you telling me Konami shuffling the touch input for e-Amusement PINs[0] was a good idea!?
[0] Okay... deep breath
Konami is a pachinko manufacturer with a side hustle making rhythm games for Japanese arcades. They have an online service that all their games connect to called e-Amusement. You can log into it using an e-Amusement Pass card, and your card is locked to a PIN number you have to set up when you first use it. Cabinets with touchscreens give you a touch keypad, except all the digits are shuffled around, which is a total pain in the ass and you have to do this for every credit.
Indeed. Let me add that how your fingers come into contact with the keys is probably just as important. I recommend a cryptographically rolling choice of dustballs, crumbs, and boogers.
Why not just a keyboard that produces random noise?
I wouldn't be so sure. I'm pretty confident that statistical analysis would give away your layout (assuming there's enough data).
Stealing your layout.
At least it would have, until just now, when you recklessly disclosed your secret keyboard layout. :P
That's the equivalent of a shift cipher with a well known offset.
This specific attack could also be easily mitigated by dictating your passwords instead.
Couldn't they just translate the detected keystrokes to colemak layout?
Yes, but you would have to know the layout or try all possible layouts.
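To make the translation step concrete, here is a small sketch. It assumes the classifier outputs physical key positions labeled with their QWERTY legends, and it only covers the 26 letters plus the semicolon key; the function name is made up for illustration.

```python
# Physical keys, named by their QWERTY legends, row by row.
QWERTY = "qwertyuiop" "asdfghjkl;" "zxcvbnm"
# What the same physical keys produce under Colemak.
COLEMAK = "qwfpgjluy;" "arstdhneio" "zxcvbkm"

TO_COLEMAK = str.maketrans(QWERTY, COLEMAK)

def detected_keys_to_colemak(detected):
    """Reinterpret a string of detected QWERTY-labeled keys as Colemak output."""
    return detected.translate(TO_COLEMAK)

# A Colemak typist writing "hello" presses the keys labeled h, k, u, u, ;
recovered = detected_keys_to_colemak("hkuu;")
```

If the layout is unknown, the attacker is back to trying candidate tables like this one, or to treating the output as a substitution cipher.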
this is a targeted attack, it won't do much at all.
Now they can make wireless keyboards that don't need a battery or radio!
Going without a battery is already possible, but likely impractical.
There is enough energy during key press/release to be usable for sending radio signal, however it won't be sufficient to do it while holding a key. A combination of a solar panel, piezoelectric keys and a tiny li-ion (as backup) may be sufficient for a 'battery-less' keyboard, but it will be too expensive.
Could you send a separate 'key up' signal on release from the energy of the up-stroke?
This is hardly a new concept btw.
At ACM CCS 2005, Zhuang, Zhou and Tygar presented "Keyboard Acoustic Emanations Revisited" [1],
which builds on the work of Asonov & Agrawal [2], who came up with the idea the previous year (2004).
[1] https://dl.acm.org/doi/10.1145/1609956.1609959
[2] https://ieeexplore.ieee.org/document/1301311
maybe...
https://news.mit.edu/2014/algorithm-recovers-speech-from-vib...
So microphones need to get muted automatically by password prompts, seems simple enough in principle.
That would certainly solve the password issue. And if a sufficiently paranoid person is aware of this attack vector, they could just manually mute the mic at any time they are typing in any sensitive information. I initially was thinking that using a Dvorak or even better custom layout would help, but upon further reflection I think not -- the first-pass output would be equivalent to a substitution cipher, and quickly solved as such.
This topic has me wondering, though, whether it's possible to detect finger positioning, or for that matter screen information, from the reflection off the typist's eyeballs/eyeglasses shown in a webcam. Perhaps even if it is possible in principle, in practice most webcam resolution is simply too poor for that.
Zoom is good at filtering out rather loud background noises. I can't imagine that the sound of background typing during a conversation could be detected by the other party.
What? Zoom (by default with auto mic adjustment) catches everything. Typing on a laptop is especially bad as it is closer to the mic than the person speaking (unless there is an external mic), so it's like a stampede of rhinos.
It shouldn't. Auto (the default) is designed to filter out keystrokes along with other noises, precisely because typing on the laptop is horrible for the reason you mention.
Keystrokes should only be a problem when noise suppression is set to low/off, which you want to do for e.g. playing music.
But noise suppression is applied to sending audio, not receiving it. So you might need to tell your coworkers to re-enable their noise suppression.
In this case the parent comment is considering Zoom as an ally, while you are considering it an adversary.
So, in case that “what” was intended to denote some confusion, there is the most likely source.
If you’re on macOS, you can use the voice isolation mic mode.
I think about this attack when streamers on Twitch log in to websites etc.
I think an attacker would find that many streamers with high quality audio have properly set up their mics with noise gate filters to remove their relatively quiet keystrokes.
I wonder how hard this problem is. I bet it’s actually not that bad. If I were to guess, a huge part of the problem is likely the position of the microphone.
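For reference, a noise gate is conceptually simple. The sketch below is a bare-bones frame-based version with an assumed RMS threshold and frame size; real gates (in OBS, hardware mixers, etc.) add attack, hold, and release smoothing, which this leaves out.

```python
import math

def noise_gate(samples, threshold=0.1, frame=4):
    """Silence every frame whose RMS level falls below the threshold."""
    out = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        # Loud frames pass through untouched; quiet ones are zeroed.
        out.extend(chunk if rms >= threshold else [0.0] * len(chunk))
    return out

# Hypothetical levels: a faint keystroke followed by speech.
quiet_keystroke = [0.02, -0.03, 0.02, -0.01]
speech = [0.5, -0.4, 0.6, -0.5]
gated = noise_gate(quiet_keystroke + speech)
```

Note the obvious gap: keystrokes that land while the streamer is talking sit above the threshold along with the speech, so the gate lets them through.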
Note that the testing data in the confusion matrix appears to have a uniformish distribution of each key being pressed. I suspect this data was not generated by someone actually typing because you would rarely see numbers and rare letters. It is possible these were simply pressed one at a time rather than in a series of rapid presses.
My guess is this approach uses the mic to identify where the sound of the key press was coming from rather than what each key press sounds like. Which does not invalidate the results but may make it seem less magical. Tbh it’s probably much worse this way because such a model could probably generalize very well across all keyboards and typing styles.
This idea could also be used for good at some point. Imagine “connecting” any keyboard to a device just by enabling the microphone.
It would have its own set of problems: no two people could use it at once, eavesdropping would be really easy… but it’d have its own set of interesting applications.
New? The Soviets listened to typewriters in the 1970s.
But what passwords are you typing while on zoom and why aren't you on mute?
When calling my cellular/internet/medical/financial provider, it might be interesting to "see" what they are typing. (Or if they're randomly surfing the internet.)
Given your username, you might find this interesting:
https://en.m.wikipedia.org/wiki/Tempest_(codename)
TEMPEST considered almost everything from electromagnetic leakage to exactly the attack described here.
How long are you talking to them that you've been able to record samples of the sound of all their keystrokes and perform this analysis?
Call support, get the URLs and logins for all their internal apps. Ouch!
I can imagine many, many situations where you might do this. But maybe another thing to be worried about is scammees being able to know the password of people they are calling.
Timing attacks have been an attack vector for a while. I remember reading about a tool on HN a couple of years ago. You don’t even need audio; the rate at which you enter the keys into the password field is enough.
How do you get the rate?
Maybe any one of your browser tabs has JS listening to the accelerometer. It doesn't even require a permission, AFAIK.
Looking at the traffic of an SSH session?
I seriously doubt that.
There's a great scene in Le chant du Loup (The Wolf's Call) a French 2019 submarine flick (at one point on Netflix) where the sonar guy hears a password typed and reconstructs it from the sound of each keystroke.
https://youtu.be/a9Gz7Bg07u8
This attack is about as realistic as the film: a parallel universe where million to one chances happen nine times out of ten.
I wonder whether it would be possible, and how much data you would need, if you only had a long recording but no clear text to combine it with. Maybe you'd hear the space bar, as it often has a distinct sound (maybe backspace and return as well), and could create a script that finds the key associated with each sound by brute-forcing every key against every unique sound and checking which combinations come out as reasonable sentences.
I wonder how well this would go paired with that attack from a year or so ago that can recover audio from video of a glass window pane. Set up a camera pointed at the outside of your competitor's office? Hear their passwords? Heck, even send them an email, receive a reply, and train on them typing emails sent to you?
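A crude first pass at this is plain frequency analysis. The sketch below assumes the keystroke sounds have already been grouped into anonymous clusters (the integer IDs are hypothetical) and that the text is English; the frequency ordering is approximate, and a real attempt would refine the guess with a dictionary or language model.

```python
from collections import Counter

# Space first, then letters in roughly descending English frequency.
ENGLISH_BY_FREQUENCY = " etaoinshrdlcumwfgypbvkjxqz"

def guess_mapping(cluster_ids, alphabet=ENGLISH_BY_FREQUENCY):
    """Map the most common sound cluster to the most common character, and so on."""
    ranked = [cid for cid, _ in Counter(cluster_ids).most_common()]
    return {cid: ch for cid, ch in zip(ranked, alphabet)}

def decode(cluster_ids, mapping):
    """Apply a guessed mapping; unknown clusters become '?'."""
    return "".join(mapping.get(cid, "?") for cid in cluster_ids)

observed = [7, 3, 7, 9, 7, 3]  # made-up cluster IDs from a recording
mapping = guess_mapping(observed)
first_guess = decode(observed, mapping)
```

On a recording this short the guess is useless, of course. The point is that with enough typing the space bar and the common letters fall out first, which is exactly the substitution-cipher situation.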
I heard about stuff like this years ago, and how the CIA could get passwords by pointing long distance microphones at people's windows.
I suspected that the famously terrible Treasury Direct website with its on-screen keyboard was a half-assed attempt to prevent this sort of attack.
Wow that's kinda worrying for streamers on Twitch and Youtube etc. They sometimes enter passwords while buying a game on Steam or purchasing something on Amazon. Now they're going to have to think about muting as they are already targets of doxing.
Similar to the unique heartbeat each of us have, the way people type may be another fingerprinting method. When I type passwords and PINs, I often make motions to keys that I'm not hitting to fool the invisible stalker behind me.
Sounds like a great kickstarter/home diy: “mechanical keyboard noise scrambler”, which is just a portable speaker/mic that upon hearing your keyboard, starts playing fake attenuated noise.
What would be a good quality/price ratio microphone for this sort of keystroke sound recording?
It would be nice to try to tokenize the strokes and then try to assign labels probabilistically.
Encrypted keyboards. Each key is randomly remapped at the start of each session. Some high security locks already use this to prevent over-the-shoulder cameras capturing codes.
The bank pin UI from the game RuneScape comes to mind. https://imgur.io/UAgrY7e?r
The locations of the numbers move around to prevent mouseloggers from recording your movements.
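The shuffled keypad idea is easy to sketch. This is just an illustration with made-up function names, not how RuneScape or any real lock implements it; it uses the standard library's OS-backed random source for the shuffle.

```python
import secrets

DIGITS = "0123456789"

def shuffled_keypad():
    """Return the digits in a fresh random order, one digit per on-screen position."""
    layout = list(DIGITS)
    secrets.SystemRandom().shuffle(layout)  # OS-provided randomness
    return layout

def decode(layout, positions):
    """Server side: turn the pressed positions back into digits."""
    return "".join(layout[p] for p in positions)

layout = shuffled_keypad()
pin = "4071"  # hypothetical PIN
# What an observer sees: positions only, which change every session.
positions = [layout.index(d) for d in pin]
```

An eavesdropper who records positions (by sound, mouse movement, or camera angle) learns nothing reusable unless they also see the layout shown for that session.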
It seems like any way of doing it would end up slowing down the typist though. If it is just for the password, I could see it being possible, but if you're dealing with lots of information that needs to be protected, then it seems impossible.
When I type my login or wallet password, I've done it so many times that the sound profile is going to be quite different to normal typing. Does the model handle that?
There’s an app somewhere that removes your keyboard audio from your audio streams. Sounds like it is a vulnerability remediation.
Passwords aren't the only at-risk category. "This presentation is a tire fire" is a vector, too.
Some systems have a setting to disable touchpad for x milliseconds after a key press.
Do we need something similar for microphones too?
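Something like the touchpad setting could look like the sketch below. The timestamps, the 150 ms hold, and the function name are all assumptions for illustration. One caveat: the click itself happens right at the key press, so a real implementation would need to delay the outgoing audio slightly to catch the onset, which this does not model.

```python
import bisect

def muted_flags(frame_times_ms, key_times_ms, hold_ms=150):
    """For each audio frame, decide whether it falls within hold_ms after a key press."""
    keys = sorted(key_times_ms)
    flags = []
    for t in frame_times_ms:
        i = bisect.bisect_right(keys, t)  # count of key presses at or before t
        flags.append(i > 0 and t - keys[i - 1] <= hold_ms)
    return flags

# Hypothetical frames every 100 ms, key presses at 90 ms and 210 ms.
flags = muted_flags([0, 100, 200, 300, 400], [90, 210])
```

While someone is typing continuously this mutes the mic almost the whole time, which is probably what you want for a password field and less so during a meeting.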
Users will do anything and everything to avoid using FOSS, which by definition doesn't spy on the user.
How does it handle against me shrieking loudly while I type? Specifically screaming at my keyboard
So, from this point on, one time passwords only? I can't imagine any other proper solution.
Biometrics, physical security tokens, etc.
If this means the end of those loud mechanical keyboards then good. I never liked the clicking noise.
No it means the beginning of people playing recordings of loud mechanical keyboards all day to thwart the snooping algorithms.
I thought about something like this in 1999. It can also be done with high-volume beeps, like those on an ATM.
Physical Access Owns, as usual.
Would mechanical keyboards be easier targets for this than quieter ones?
Oh cool, so it's time to learn Dvorak or other keyboard setups.
You'd need to randomise the keyboard layout every so often, perhaps every 100 strokes.
As someone who teaches Dvorak touch typing, I recommend doing it no later than your sweet twenties, because otherwise you will not be able to type passwords, if that is a goal of your learning. Typing passwords is the final exam for my students.
It's always a good time to get a moonlander. :o)
A plain old desk fan makes an excellent white noise generator
If this means I have to abandon my clicky keyboard I give up.
That would be really terrible for streamers
We’re entering a post-privacy era jesus
I use 1Password and have never ever typed a password, so I am probably safe.
The risk isn't limited to passwords:
"...passwords, discussions, messages, or other sensitive information..."
Two words for you: Master password.
Touch ID
Death metal.
Suck it.
Very interesting that this is even possible. But seems somewhat dangerous, making an audio recording is very easy.
I find this really hard to believe. If it were really possible then people could do it with their ears, and they would be doing it and showing off that they can do it. The human ear (and brain) are really, really good at finding patterns and getting signal out of noise.
You're really surprised that computers can outperform humans at pattern recognition?
Yes. Humans have fantastic audio and video processing abilities, particularly picking out signal from noise. Even now human operators listen to sonar signals on submarines. There's a reason for that.
Computers are better at stuff than humans? Impossible! I am the king of math, no machine beats me in calculating numbers!
Piano players can do it if the typist uses a piano keyboard. Also 88 keys but arranged in one row.
I think that a person could do this too with enough training.
This isn't new. The Soviets listened to typewriters back in the 1970s.