Wait, wait, wait: browsers allow websites to store junk on my drive? They take up gigabytes of memory and still write to disk on top of this? Without even asking whether the site can use local storage?
Years and years back when laptops still had HDDs, I had a script to put the Firefox profile &c on a ramdisk and sync it on reboots so that it didn't spin up the drive constantly. I guess I should have kept doing it.
It's a sad day when Arch users are right (again) https://wiki.archlinux.org/title/Firefox/Profile_on_RAM
Browsers have an absolutely insane level of relatively unchecked permissions to do whatever they want on a client.
There's a lot of effort by browser developers to scope creep the browser into essentially being an OS-agnostic tech stack (one where, conveniently, code can be shipped across the network "as necessary", removing a lot of user agency for the software being run); Chrome being the biggest driver of this, while Firefox has an extremely weak spine in trying to limit it.
It's fairly dire and I wouldn't be surprised if there's a lot more of these side channel attacks in a lot of web APIs.
Now that we have AI, can we go back to real apps and native tech stacks? And revert the browser to a text-display interface?
Flash ended up getting blocked/banned by all browsers because it turned into a giant gaping security hole.
> By January 2021, all major browsers were blocking all Flash content unconditionally.
It looks like we-the-users need to be blocking any and every one of these parasites.
https://en.wikipedia.org/wiki/Adobe_Flash
It's also the technology that will allow software to run without a continuous connection to the server. If you want to break out of a world where companies own your data it's the tech that is needed.
My shortcut for launching "clean" Chromium session is `chromium --user-data-dir=$(mktemp -d)` -- each launch creates a new transient profile directory under /tmp, which is itself a RAM disk. Persistent settings are achieved by setting system-wide defaults in /etc/chromium, including using system-wide managed policy JSON.
Is this surprising? Websites have long been silently writing to disk, for cache, cookies, and blobs. OPFS just provides a file-system-like API for ultimately the same functionality
Yes? From the paper:
"On Chrome and Safari, OPFS supports very large files, up to 60 % of disk space, which is more than sufficient to avoid the page cache on most typical systems, as even a small disk size of 64 GB would allow us to create a 38.4 GB OPFS file."
I am indeed surprised to learn that a random website can write a file that takes up 60% of my disk. Is this obviously a capability of Web browsers?
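For what it's worth, the arithmetic in that quote checks out. A minimal sketch, assuming the 60 % figure from the quoted passage and a hypothetical 16 GB of RAM (the RAM size and the function names are illustrative, not from the paper):

```python
def max_opfs_file_gb(disk_gb: float, quota_fraction: float = 0.6) -> float:
    """Largest OPFS file under the quoted rule of 'up to 60 % of disk space'."""
    return disk_gb * quota_fraction


def can_outgrow_page_cache(disk_gb: float, ram_gb: float) -> bool:
    """True if the largest allowed file cannot fit entirely in RAM.

    A file bigger than RAM cannot be fully held in the OS page cache,
    so some random reads have to reach the SSD.
    """
    return max_opfs_file_gb(disk_gb) > ram_gb


# 64 GB disk from the quote; 16 GB of RAM is an assumed, illustrative value.
print(round(max_opfs_file_gb(64), 1))
print(can_outgrow_page_cache(disk_gb=64, ram_gb=16))
```

So on the small disk from the quote, the file could be more than twice the size of that assumed RAM, which is why the page cache cannot absorb all the reads.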
If you open an incognito window in Chromium, the profile is kept in RAM.
Hostile LLMs? In my browser? At this time of the year?
> Without even asking whether the site can use local storage?
Where did you see this in the article? I had some recollection that Firefox at least did require asking the user.
Firefox doesn't ask permission just to use localStorage; no modern browser does. The closest thing you get is when a site wants to persist storage with "navigator.storage.persist()", which should prompt you for permission. But localStorage data usually persists anyway, and only gets deleted if the browser's storage is "under pressure", so I've never personally worked on a site or web app that had to use that API.
That surprised me as well.
I thought the whole point of cookies, local storage, session storage, and IndexedDB was to avoid what the origin private file system is doing.
You mean I could have just saved stuff as a file this whole time instead of serializing it to a string? Why didn't we just do this from the start?
It's still sandboxed and deleted when the user clears private data for the website.
The main advantage it has over things like cookies, local storage, etc. is that it provides a byte-oriented, random access API and as a result, you can use third-party libraries like SQLite that expect a file API. Which is more important now that we have tools like Emscripten and WebAssembly that let you use existing C libraries on the web. At the same time it has security guarantees such that webpages cannot write arbitrary files that will be viewed and executed by the user.
Also, in theory you could use this side-channel attack on localStorage and sessionStorage. Its only requirement is that it needs an API that writes to disk where you can measure the latency of a synchronous call, since the fingerprinting is just measuring the interference pattern between disk accesses the attacking website does vs. disk accesses that other websites do.
And Web Developers want more and more OS features built into the browser. This is why I'm against it. Features are only ever abused.
> Even Meta and Yandex were recently caught joining in the privacy-invasive free-for-all.
Damn, even Meta have joined the dark side?
Sic transit gloria mundi :'(
I'm surprised their 1GB file wasn't cached entirely in RAM during the attack, eliminating the SSD from any timing. Do people keep their machines that heavily loaded that a file being constantly read from doesn't stay in the cache?
I’m skeptical of these side channel attacks that rely on training a neural network on specific controlled scenarios on controlled hardware. I believe that with enough time and effort and the perfect circumstances where the user is only visiting their website and doing one other thing that the network was trained on it can match.
It does not seem useful as a general purpose side channel vector.
Publish or perish. It worked once, in a controlled lab, mostly (80-90% guess). Good enough for more millions in funding..
Not really joking here.
https://hannesweissteiner.com/
https://hannesweissteiner.com/publications/frost/
It depends what you mean by "general purpose." First, these things generalize more often than you'd expect. Second, even in the absence of generalization they're still useful for, e.g., fingerprinting activities to manufacture a unique ID where none previously existed.
The paper isn’t describing a unique ID fingerprint. It’s looking for specific activity patterns to match against training data of running specific commands on specific hardware.
That's basically just a research, theoretical attack vector. It doesn't mean it's viable for general purpose old school mass privacy invasion
I was interested in this so I created a proof of concept: https://github.com/brammittendorff/opfs-ssd-timing
It should be fairly easy to mitigate, no? Simply add random access times. localStorage doesn't need to be that fast. More generally I find it very annoying how much browsers allow by default (JavaScript, localStorage, GPU access etc.) - there's only a very limited number of websites I want to be able to run GPU-accelerated shaders.
> Simply add random access times.
That doesn't work. Because the random delays are uniformly distributed, it's possible to remove them from the data by additional sampling. You do make it harder because you need a lot more data, but it's still possible to extract the signal, because the noise is uniform.
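A minimal sketch of that argument, as a toy simulation rather than anything from the paper. The latencies and jitter size are made-up values, and the attacker is assumed to know the jitter distribution:

```python
import random
import statistics


def noisy_latency(true_ms: float, jitter_ms: float, rng: random.Random) -> float:
    """One measurement: the real latency plus uniform jitter in [0, jitter_ms]."""
    return true_ms + rng.uniform(0.0, jitter_ms)


def estimate_latency(true_ms: float, jitter_ms: float, samples: int,
                     rng: random.Random) -> float:
    """Average many noisy measurements, then subtract the jitter's known mean."""
    mean = statistics.fmean(
        noisy_latency(true_ms, jitter_ms, rng) for _ in range(samples)
    )
    return mean - jitter_ms / 2.0


rng = random.Random(0)
idle_ms, busy_ms, jitter_ms = 0.10, 0.15, 5.0  # hypothetical values
for n in (10, 1_000, 100_000):
    idle = estimate_latency(idle_ms, jitter_ms, n, rng)
    busy = estimate_latency(busy_ms, jitter_ms, n, rng)
    print(f"{n:>7} samples: idle ~ {idle:.3f} ms, busy ~ {busy:.3f} ms")
```

The jitter here is 100 times larger than the 0.05 ms difference being measured, yet the estimates converge as the sample count grows, since the error of the average shrinks roughly with the square root of the number of samples. The defense only raises the amount of data the attacker needs.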
The interesting mitigation would be snapping I/O to a coarse clock.
You could then set it to hold the result until the next tick.
E.g., with an I/O tick of 20ms, it would only return on 20ms boundaries, and then almost every SSD would look the same.
It would slow down the API a bit, but privacy has tradeoffs.
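A rough sketch of the tick idea, assuming the 20ms tick from the comment above. The function names are hypothetical and this is not how any browser implements it:

```python
import math


def snap_to_tick(start_ms: float, raw_latency_ms: float, tick_ms: float = 20.0) -> float:
    """Time at which the result is released: the first tick boundary at or
    after the moment the real I/O finished."""
    done_ms = start_ms + raw_latency_ms
    return math.ceil(done_ms / tick_ms) * tick_ms


def observed_latency(start_ms: float, raw_latency_ms: float, tick_ms: float = 20.0) -> float:
    """Latency the calling page can measure once results are held until the tick."""
    return snap_to_tick(start_ms, raw_latency_ms, tick_ms) - start_ms


# A fast read and a four-times-slower read become indistinguishable.
print(observed_latency(0.0, 0.1), observed_latency(0.0, 0.4))
# An operation that outlasts a full tick still shows up as an extra tick.
print(observed_latency(0.0, 25.0))
```

Anything that finishes within the same tick looks identical to the page. Operations slow enough to cross a tick boundary still leak which tick they landed in, so the tick size trades off privacy against API speed.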
The random times don't have to be uniformly distributed. Though it's enough for attackers to know the distribution to de-noisify it.
Why don't SSDs trust websites anymore?
Because every time they open up, the site gives them the F̶R̶O̶S̶T̶ cold shoulder.
That's it, I'm turning off JavaScript for everything non essential from here on out.
Been doing that for years and do not regret it.
I laugh at your spying attempts from my HD-equipped laptop, ...
The paper (https://hannesweissteiner.com/pdfs/frost.pdf) cites two earlier papers (from 2014 and 2017) that did the same for HDDs.
I laugh at your spying attempts from my cURL e-mail service that I use instead of a web browser.
Or mine where the entire browser profile is on a ramdisk.
Higher IO latencies in HDs might actually make this attack easier - more contention means more bits of data.
I got $HOME in a huge HDD because it was cheaper. I guess we belong to the cool kids club now?
For a more technical read: https://news.ycombinator.com/item?id=48345822
Still don't really understand how it works - I put the reddit logo into your local storage and it only took 20ms to take it out again instead of 50ms so therefore you have reddit open in another tab?
I assume it's something like this:
Attacking website periodically makes random reads from a large file in localStorage. Other tabs and websites open have Javascript running that periodically performs operations that will result in SSD traffic. For example, GMail has a certain polling interval to check for new mail, and each request is going to result in a cache write that makes the SSD busy and delays other conflicting IO operations. Reddit checks for new chat messages. Large memory-heavy websites get paged out of RAM.
The pattern of IO operations that a website makes creates a fingerprint of interference with the IO ops that the attacking website is doing, showing up as differing amounts of latency as the SSD is periodically busy. This fingerprint can then be reconstructed to a specific website by training a CNN on it, basically using a neural net to classify a certain pattern of delays to the IO ops that other websites are doing.
In theory it makes sense, but it seems very noisy. Anything that makes absolutely zero requests or IO operations in the background (like say HN, or most old-school text sites) wouldn't show up, and would be indistinguishable from any other zero-request site. And having other sources of IOps on the same computer - say you're running an Ethereum client that's perpetually updating the blockchain, or you're downloading a bunch of torrents, or you've got Dropbox and it's syncing your directory - would introduce noise that throws off the classifier.
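The interference idea described above can be sketched as a toy simulation. This is not the paper's method: the paper trains a CNN on real traces, while this stand-in just checks which known polling period best matches the slow probes. The site names, periods, and latency values are all made up:

```python
import random


def simulate_trace(period, n=600, busy_len=3, base_ms=0.1,
                   extra_ms=0.3, noise_ms=0.05, rng=None):
    """Latency of n evenly spaced probe reads.

    Probes that land in the other site's periodic busy window are slower.
    period=None models a site that does no background I/O at all.
    """
    rng = rng or random.Random()
    trace = []
    for t in range(n):
        busy = period is not None and (t % period) < busy_len
        trace.append(base_ms + (extra_ms if busy else 0.0) + rng.uniform(0.0, noise_ms))
    return trace


def autocorr(trace, lag):
    """Unnormalized autocorrelation of the trace at the given lag."""
    mean = sum(trace) / len(trace)
    return sum((trace[i] - mean) * (trace[i + lag] - mean)
               for i in range(len(trace) - lag))


def guess_site(trace, profiles):
    """Pick the profile whose polling period best explains the slow probes."""
    return max(profiles, key=lambda name: autocorr(trace, profiles[name]))


# Hypothetical polling periods, measured in probe intervals.
profiles = {"site_a": 20, "site_b": 37}
print(guess_site(simulate_trace(20, rng=random.Random(1)), profiles))
print(guess_site(simulate_trace(37, rng=random.Random(2)), profiles))
```

The objections raised above show up in the toy too: a trace generated with `period=None` carries no pattern to match, and unrelated disk traffic would add slow probes that are not tied to either period.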
That's interesting. Thanks for the explanation. If I read this right this isn't as effective against spinning HD-based systems and there is a dependence on the user maintaining more than one tab as they browse?
If that's the case then my system which is still HD-based is not threatened and since I tend to close tabs and windows and just spin up a new private window for each site while clearing cookies, etc on exit then maybe this is a non-issue for me. Or maybe just block javascript too.
That's a good explanation. It does seem extremely noisy and not at all practical for fingerprinting a user compared to other methods. If you have JavaScript enabled, assume you can be fingerprinted.
That’s timing the cache; that’s old stuff by now. As I understand it, this writes a relatively large file („Gigabytes“) using the OPFS API, which is different from the „localStorage“ API. This seems to use actual filesystem storage on the client, instead of living completely in memory (which may be reasonable given the size of files supported). That makes it possible to actually time SSD I/O latency by doing random reads.
Collect enough of these samples, together with the information about what else is running on the host, put that in the ML blender, and the result will be able to tell you, with some accuracy, from a given set of samples, what’s running on the host.
I am sure I misunderstood some things, because there are so many caches and unknowns in that setup that I struggle to understand how there could be any correlation, but that’s my understanding so far.
I think what would be more interesting is using this as a side channel for communication between different sandboxed contexts.
It's really not surprising that letting websites run arbitrary code on your machine, even in a sandbox, would lead to things like this.
There's no such thing as a sandbox "on your machine" when you really think about it. The code still runs on the same hardware and there are tons of ways to fiddle with said hardware that could be exploited (like rowhammer). The only "real" sandbox is fully dedicated hardware down to bare metal with zero connections to sensitive systems.
And now that Google's web environment integrity is getting repackaged into captchas, it seems we won't even be able to try to block such things in the future...
Correction: websites have a way to spy on visitors: JavaScript.
Ahhh Ars Technica, I wonder if the technical article is by Dan Goodin. (It is)
I enjoyed his C Programming books for dummies series.
Maybe don’t let Google decide what its browser can and can’t do on my computer…
Why do browsers need to do this? Feels like an edge case need, at best, that was likely just a cover for some power Google wanted to exploit.
Sounds like nonsense. I guess this works in some test environment but not in real life. You would never know that I am running Tetris, for example.
{first.last}@tugraz.at