Comment by nonrandomstring

2 days ago

Elaborate memory management (paging) systems need caching of lookups for high performance. But they can go wrong. The post was made in a security/safety context, but did I miss something? It didn't seem to make clear what the dangers are.

I only know x86/64, but I assume most page table caching would be somewhat similar.

Basically, if you don't handle the TLB properly, the CPU will not know that page mappings and/or page permissions have changed. So if you had a page mapped RW, then changed the mapping to RO (e.g. when setting up COW) but failed to flush the TLB (or at least issue INVLPG for that entry), the CPU might use the stale permissions and grant write access to the page when it shouldn't. The same can happen when you remap a region of the VA space to a different physical page: the next bit of code to touch it would hit the old page (and who knows what state that page is in, or what it's now being used for).
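
To make that concrete, here's a rough sketch of the COW case. This is an illustration only, not real kernel code (the PTE_RW bit and the function name are made up, real kernels go through their own PTE accessors, and INVLPG is a privileged instruction, so this only makes sense in kernel context):

    #include <stdint.h>

    #define PTE_RW (1ULL << 1)   /* illustrative "writable" bit in a page-table entry */

    /* Demote a page to read-only (e.g. when setting up COW), then invalidate
     * its TLB entry for the current address space. */
    static void demote_to_readonly(volatile uint64_t *pte, void *vaddr)
    {
        *pte &= ~PTE_RW;   /* the page tables now say read-only... */

        /* ...but the TLB may still hold the old RW translation. Without this
         * INVLPG (or a broader flush), the CPU can keep honouring the stale
         * entry and allow writes the page tables no longer permit. */
        asm volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
    }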

The TLB is not super-complicated, but it has some quirks (it's been so long since I've done anything with it that the PCID handling rules were new to me; PCIDs weren't even supported back then).

The article (towards its end) discusses a serious bug in the INVLPG instruction of Intel's Gracemont cores (the E-cores used in Alder Lake, Raptor Lake, Raptor Lake Refresh, Alder Lake N, Amston Lake, and Twin Lake): in certain circumstances it fails to invalidate all the entries it should.

I'm no expert on TLB invalidation bugs, but generally they allow an attacker to read/write arbitrary memory.

https://googleprojectzero.blogspot.com/2019/01/taking-page-f...

  • I don't mean to be a pedant, so someone please correct me if I'm wrong, but I don't think TLB mishandling would result in arbitrary memory access (I suppose in the strictest sense arbitrary can just mean random, but generally I have understood it to imply that the address can be attacker controlled, which a stale TLB wouldn't allow).

    Unless you're like Microsoft (from your link) and accidentally leave the page tables writable from userspace for 2 months. But that's not really a TLB error, that's just L-O-L, wow!

    • Random access is arbitrary access, given enough time. You can try over and over again until you get lucky.

      Imagine I'm a user with local shell access trying to read a secret owned by root. Maybe I can't read the secret, but I can do something which makes another program read the secret. If I can make that program swap (perhaps by wasting a bunch of RAM to create memory pressure), and swapping has some probability of triggering a TLB invalidation bug that lets me see the old page, I win, although it might take a while.

  • Read-Write-eXecute (RWX) memory regions can be found in the JavaScript, Java, Dalvik (Android), and Python runtimes.

    • Modern JavaScript engines (namely V8) avoid RWX mappings, although last time I checked there had been some backsliding as part of the WASM implementation.

      CPython also no longer appears to create RWX mappings even for ctypes, although you can of course still mmap them manually.
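
      For what it's worth, "mmap them manually" just means asking for all three protections on one mapping. A minimal C sketch (Python's mmap module exposes the same PROT_* flags on Unix; hardened kernels may refuse the request outright):

          #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
          #include <stdio.h>
          #include <sys/mman.h>

          int main(void)
          {
              /* One page that is readable, writable, and executable at once. */
              void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
              }
              printf("RWX mapping at %p\n", p);
              munmap(p, 4096);
              return 0;
          }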

It looks like the last 20 or so pages of the PDF contain two case studies. I read the first one, which led to (nondeterministic) kernel errors.

Perhaps “hacker” should be “crazy bug debugger”, but anybody who is working with TLB issues is a hacker in my book.

There is no “CVE” vulnerability in the slides, for sure.

I conclude that the title is wrong. Not every developer needs to know these things - only kernel developers need to know about TLB invalidation.

  • Every developer needs to know that cache invalidation is one of the two hard things in computer science - and that people further down in your stack occasionally get it wrong.