Comment by neuroelectron

2 months ago

There are fast instructions (e.g., REP STOSx, AVX zero stores, dc zva) and tricks (MTE, zero pages), but no magic CPU instruction exists that transparently and efficiently randomizes or zeros the stack on function calls. You think there would be one and I bet there are on some specialized high-security systems, but I'm not sure even where you would find such a product. Telecom certainly isn't it.

7 comments

neuroelectron

db48x 2 months ago

There are proposed cpu architectures that work that way, like the Mill <https://millcomputing.com/>. Where most cpus support multiple calling conventions the Mill enforces a single calling convention in hardware. There is a hardware `call` instruction that does all the work directly, along with a corresponding `ret` instruction for returning from a function call. It also uses its equivalent of the TLB to ensure that each function is only granted permission to read from that portion of the stack which contains its arguments; any attempt to read outside that region would result in a permission error that causes the read to return a NaR (Not a Result, akin to a floating point NaN).

As an additional protection, new stack frames are implicitly zeroed as they are created. I assume this is done by filling the CPU cache with zeros for those addresses before continuing to execute the called function. No need to wait for actual zeros to be written to main memory.

https://millcomputing.com/wiki/Protection#Protecting_Stacks

smj-edison 2 months ago
This is really interesting—how do stack references work in this design?
- db48x 2 months ago
  
  Technically I think you can read the “whole” stack; it’s only reads off of the ends of the stack as a whole that are prevented. However, note that the start of your current stack may not really be the start of the real stack.
  Consider the case of a system call, such as `read`. You’re in user space and you have some stack frames on the stack as usual. You allocate a buffer on the stack (there’s a cpu instruction for that; it basically just extends your “turf¹” to include more of the stack page, and zeros it as mentioned) to hold the data you want to read. You then call `read` with the `call` instruction, including the address of the buffer and the buffer size as arguments. So far everything is very straight–forward.
  But `read` is actually in a different protection domain; it’s part of the kernel. The CPU uses metadata previously set up by the kernel to turn this into a “portal call”. After the portal call your thread will be given a different protection domain. In principle this is the kernel’s protection domain, but in reality the kernel might split that up in many complicated ways. What is relevant here is that the turf of this protection domain has been modified to include this new stack frame. From the perspective of `read`, the stack has just started; there are no prior frames. The reality is that this stack frame is still part of the stack of the caller, it’s only the turf that has changed. Those prior stack frames still exist, but they are unreadable. Worse, the buffer is also unreadable; it’s located at an address that is not part of the kernel’s turf.
  So obviously there needs to be another set of instructions for modifying turfs. The full set of obvious modifications are available, but the relevant one here is a temporary grant of read and/or write permissions to a function you are about to call. You would insert a `pass` instruction to pass along access to the buffer for the duration of the call. This access is automatically revoked after the call returns. (Ideally you wouldn’t actually have to do this manually for every portal call; instead you would call a non–portal `read` function in libc. This function’s job is to make the portal call, and whoever wrote it makes sure to include the `pass` instruction.)
  ¹ A turf is the set of addresses that a given thread running in a given protection domain can read and/or write.

mjevans 2 months ago

You couldn't do random, but with a predictable performance hit to memory, cache and write-line use stack addresses COULD be isolated for a program, for a library, etc.

It'd be expensive though; every context switch would require it's own stack and pushing / restoring one more register. There's GOOD reason programs don't work that way and are supposed to not rely on values outside of properly initialized (and not later clobbered) memory.

neuroelectron 2 months ago
It should be efficient though, that's the point. Specialized hardware or instructions should be able to zero the stack in a single cycle, instead it's much more expensive. Of course the problem with this is it could be used to hide things just as easily, making it impossible to reverse engineer an unknown exploit.
- mjevans 2 months ago
  
  Why would a specialized instruction be necessary? 'the stack' is stored in memory just like everything else.
  Expensive is the (very slow for modern CPUs) operation of _writing_ that change in value out to memory at it's distant and slow speed compared to that which the CPU operates at, as well as the overhead of synchronizing that write to any other caches of those memory locations.
  Maybe you're thinking of the trick of a band new page of memory mapped memory that is 'zeroed' but is in reality just a special 'all zeros' page in the virtual to physical memory lookup table? Those still need to be zeroed by real writes at some point, if they're ever used.

dwattttt 2 months ago

CPUs already special case xor reg,reg as zeroing out the register, breaking any data dependency on it. If zeroing bits of the stack were common enough, I'd believe CPUs could be made that handled it efficiently (they already special case the stack; push/pop)