Comment by ori_b

17 days ago

When you measure, what numbers do you get?

Also: register renaming is a thing, as are write combining and pipelining. You're not flushing to L1 synchronously for every register, or ordinary userspace function calls would regularly take hundreds of cycles for handling saved registers. They don't.

I'm on my mobile. Store width to L1 is typically 32B, and you're probably right that the CPU will take advantage of it and pack as many registers as it can into each store. With 16 registers at 8 bytes each (128B), this still means 4x store and 4x load. This is ~40 cycles. So 100 cycles for the rest? Still feels minimal.

  • A modern x86 processor has about 200 physical registers that get mapped to the 16 architectural registers, with a similar number for the floating-point registers. It's unlikely that anything is getting written to cache. Additionally, any writes, absent explicit synchronization or dependencies, will be pipelined.

    It's easy to measure how long it takes to push and pop all registers, as well as writing a moderate number of entries to the stack. It's very cheap.

    As far as switching into the kernel -- the syscall instruction is more or less just setting a few permission bits and acting as a speculation barrier; there's no reason for that to be expensive. I don't have information on the cost in isolation, but it's entirely unsurprising to me that the majority of the cost is in shuffling around registers. (The post-Spectre TLB flush has a cost, but ASIDs mitigate the cost, and measuring the time spent entering and exiting the kernel wouldn't show it even if ASIDs weren't in use.)