
Comment by menaerus

17 days ago

I'm on my mobile. Store-to-L1 width is typically 32B, and you're probably right that the CPU will take advantage of it and pack as many registers as it can. That still means 4x store and 4x load for 16 registers, which is ~40 cycles. So 100 cycles for the rest? Still feels minimal.

A modern x86 processor has about 200 physical registers that get mapped to the 16 architectural registers, and similarly for the floating-point registers. It's unlikely that anything is getting written to cache. Additionally, any writes, absent explicit synchronization or dependencies, will be pipelined.

It's easy to measure how long it takes to push and pop all registers, as well as writing a moderate number of entries to the stack. It's very cheap.

As far as switching into the kernel goes -- the syscall instruction is more or less just setting a few permission bits and acting as a speculation barrier; there's no reason for that to be expensive. I don't have numbers for its cost in isolation, but it's entirely unsurprising to me that the majority of the cost is in shuffling registers around. (The post-Spectre TLB flush has a cost, but ASIDs mitigate it, and measuring the time spent entering and exiting the kernel wouldn't show it even if ASIDs weren't in use.)

  • Where are the state/registers written to, then, if not L1? I'm confused.

    What do you say about the measurements from https://gms.tf/on-the-costs-of-syscalls.html? The table suggests the cost is an order of magnitude larger, from 250 to 620 ns depending on the CPU.

    • The architectural registers can be renamed to physical registers. https://en.wikipedia.org/wiki/Register_renaming

      As far as that article goes, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order-of-magnitude variation. It also doesn't say which syscall is being made -- __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.

      That makes me suspect something else is up in that benchmark.

      For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOVs is very cheap. 100 million calls to this assembly cost 200 ms, or about 2 ns/call:

          f:
         .LFB6:
              .cfi_startproc
              pushq   %rbp
              .cfi_def_cfa_offset 16
              .cfi_offset 6, -16
              movq    %rsp, %rbp
              .cfi_def_cfa_register 6
              subq    $8, %rsp
              movq    $42, -128(%rbp)
              movq    $42, -120(%rbp)
              movq    $42, -112(%rbp)
              movq    $42, -104(%rbp)
              movq    $42, -96(%rbp)
              movq    $42, -88(%rbp)
              movq    $42, -80(%rbp)
              movq    $42, -72(%rbp)
              movq    $42, -64(%rbp)
              movq    $42, -56(%rbp)
              movq    $42, -48(%rbp)
              movq    $42, -40(%rbp)
              movq    $42, -32(%rbp)
              movq    $42, -24(%rbp)
              movq    $42, -16(%rbp)
              movq    $42, -8(%rbp)
              nop
              leave
              .cfi_def_cfa 7, 8
              ret
