Comment by ori_b

17 days ago

A modern x86 processor has about 200 physical registers that get mapped to the 16 architectural registers, with a similar arrangement for the floating-point registers. It's unlikely that anything is getting written to cache. Additionally, any writes, absent explicit synchronization or dependencies, will be pipelined.

It's easy to measure how long it takes to push and pop all registers, or to write a moderate number of entries to the stack. It's very cheap.

As far as switching into the kernel -- the syscall instruction is more or less just setting a few permission bits and acting as a speculation barrier; there's no reason for that to be expensive. I don't have information on the cost in isolation, but it's entirely unsurprising to me that the majority of the cost is in shuffling around registers. (The post-spectre TLB flush has a cost, but ASIDs mitigate the cost, and measuring the time spent entering and exiting the kernel wouldn't show it even if ASIDs weren't in use)


menaerus 17 days ago

Where is the state/registers written to then if not L1? I'm confused.

What do you say about the measurements from https://gms.tf/on-the-costs-of-syscalls.html? The table suggests that the cost is an order of magnitude larger, from 250 to 620 ns depending on the host CPU.

ori_b 17 days ago
The architectural registers can be renamed to physical registers. https://en.wikipedia.org/wiki/Register_renaming
As for that article, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order of magnitude of variation. It also doesn't say what syscall is being done -- __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.
That makes me suspect something else is up in that benchmark.
For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOVs is very cheap. 100 million calls to this assembly cost 200 ms, or about 2 ns/call:
    f:
    .LFB6:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq %rsp, %rbp
        .cfi_def_cfa_register 6
        subq $8, %rsp
        movq $42, -128(%rbp)
        movq $42, -120(%rbp)
        movq $42, -112(%rbp)
        movq $42, -104(%rbp)
        movq $42, -96(%rbp)
        movq $42, -88(%rbp)
        movq $42, -80(%rbp)
        movq $42, -72(%rbp)
        movq $42, -64(%rbp)
        movq $42, -56(%rbp)
        movq $42, -48(%rbp)
        movq $42, -40(%rbp)
        movq $42, -32(%rbp)
        movq $42, -24(%rbp)
        movq $42, -16(%rbp)
        movq $42, -8(%rbp)
        nop
        leave
        .cfi_def_cfa 7, 8
        ret
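A harness along the following lines could reproduce that kind of measurement. It is only a sketch, not the code behind the numbers above: the C function `f` is a hypothetical stand-in for the assembly routine (16 stores to a stack buffer), and `ns_per_call` and `time_calls` are names invented for illustration. The arithmetic is the same as in the comment: 200 ms spread over 100 million calls is 2 ns per call.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the assembly routine: 16 stores to the stack.
 * volatile keeps the compiler from deleting the stores; noinline keeps the
 * call itself in the measured loop. */
__attribute__((noinline)) static void f(void)
{
    volatile uint64_t slots[16];
    for (int i = 0; i < 16; i++)
        slots[i] = 42;
}

/* Convert a total elapsed time in nanoseconds into an average per call. */
static double ns_per_call(double total_ns, uint64_t calls)
{
    assert(calls > 0);
    return total_ns / (double)calls;
}

/* Time `calls` invocations of f() against the monotonic clock and
 * return the average cost of one call in nanoseconds. */
static double time_calls(uint64_t calls)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < calls; i++)
        f();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                    + (double)(end.tv_nsec - start.tv_nsec);
    return ns_per_call(total_ns, calls);
}
```

Calling `time_calls(100000000)` from `main` and printing the result gives a per-call figure comparable to the one quoted; the exact value depends on the compiler, flags, and CPU.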
- menaerus 16 days ago
  
  The benchmark is simple, but I find it worthwhile because (1) it is run across 15 different platforms (different CPUs, libc's) and the results are pretty much reproducible, and (2) it is run through gbenchmark, which has a mechanism to make the measurements statistically significant.
  An interesting detail that reinforces their hypothesis and measurements is that, for example, getpid and clock_gettime_mono_raw run much faster on some platforms (vDSO) than on the rest.
  Also, the variance between different CPUs is what IMO reinforces their results, not the other way around - I don't expect the same call to have the same cost on different CPU models. Different CPUs, different cores, different clock frequencies, different tradeoffs in design, etc.
  The code is here: https://github.com/gsauthof/osjitter/blob/master/bench_sysca...
  The syscall() row invokes a simple syscall(423), and it seems to be expensive. Other calls such as close(999), getpid(), getuid(), clock_gettime(CLOCK_MONOTONIC_RAW, &ts), and sched_yield() produce similar results. All of them are basically an order of magnitude larger than 50ns.
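For readers who want to try this themselves, a raw syscall round trip can be timed roughly as follows. This is a minimal sketch for Linux and not the code from the linked repository (which uses a benchmarking framework); `syscall_ns` is a name made up here, and what a given number such as 423 maps to depends on the architecture and kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock cost, in nanoseconds, of `iters` raw invocations of
 * syscall number `nr` with no arguments. The loop and the two clock reads
 * are included in the measurement, so treat the result as an upper bound. */
static double syscall_ns(long nr, uint64_t iters)
{
    struct timespec start, end;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iters; i++)
        syscall(nr);  /* return value ignored: only the round trip matters */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                    + (double)(end.tv_nsec - start.tv_nsec);
    return total_ns / (double)iters;
}
```

For example, `syscall_ns(SYS_getpid, 1000000)` goes through the kernel on every iteration, since the raw `syscall()` wrapper bypasses any libc or vDSO shortcut; `syscall_ns(423, 1000000)` mirrors the call named above.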
  As for register renaming, I know what it is, but I still don't get what it has to do with making the storage of state (registers) a cheaper operation.
  This is from Intel manual:
  Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).
  So, I wrongly assumed that the core has to wait until the data is completely written, but it seems to act more like a memory barrier with relaxed properties - instructions are serialized, but the data written doesn't have to become globally visible.
  I think the most important aspect of it is "until all instructions prior to the SYSCALL have completed". This means that the whole pipeline has to be drained. With a 20+ stage instruction pipeline, and whatever instructions happen to be in it, I can imagine that this can easily become the most expensive part of the syscall.
  