Comment by ori_b
17 days ago
The architectural registers can be renamed to physical registers. https://en.wikipedia.org/wiki/Register_renaming
As for that article, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order of magnitude variation. It also doesn't say which syscall is being made: __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.
That makes me suspect something else is up in that benchmark.
For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOVs is very cheap. 100 million calls to this assembly cost 200ms, or about 2ns/call:
f:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $8, %rsp
movq $42, -128(%rbp)
movq $42, -120(%rbp)
movq $42, -112(%rbp)
movq $42, -104(%rbp)
movq $42, -96(%rbp)
movq $42, -88(%rbp)
movq $42, -80(%rbp)
movq $42, -72(%rbp)
movq $42, -64(%rbp)
movq $42, -56(%rbp)
movq $42, -48(%rbp)
movq $42, -40(%rbp)
movq $42, -32(%rbp)
movq $42, -24(%rbp)
movq $42, -16(%rbp)
movq $42, -8(%rbp)
nop
leave
.cfi_def_cfa 7, 8
ret
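For anyone who wants to try a measurement like this, here is a minimal sketch of a timing loop in C. It is not the harness behind the number above, which isn't shown. `f` here is a hypothetical C stand-in for the assembly (sixteen 8-byte stores to the stack), and `now_ns` and `ns_per_call` are made-up helper names. It assumes GCC or Clang on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the assembly: sixteen 8-byte stores to a
   stack buffer. volatile keeps the compiler from deleting the stores;
   noinline (GCC/Clang extension) keeps a real call in the loop. */
__attribute__((noinline)) void f(void)
{
    volatile uint64_t slots[16];
    for (int i = 0; i < 16; i++)
        slots[i] = 42;
}

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per call of fn over `calls` iterations.
   Loop and indirect-call overhead are included in the result. */
double ns_per_call(void (*fn)(void), uint64_t calls)
{
    assert(calls > 0);
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < calls; i++)
        fn();
    return (double)(now_ns() - start) / (double)calls;
}
```

Calling `ns_per_call(f, 100000000)` from a `main` and printing the result gives a per-call figure to compare against. Since the loop overhead is included, treat it as an upper bound on the cost of the stores themselves.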
The benchmark is simple, but I find it worthwhile because (1) it is run across 15 different platforms (different CPUs and libcs) and the results are pretty much reproducible, and (2) it is run through gbenchmark, which has a mechanism to make the measurements statistically significant.
An interesting thing that reinforces their hypothesis and measurements is that, for example, getpid and clock_gettime_mono_raw run much faster on some platforms (vDSO) than on the rest.
Also, the variance between different CPUs is IMO what reinforces their results, not the other way around: I don't expect the same call to have the same cost on different CPU models. Different CPUs, different cores, different clock frequencies, different design tradeoffs, etc.
The code is here: https://github.com/gsauthof/osjitter/blob/master/bench_sysca...
The syscall() row invokes a simple syscall(423), and it seems to be expensive. Other calls such as close(999), getpid(), getuid(), clock_gettime(CLOCK_MONOTONIC_RAW, &ts), and sched_yield() produce similar results. All of them are basically an order of magnitude larger than 50ns.
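A rough way to time a raw syscall without gbenchmark is sketched below. This is not the code from the linked repository: `syscall_ns` and `now_ns` are hypothetical helpers, and it assumes Linux with glibc's `syscall()` wrapper. It only handles syscalls that take no arguments.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds for one argument-less raw syscall `nr`.
   The return value of the syscall is ignored on purpose, so this also
   works for numbers the kernel rejects. */
double syscall_ns(long nr, uint64_t iters)
{
    assert(iters > 0);
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < iters; i++)
        syscall(nr);
    return (double)(now_ns() - start) / (double)iters;
}
```

For example, `syscall_ns(SYS_getpid, 1000000)` or `syscall_ns(SYS_sched_yield, 1000000)`. Going through `syscall()` always enters the kernel, so it bypasses any vDSO fast path a libc wrapper might use.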
As for register renaming, I know what it is, but I still don't get what register renaming has to do with making the storage of state (registers) a cheaper operation.
This is from the Intel manual:
So, I wrongly assumed that the core has to wait until the data is completely written, but it seems it acts more like a memory barrier with relaxed properties: instructions are serialized, but the written data doesn't have to become globally visible.
I think the most important aspect of it is "until all instructions prior to the SYSCALL have completed". This means that the whole pipeline has to be drained. With a 20+ stage instruction pipeline, and who knows what instructions in it, I can imagine that this can easily become the most expensive part of the syscall.
I can't reproduce. When I run the code at https://github.com/gsauthof/osjitter/blob/master/bench_sysca..., here are the numbers on the computers I have:
So, I've tested multiple times in multiple ways, and the results don't seem to match.
Interesting, because on my machine I can reproduce the results. It's a pretty hefty 5.3GHz, recentish (Raptor Lake) Intel i7-13850HX CPU:
EDIT: also reproducible on my Skylake-X (Gold 6152) machine
With turbo-boost enabled (@3.7GHz):
With turbo-boost disabled (@2.1GHz base frequency):
I wonder why your results are so different. Mine scale almost linearly with the core frequency.
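If the cost is roughly constant in cycles, the time per call should scale inversely with the clock frequency. A small sketch of that conversion, with hypothetical helper names and made-up example numbers (not measurements from this thread):

```c
/* Convert a duration in nanoseconds to core cycles at a given clock (GHz).
   1 GHz is one cycle per nanosecond, so cycles = ns * GHz. */
double ns_to_cycles(double ns, double ghz)
{
    return ns * ghz;
}

/* Inverse: how long a fixed cycle count takes at a given clock (GHz). */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles / ghz;
}
```

For instance, a made-up 100 ns at 3.7 GHz is 370 cycles, and the same 370 cycles at 2.1 GHz would take about 176 ns.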