50ns on a 3GHz CPU core is ~150 cycles. Pushing the registers to L1 cache and popping them back is 5-10 cycles each. With 16 general-purpose registers to handle on x86-64, this is already close to or even more than 150 cycles, no?
Also: register renaming is a thing, as is write combining and pipelining. You're not flushing to L1 synchronously for every register, or ordinary userspace function calls would regularly take hundreds of cycles for handling saved registers. They don't.
No need to guess, it's 10 lines of code. And you can use bpftrace to watch the test program enter the kernel.
Using the libc wrapper will use the vdso. Using syscall() will enter the kernel.
I haven't measured, but calling the vdso should be closer to 5ns.
Someone else did more detailed measurements here:
https://arkanis.de/weblog/2017-01-05-measurements-of-system-...
When you measure, what numbers do you get?