
Comment by ori_b

17 days ago

No need to guess, it's 10 lines of code. And you can use bpftrace to watch the test program enter the kernel.

Using the libc wrapper will use the vdso. Using syscall() will enter the kernel.

I haven't measured, but calling the vdso should be closer to 5ns.
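For anyone who wants to try it, here is a rough sketch of the kind of test meant above, assuming Linux on x86-64 with glibc. The helper names `now_ns` and `ns_per_call` are made up for illustration, and `CLOCK_MONOTONIC` is an arbitrary choice of clock. It times the same call both ways: through the libc wrapper, and through `syscall()`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: read CLOCK_MONOTONIC as nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: average cost in ns of one clock_gettime call.
   use_raw_syscall == 0: libc wrapper, which normally goes through the vDSO.
   use_raw_syscall != 0: syscall(SYS_clock_gettime, ...), which enters the kernel. */
static double ns_per_call(int use_raw_syscall, long iters) {
    struct timespec ts;
    assert(iters > 0);
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++) {
        if (use_raw_syscall)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        else
            clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    return (double)(now_ns() - start) / (double)iters;
}
```

Call `ns_per_call(0, n)` and `ns_per_call(1, n)` from `main` with a few million iterations and print both. The averages include loop overhead, and on machines where the clock source forces the vDSO to fall back to a real syscall the two numbers will look about the same.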

Someone else did more detailed measurements here:

https://arkanis.de/weblog/2017-01-05-measurements-of-system-...

50ns on a 3GHz CPU core is ~150 cycles. Pushing a register to L1 cache and popping it back is 5-10 cycles each. With 16 general-purpose registers to handle on x86-64, isn't that already close to, or even more than, 150 cycles?

  • When you measure, what numbers do you get?

    Also: register renaming is a thing, as are write combining and pipelining. You're not flushing to L1 synchronously for every register, or ordinary userspace function calls would regularly take hundreds of cycles for handling saved registers. They don't.

    • I'm on my mobile. Store width to L1 is typically 32B, and you're probably right that the CPU will take advantage of it and pack as many registers as it can. This still means 4 stores and 4 loads for 16 registers. This is ~40 cycles. So 100 cycles for the rest? Still feels minimal.
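
The back-of-envelope numbers in this subthread can be reproduced with a small sketch. It only restates the arithmetic from the comments and does not model what the CPU actually does. The function names are hypothetical, and the 5 cycles per store or load is an assumption taken from the low end of the 5-10 cycle range quoted above.

```c
#include <assert.h>

/* Cycles elapsed in `ns` nanoseconds at `ghz` GHz (1 GHz = 1 cycle per ns). */
static double cycles_in(double ns, double ghz) {
    return ns * ghz;
}

/* Stores (or loads) needed to move `regs` registers of `reg_bytes` each
   when a single operation moves `width_bytes`. Rounds up. */
static int wide_ops(int regs, int reg_bytes, int width_bytes) {
    assert(width_bytes > 0);
    return (regs * reg_bytes + width_bytes - 1) / width_bytes;
}

/* Cycles left in `budget` after `ops` stores plus `ops` loads,
   each assumed to cost `cycles_per_op`. */
static double remaining_cycles(double budget, int ops, double cycles_per_op) {
    return budget - 2.0 * ops * cycles_per_op;
}
```

With these inputs, `cycles_in(50.0, 3.0)` gives the 150-cycle budget, `wide_ops(16, 8, 32)` gives the 4 stores and 4 loads, and `remaining_cycles(150.0, 4, 5.0)` leaves 110 cycles, in line with the "~40 cycles" and "100 cycles for the rest" estimate.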
