Comment by ot

1 month ago

You can go even faster, to about 8ns (almost an additional 10x improvement), by using software perf events. PERF_COUNT_SW_TASK_CLOCK is thread CPU time, and it can be read through a shared page (so no syscall, see perf_event_mmap_page). You then add the delta since the last context switch with a single rdtsc call, all within a seqlock.

This is not well documented unfortunately, and I'm not aware of open-source implementations of this.

EDIT: Or maybe not. I'm not sure whether PERF_COUNT_SW_TASK_CLOCK lets you select only user time. The kernel can definitely do it, but I don't know if the wiring is there. This definitely works for overall thread CPU time, though.
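For readers who want to see the shape of the read path described above, here is a minimal x86-only sketch. It follows the seqcount loop and cycle-to-nanosecond conversion documented in the comments of `<linux/perf_event.h>`. The names `read_thread_cpu_ns`, `cycles_to_delta_ns`, and `compiler_barrier` are made up for illustration. It assumes the page belongs to an event opened on the calling thread, and that `time_enabled` for such an event tracks the thread's on-CPU time; treat that as an assumption to check against `clock_gettime` on your kernel, not as a confirmed implementation of what the comment describes.

```c
#include <linux/perf_event.h>   /* struct perf_event_mmap_page */
#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc(); this sketch is x86-only */

/* Compiler barrier, standing in for the kernel's barrier(). */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Cycle count -> nanoseconds since the page was last updated, using the
 * formula given in the comments of <linux/perf_event.h>. */
static uint64_t cycles_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                                   uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem  = cyc & (((uint64_t)1 << time_shift) - 1);
    return time_offset + quot * time_mult + ((rem * time_mult) >> time_shift);
}

/* Hypothetical helper: returns nanoseconds, or -1 if the kernel does not
 * advertise user-space time conversion (cap_user_time) for this page. */
static int64_t read_thread_cpu_ns(const volatile struct perf_event_mmap_page *pc)
{
    uint32_t seq, time_mult = 0;
    uint16_t time_shift = 0;
    uint64_t enabled, cyc = 0, time_offset = 0;
    int have_time;

    do {
        seq = pc->lock;                 /* sequence counter before the reads */
        compiler_barrier();

        have_time = pc->cap_user_time;
        enabled   = pc->time_enabled;   /* ASSUMPTION: on-CPU time at last update */
        if (have_time) {
            cyc         = __rdtsc();
            time_offset = pc->time_offset;
            time_mult   = pc->time_mult;
            time_shift  = pc->time_shift;
        }

        compiler_barrier();
    } while (pc->lock != seq);          /* page changed underneath us: retry */

    if (!have_time)
        return -1;
    return (int64_t)(enabled +
                     cycles_to_delta_ns(cyc, time_offset, time_mult, time_shift));
}
```

The whole read is a handful of loads plus one `rdtsc`, which is where the single-digit-nanosecond figure would come from. The result counts from the moment the event was enabled, so it is only meaningful as a difference between two reads on the same thread.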

9 comments

jerrinot 1 month ago

That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!

ot 1 month ago

Yes, you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)
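A rough sketch of what that lazy thread-local setup could look like on Linux. The function name `task_clock_page` and the two thread-local variables are hypothetical. The code opens a PERF_COUNT_SW_TASK_CLOCK event on the calling thread and maps only the first (metadata) page. Whether the open succeeds depends on `perf_event_paranoid` and the process's privileges, so the sketch returns NULL on failure and leaves the fallback to the caller.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Per-thread cache: set up on first use, then reused for the thread's lifetime. */
static _Thread_local struct perf_event_mmap_page *tls_page; /* NULL if unavailable */
static _Thread_local int tls_tried;

/* Hypothetical helper: returns this thread's perf mmap page, or NULL if
 * perf_event_open or mmap failed (for example due to permissions). */
static struct perf_event_mmap_page *task_clock_page(void)
{
    if (tls_tried)
        return tls_page;
    tls_tried = 1;                      /* attempt the setup only once per thread */

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type   = PERF_TYPE_SOFTWARE;
    attr.size   = sizeof attr;
    attr.config = PERF_COUNT_SW_TASK_CLOCK;

    /* pid = 0, cpu = -1: follow the calling thread on whatever CPU it runs. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return NULL;

    /* Map just the metadata page; no ring buffer pages are requested. */
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* The fd is intentionally kept open and never closed in this sketch;
     * a real implementation would release it when the thread exits. */
    tls_page = p;
    return tls_page;
}
```

The one-time cost is a syscall, an mmap, and one file descriptor per thread, which is why this only pays off for long-lived threads.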
catlifeonmars 1 month ago

I guess if you need the concurrency/throughput you should use a userspace green thread implementation. I’m guessing most implementations of green threads multiplex onto long-running OS threads anyway.
  

nly 1 month ago

Why do you need a seqlock? To make sure you're not context switched out between the read of the page value and the rdtsc?

Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

Tbh I thought clock_gettime was a vDSO-based virtual syscall anyway.

ot 1 month ago

> Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

Yes, that's exactly what a seqlock (reader) is.
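To make the "read, re-check, retry" idea concrete, here is a generic single-writer seqlock sketch in C11 atomics. It is not the kernel's code; `struct sample`, `sample_write`, and `sample_read` are illustrative names, and the two payload fields merely stand in for the values a perf page would publish.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sample {
    atomic_uint      seq;          /* even: stable, odd: write in progress */
    _Atomic uint64_t base_ns;      /* payload published by the writer */
    _Atomic uint64_t base_cycles;
};

/* Single writer: bump seq to odd, update the payload, bump seq to even. */
static void sample_write(struct sample *s, uint64_t ns, uint64_t cycles)
{
    unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->base_ns, ns, memory_order_relaxed);
    atomic_store_explicit(&s->base_cycles, cycles, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader: takes no lock. Retries if a write was in progress (odd seq)
 * or if seq changed between the first and second look. */
static void sample_read(struct sample *s, uint64_t *ns, uint64_t *cycles)
{
    unsigned before, after;
    do {
        before  = atomic_load_explicit(&s->seq, memory_order_acquire);
        *ns     = atomic_load_explicit(&s->base_ns, memory_order_relaxed);
        *cycles = atomic_load_explicit(&s->base_cycles, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after   = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

The reader never blocks the writer. In the uncontended case it costs two extra loads of the counter, which is why the pattern suits data that is written rarely (on a context switch) and read often.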

mgaunard 1 month ago

clock_gettime is not doing a syscall, it's using the vDSO.

jerrinot 1 month ago

clock_gettime() goes through the vDSO shim, but whether it avoids a syscall depends on the clock ID and (in some cases) the clock source. For thread-specific user CPU time, the vDSO shim cannot resolve the request in user space and must transition into the kernel. In this specific case, there is absolutely a syscall.
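For reference, the baseline call under discussion looks roughly like this. The wrapper name `thread_cpu_time_ns` is made up for illustration. Note that `CLOCK_THREAD_CPUTIME_ID` reports total (user plus system) CPU time for the calling thread, not user time alone, so this sketch only shows the general shape of a per-thread CPU clock read that ends up in the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper: CPU time consumed by the calling thread, in
 * nanoseconds, or -1 on error. */
static int64_t thread_cpu_time_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}
```

Running a loop of these calls under `strace -c`, or timing it against the same loop with `CLOCK_MONOTONIC`, is an easy way to see which clock IDs stay in user space on a given system.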