Comment by amluto

12 hours ago

If you do this, please be aware that there is absolutely no guarantee that you will not observe time going backwards. You probably will not have one thread ask for the time twice in a row and get results that are out of order, but you can have thread 1 ask for the time and do a store-release and then have thread 2 do a load-acquire, observe thread 1’s write, and ask for the time, and thread 2’s time may be earlier than thread 1’s. This is because RDTSC by itself does not respect x86’s memory order — it does not act like a load.

source: I wrote a bunch of this code and I’ve tested it fairly extensively.
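
For anyone who wants to poke at this, the handoff described above can be sketched as a small C harness. This is a minimal sketch, not the test the commenter ran: the names (`read_clock`, `thread1`, `thread2`, `count_inversions`) and the round count are made up for illustration, and it assumes GCC or Clang with pthreads. Thread 1 reads the clock and publishes it with a store-release; thread 2 waits on a load-acquire, reads the clock, and counts rounds where its time is earlier. A nonzero count can also come from unsynchronized TSCs across cores, and a zero count proves nothing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* Plain RDTSC with no fence: the pattern the comment warns about. */
static uint64_t read_clock(void) { return __rdtsc(); }
#else
#include <time.h>
/* Assumption: not x86-64, so there is no TSC; use clock_gettime to stay buildable. */
static uint64_t read_clock(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

enum { ROUNDS = 10000 };

static uint64_t t1;        /* thread 1's timestamp, published through `seq` */
static atomic_uint seq;    /* 2r: slot free for round r; 2r+1: t1 published */
static long inversions;    /* written only by thread 2, read after join */

static void *thread1(void *arg) {
    (void)arg;
    for (unsigned r = 0; r < ROUNDS; r++) {
        /* Wait until thread 2 has consumed the previous round. */
        while (atomic_load_explicit(&seq, memory_order_acquire) != 2 * r)
            sched_yield();
        t1 = read_clock();
        atomic_store_explicit(&seq, 2 * r + 1, memory_order_release);
    }
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    for (unsigned r = 0; r < ROUNDS; r++) {
        /* Load-acquire: once this observes 2r+1, t1 is visible. */
        while (atomic_load_explicit(&seq, memory_order_acquire) != 2 * r + 1)
            sched_yield();
        uint64_t t2 = read_clock();   /* plain RDTSC is not ordered after the load above */
        if (t2 < t1)
            inversions++;
        atomic_store_explicit(&seq, 2 * r + 2, memory_order_release);
    }
    return NULL;
}

/* Runs the handoff ROUNDS times and returns how many inversions were seen. */
long count_inversions(void) {
    pthread_t a, b;
    int rc;
    atomic_store(&seq, 0);
    inversions = 0;
    rc = pthread_create(&a, NULL, thread1, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, thread2, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return inversions;
}
```

Calling `count_inversions()` from `main` and printing the result is enough to run it. Swapping `read_clock` for a fenced read or RDTSCP should keep the count at zero on a machine with synchronized TSCs.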


hmpc 6 hours ago

This is explicitly called out in the post as well as the Intel instruction manual. Every codebase I've ever seen that reads the TSC either issues an LFENCE or uses RDTSCP.

In my benchmarks RDTSCP has a slight advantage, despite its higher latency on paper, because it doesn't fully serialise the instruction stream (later instructions can start executing, unlike with LFENCE). Whether the ECX clobber outweighs that depends on the situation.
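
The two ordered reads mentioned here can be sketched with the GCC/Clang intrinsics. This is a minimal sketch assuming x86-64 and `<x86intrin.h>`; the wrapper names (`tsc_lfence`, `tsc_rdtscp`, `have_rdtscp`, `tsc_ordered`) are hypothetical, not from any particular codebase. As I understand it, on AMD the LFENCE variant only behaves this way when the OS has enabled LFENCE's serializing mode, which is why older code used MFENCE there.

```c
#include <stdint.h>
#include <cpuid.h>       /* __get_cpuid (GCC/Clang) */
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* LFENCE; RDTSC: the fence waits for earlier instructions (including loads)
 * to complete before the counter is read. */
static inline uint64_t tsc_lfence(void) {
    _mm_lfence();
    return __rdtsc();
}

/* RDTSCP: waits for earlier instructions itself, but later instructions
 * may begin executing before the read is done. */
static inline uint64_t tsc_rdtscp(void) {
    unsigned int aux;   /* receives IA32_TSC_AUX; this is the ECX clobber */
    return __rdtscp(&aux);
}

/* CPUID leaf 0x80000001, EDX bit 27 reports RDTSCP support.
 * CPUID is slow, so call this once at startup and keep the answer. */
static int have_rdtscp(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 27) & 1u);
}

/* Picks RDTSCP when the caller has confirmed support, else LFENCE; RDTSC. */
uint64_t tsc_ordered(int rdtscp_ok) {
    return rdtscp_ok ? tsc_rdtscp() : tsc_lfence();
}
```

Typical use would be `int ok = have_rdtscp();` once, then `tsc_ordered(ok)` on the hot path. Which variant is faster is something to measure on the target machine rather than assume.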

vlovich123 11 hours ago

Do you know if the Rust quanta / fastant crates have this problem? I feel like they don’t, but I haven’t actually dug into the implementation. The reason I think not is that, at least in the case of quanta, the clock value can be broadcast from a single clock-maintainer thread. But even when it’s using plain rdtsc, it says it upholds monotonicity barring kernel/virtualization bugs:

https://docs.rs/quanta/latest/quanta/struct.Instant.html

So I think it’s possible to do this correctly?
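
One way a library could uphold monotonicity regardless of how the raw counter is read is to clamp every reading against a shared atomic maximum. This is only a sketch of that general technique in C11; it is not taken from quanta or fastant, and `last_seen` and `monotonic_clamp` are hypothetical names. It covers the cross-thread case because the read of `last_seen` is a real load, so it is ordered after an earlier load-acquire even if the raw counter read was not.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t last_seen;   /* largest timestamp handed out so far */

/* Returns the larger of `raw` and every value returned before,
 * raising the shared maximum when `raw` is newer. */
uint64_t monotonic_clamp(uint64_t raw) {
    uint64_t prev = atomic_load_explicit(&last_seen, memory_order_acquire);
    while (prev < raw) {
        /* On failure, `prev` is refreshed with the current maximum
         * and the loop condition is checked again. */
        if (atomic_compare_exchange_weak_explicit(&last_seen, &prev, raw,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire))
            return raw;
    }
    return prev;   /* someone already handed out a time >= raw */
}
```

The cost is that every reader now touches one shared cache line, which gives back much of what reading the TSC directly was supposed to save.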

amluto 10 hours ago

If it’s calling clock_gettime, it should be fine. If it uses RDTSCP, it should be fine (assuming your system actually has synchronized TSCs, and there is a long history of this failing). If it uses the sadly vendor-dependent magic incantation involving LFENCE or MFENCE, it should be fine. If it does plain RDTSC, it may not be fine.

(I have no special insight into what Intel and AMD CPUs do under the hood, but my best guess has always been that RDTSC is implemented by ucode that has no dependencies on anything in the register file except whatever might be internal to the ucode for the instruction itself. And the dispatch logic will cheerfully schedule it as such, including moving it before loads that precede it in the instruction stream. Since RDTSC itself isn’t a load, the magic that makes all loads be acquires does not apply. RDTSCP is probably an excessively heavily pessimized version that waits for earlier loads to actually happen. The really nice hypothetical version where RDTSC “loads” a virtual loadable register in the coherency domain and can be speculated just like a real load is probably too complex to be worth implementing.)

hmpc 4 hours ago

It's definitely possible to do correctly, but looking through the code for both crates it doesn't look like they take the necessary precautions (issuing a fence or using RDTSCP). Which is a little weird because at least quanta explicitly checks for RDTSCP support, but then doesn't seem to use it.

(I'm not a Rust expert and I'm on my phone though, so I might be missing something.)