Comment by menaerus
15 days ago
The data suggests that they are, and so does common sense. Your point of reference is also a bit problematic: there's no code attached, so it's hard for people to validate the measurements.
Since you have been laser-focused on sqrt's "bad" performance and the obvious optimization with sqrtrec, while ignoring the rest of the results, maybe you can explain why there is such a large difference in your measurements between two platforms that look very similar in terms of compute. After all, this is a pure compute problem.
For example, why does a 4.9GHz CPU (AMD Ryzen™ 5 7545U) yield 2x to 4x worse results than a 5.5GHz CPU (AMD Ryzen™ 7 9700X)?
AMD Ryzen 7 9700X Desktop:
----------------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
----------------------------------------------------------------------------
bench_getuid               38.6 ns         38.5 ns     18160546
bench_getpid               39.9 ns         39.9 ns     17703749
bench_close                45.2 ns         45.1 ns     15711379
bench_syscall              42.2 ns         42.1 ns     16638675
bench_sched_yield          81.7 ns         81.6 ns      8623522
AMD Ryzen 5 PRO 7545U Laptop:
----------------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
----------------------------------------------------------------------------
bench_getuid                106 ns          106 ns      6581746
bench_getpid                111 ns          111 ns      6271878
bench_close                 116 ns          116 ns      5944154
bench_syscall              85.9 ns         85.9 ns      7317584
bench_sched_yield           315 ns          315 ns      2249333
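For anyone who wants to check the "2x to 4x" figure, here is a quick sketch that just divides the Time columns of the two tables. Python is used only for the arithmetic; the variable names are mine and not part of any benchmark harness.

```python
# Wall-clock time per call in nanoseconds, copied from the two tables above.
desktop_ns = {
    "bench_getuid": 38.6,
    "bench_getpid": 39.9,
    "bench_close": 45.2,
    "bench_syscall": 42.2,
    "bench_sched_yield": 81.7,
}
laptop_ns = {
    "bench_getuid": 106.0,
    "bench_getpid": 111.0,
    "bench_close": 116.0,
    "bench_syscall": 85.9,
    "bench_sched_yield": 315.0,
}

# Slowdown factor of the laptop part relative to the desktop part.
ratios = {name: laptop_ns[name] / desktop_ns[name] for name in desktop_ns}

for name, ratio in ratios.items():
    print(f"{name:<18} {ratio:.2f}x slower on the laptop")
```

The smallest factor comes from bench_syscall and the largest from bench_sched_yield, which is where the 2x and 4x ends of the range come from.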
Because the low-power laptop part has rather different characteristics from the desktop part, according to the CPUmark benchmarks. It's not surprising that the low-power part is slower; what is surprising is when the newer/faster part is significantly slower for pure CPU operations. Different compilation flags, I guess.
Edit: And, apparently, because regardless of what I do with `cpupower` and twiddling the governors, the CPU frequency on this machine is still getting scaled. I've run out of time to debug that; I'll update later.
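One way to watch whether the frequency is still being scaled is to poll the cpufreq files in sysfs while the benchmark runs. The sketch below assumes a Linux machine that exposes `scaling_governor` and `scaling_cur_freq` under `/sys/devices/system/cpu/cpuN/cpufreq/`; the function name `read_cpufreq` is made up for this example, and CPUs without those files are simply skipped.

```python
import glob
import os


def read_cpufreq(base="/sys/devices/system/cpu"):
    """Return {cpu_name: (governor, current_MHz)} for each CPU exposing cpufreq."""
    result = {}
    pattern = os.path.join(base, "cpu[0-9]*", "cpufreq")
    for cpufreq_dir in sorted(glob.glob(pattern)):
        cpu = os.path.basename(os.path.dirname(cpufreq_dir))
        try:
            with open(os.path.join(cpufreq_dir, "scaling_governor")) as f:
                governor = f.read().strip()
            with open(os.path.join(cpufreq_dir, "scaling_cur_freq")) as f:
                mhz = int(f.read().strip()) / 1000.0  # sysfs reports kHz
        except (OSError, ValueError):
            continue  # file missing or unreadable on this system; skip the CPU
        result[cpu] = (governor, mhz)
    return result


if __name__ == "__main__":
    for cpu, (governor, mhz) in read_cpufreq().items():
        print(f"{cpu}: governor={governor} cur={mhz:.0f} MHz")
```

Running it in a loop in a second terminal during the benchmark should show whether the cores stay near their boost clock or drop well below it.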
https://www.cpubenchmark.net/compare/6205vs6367vs4835/AMD-Ry...
I'm not sure what's up with sched_yield.
I can also replicate these numbers with `perf bench syscall basic`.
I mean, the base and turbo frequencies are about the same on both parts, and the workload is very, very simple. TDP would matter if the workload were eating the power budget of the whole chip, in which case the frequency would have to be scaled down to stay within the limits. I doubt that is the case here, but I guess it can be measured if one is curious enough. In my case, only sqrt was slower; the rest was 2x faster on the more modern CPU.
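To make the frequency argument concrete, the per-call times can be converted into cycles. This is only a back-of-the-envelope sketch: it assumes each part really ran at the clock quoted earlier in the thread (5.5GHz and 4.9GHz), which is exactly the thing in dispute, and `ns_to_cycles` is a helper invented for the example.

```python
def ns_to_cycles(ns, ghz):
    """Cycles spent in `ns` nanoseconds at a clock of `ghz` GHz (1 GHz = 1 cycle/ns)."""
    return ns * ghz


# bench_getuid Time values from the tables, at the clocks quoted in the thread.
desktop_cycles = ns_to_cycles(38.6, 5.5)  # Ryzen 7 9700X, assumed at 5.5GHz
laptop_cycles = ns_to_cycles(106.0, 4.9)  # Ryzen 5 PRO 7545U, assumed at 4.9GHz

print(f"desktop: ~{desktop_cycles:.0f} cycles per getuid")
print(f"laptop:  ~{laptop_cycles:.0f} cycles per getuid")
```

If both chips were really at those clocks, the laptop would be spending well over twice as many cycles on the same trivial syscall, so the nominal clock difference alone can't account for the gap.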
I reran the experiment in a VM, on a company Xeon server clocked at 2.2GHz, and the results are again pretty much the same as before: