
Comment by menaerus

15 days ago

The data suggests that they are, and so does common sense. Your point of reference is also a bit problematic: there's no code attached, so it's hard for others to validate the measurements.

Since you have been laser-focused on the "bad" sqrt performance and the obvious optimization with sqrtrec, but decided to ignore the rest of the results, maybe you can explain why your measurements differ so much between two platforms that look very similar in terms of compute. After all, this is a pure compute problem.

For example, why does a 4.9GHz CPU (AMD Ryzen™ 5 7545U) yield 2x to 4x worse results than a 5.5GHz CPU (AMD Ryzen™ 7 9700X)?

    AMD Ryzen 7 9700X Desktop:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                            38.6 ns         38.5 ns     18160546
    bench_getpid                            39.9 ns         39.9 ns     17703749
    bench_close                             45.2 ns         45.1 ns     15711379
    bench_syscall                           42.2 ns         42.1 ns     16638675
    bench_sched_yield                       81.7 ns         81.6 ns      8623522
    
    AMD Ryzen 5 PRO 7545U Laptop:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                             106 ns          106 ns      6581746
    bench_getpid                             111 ns          111 ns      6271878
    bench_close                              116 ns          116 ns      5944154
    bench_syscall                           85.9 ns         85.9 ns      7317584
    bench_sched_yield                        315 ns          315 ns      2249333

Because the low-power laptop part has rather different characteristics from the desktop part, according to CPUmark benchmarks. It's not surprising that the low-power part is slower; it's surprising when the newer/faster part is significantly slower for pure CPU operations. Different compilation flags, I guess.

Edit: And, apparently, because regardless of what I do with `cpupower` and the governors, the CPU frequency on this machine keeps getting scaled. I've run out of time to debug that; I'll update later.
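
For anyone who wants to check whether the clock is actually being scaled during a run, here is a minimal sketch that reads the Linux cpufreq sysfs files. It assumes a Linux machine that exposes `/sys/devices/system/cpu/cpuN/cpufreq/`; the helper names `cpufreq_path` and `read_khz` are made up for illustration and are not part of any benchmark in this thread.

```cpp
#include <fstream>
#include <string>

// Builds the sysfs path for one cpufreq attribute of one core, e.g.
// cpufreq_path(0, "scaling_cur_freq"). Assumes the standard Linux layout.
std::string cpufreq_path(int cpu, const std::string& attr) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/" + attr;
}

// Reads a single integer (cpufreq reports kHz) from the given file.
// Returns -1 if the file is missing or does not start with a number.
long read_khz(const std::string& path) {
    std::ifstream in(path);
    long khz = 0;
    if (in >> khz) {
        return khz;
    }
    return -1;
}
```

Calling `read_khz(cpufreq_path(0, "scaling_cur_freq"))` before and after a benchmark loop, and comparing against `scaling_max_freq`, gives a rough idea of whether the core stayed at its top frequency. It is only a spot check, since the frequency can change between the two reads.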

https://www.cpubenchmark.net/compare/6205vs6367vs4835/AMD-Ry...

I'm not sure what's up with sched_yield.

I can also replicate these numbers with `perf bench syscall basic`.
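
Since no benchmark source is attached in this thread, here is a rough sketch of how a `bench_getuid`-style number could be cross-checked by hand. It is not the code behind the tables above: it assumes Linux, times a raw `syscall(SYS_getuid)` with `std::chrono::steady_clock`, and the names `ns_per_call` and `raw_getuid` are hypothetical.

```cpp
#include <chrono>
#include <cstdint>
#include <sys/syscall.h>
#include <unistd.h>

// Average wall-clock nanoseconds per call of fn over `iters` iterations.
// The loop overhead is included, so very cheap operations are overstated.
template <typename Fn>
double ns_per_call(Fn fn, std::uint64_t iters) {
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    for (std::uint64_t i = 0; i < iters; ++i) {
        fn();
    }
    const clock::time_point stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}

// Issues getuid through syscall(2) so every call enters the kernel.
long raw_getuid() {
    return syscall(SYS_getuid);
}
```

Something like `ns_per_call(raw_getuid, 10000000)` should land in the same ballpark as the `bench_getuid` rows if the machine behaves the same way, though a hand-rolled loop lacks the warm-up and iteration tuning that a benchmark framework provides.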

  • I mean, the base and turbo frequencies are about the same on both parts, and the workload is very simple. TDP would matter if the workload were using up the power budget of the whole chip, in which case the frequency would have to be scaled down to stay within the limits. I doubt that is the case here, but I guess it can be measured if one is curious enough. In my case, only sqrt was slower; the rest was 2x faster on the more modern CPU.

    I reran the experiment in a VM, on a company Xeon server clocked at 2.2GHz, and the results are again pretty much the same as before:

      ----------------------------------------------------------------------------
      Benchmark                                  Time             CPU   Iterations
      ----------------------------------------------------------------------------
      bench_getuid                             778 ns          778 ns       901999
      bench_getpid                             774 ns          774 ns       902699
      bench_close                              779 ns          779 ns       896939
      bench_syscall                            761 ns          761 ns       916941
      bench_sched_yield                       1121 ns         1121 ns       566012
      bench_clock_gettime                     22.1 ns         22.1 ns     31579512
      bench_clock_gettime_tai                 22.0 ns         22.0 ns     31502402
      bench_clock_gettime_monotonic           22.1 ns         22.1 ns     31848177
      bench_clock_gettime_monotonic_raw       22.4 ns         22.4 ns     30953415
      bench_nanosleep0                       57424 ns         6967 ns        98218
      bench_nanosleep0_slack1                 6342 ns         6340 ns       110862
      bench_nanosleep1_slack1                 6310 ns         6308 ns       111064
      bench_pthread_cond_signal               3.23 ns         3.23 ns    216726274
      bench_assign                           0.323 ns        0.323 ns   1000000000
      bench_sqrt                              2.64 ns         2.64 ns    265275643
      bench_sqrtrec                           4.40 ns         4.40 ns    160328959
      bench_nothing                          0.000 ns        0.000 ns   1000000000