Comment by menaerus

17 days ago

Where is the state/registers written to then if not L1? I'm confused.

What do you say about the measurements from https://gms.tf/on-the-costs-of-syscalls.html? The table suggests that the cost is an order of magnitude larger, from 250 to 620 ns depending on the host CPU.

The architectural registers can be renamed to physical registers. https://en.wikipedia.org/wiki/Register_renaming

As for that article, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order of magnitude variation. It also doesn't say what syscall is being done -- __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.

That makes me suspect something else is up in that benchmark.

For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOVs is very cheap. 100 million calls to this assembly cost 200 ms, or about 2 ns/call:

    f:
    .LFB6:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $8, %rsp
        movq    $42, -128(%rbp)
        movq    $42, -120(%rbp)
        movq    $42, -112(%rbp)
        movq    $42, -104(%rbp)
        movq    $42, -96(%rbp)
        movq    $42, -88(%rbp)
        movq    $42, -80(%rbp)
        movq    $42, -72(%rbp)
        movq    $42, -64(%rbp)
        movq    $42, -56(%rbp)
        movq    $42, -48(%rbp)
        movq    $42, -40(%rbp)
        movq    $42, -32(%rbp)
        movq    $42, -24(%rbp)
        movq    $42, -16(%rbp)
        movq    $42, -8(%rbp)
        nop
        leave
        .cfi_def_cfa 7, 8
        ret

  • The benchmark is simple, but I find it worthwhile because (1) it is run across 15 different platforms (different CPUs, libc's) and the results are pretty much reproducible, and (2) it is run through gbenchmark, which has a mechanism to make the measurements statistically significant.

    An interesting thing that reinforces their hypothesis and measurements is that, for example, getpid and clock_gettime_mono_raw run much faster on some platforms (vDSO) than on the rest.

    Also, the variance between different CPUs is what IMO reinforces their results, not the other way around - I don't expect the same call to have the same cost on different CPU models. Different CPUs, different cores, different clock frequencies, different tradeoffs in design, etc.

    The code is here: https://github.com/gsauthof/osjitter/blob/master/bench_sysca...

    The syscall() row invokes a simple syscall(423), and it seems to be expensive. Other calls such as close(999), getpid(), getuid(), clock_gettime(CLOCK_MONOTONIC_RAW, &ts), and sched_yield() produce similar results. All of them are basically an order of magnitude larger than 50 ns.

    As for register renaming, I know what it is, but I still don't get what register renaming has to do with making the storage of state (registers) a cheaper operation.

    This is from the Intel manual:

      Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).
    

    So, I wrongly assumed that the core has to wait until the data is completely written, but it seems to act more like a memory barrier with relaxed properties - instructions are serialized, but the data written doesn't have to become globally visible.

    I think the most important part of it is "until all instructions prior to the SYSCALL have completed". This means that the whole pipeline has to be drained. With a 20+ stage instruction pipeline and who knows what instructions in flight, I can imagine that this can easily become the most expensive part of the syscall.

    • I can't reproduce. When I run the code at https://github.com/gsauthof/osjitter/blob/master/bench_sysca..., here are the numbers on the computers I have:

          AMD Ryzen 7 9700X Desktop:
          ----------------------------------------------------------------------------
          Benchmark                                  Time             CPU   Iterations
          ----------------------------------------------------------------------------
          bench_getuid                            38.6 ns         38.5 ns     18160546
          bench_getpid                            39.9 ns         39.9 ns     17703749
          bench_close                             45.2 ns         45.1 ns     15711379
          bench_syscall                           42.2 ns         42.1 ns     16638675
          bench_sched_yield                       81.7 ns         81.6 ns      8623522
          bench_clock_gettime                     15.9 ns         15.9 ns     44010857
          bench_clock_gettime_tai                 15.9 ns         15.9 ns     43997256
          bench_clock_gettime_monotonic           15.9 ns         15.9 ns     44012908
          bench_clock_gettime_monotonic_raw       15.9 ns         15.9 ns     43982277
          bench_nanosleep0                       49961 ns          370 ns       100000
          bench_nanosleep0_slack1                10839 ns          351 ns      1000000
          bench_nanosleep1_slack1                10878 ns          358 ns      1000000
          bench_pthread_cond_signal               1.37 ns         1.37 ns    503715097
          bench_assign                           0.563 ns        0.562 ns   1000000000
          bench_sqrt                              1.63 ns         1.63 ns    430096636
          bench_sqrtrec                           5.33 ns         5.33 ns    132574542
          bench_nothing                          0.394 ns        0.394 ns   1000000000
      
          12th Gen Intel(R) Core(TM) i5-12600H:
          ----------------------------------------------------------------------------
          Benchmark                                  Time             CPU   Iterations
          ----------------------------------------------------------------------------
          bench_getuid                            70.0 ns         70.0 ns      9985369
          bench_getpid                            71.6 ns         71.6 ns      9763016
          bench_close                             76.7 ns         76.7 ns      9131090
          bench_syscall                           66.8 ns         66.8 ns     10533946
          bench_sched_yield                        160 ns          160 ns      4377987
          bench_clock_gettime                     12.2 ns         12.2 ns     57432496
          bench_clock_gettime_tai                 12.1 ns         12.1 ns     57826299
          bench_clock_gettime_monotonic           12.2 ns         12.2 ns     57736141
          bench_clock_gettime_monotonic_raw       12.3 ns         12.3 ns     57070425
          bench_nanosleep0                       63154 ns        11834 ns        55756
          bench_nanosleep0_slack1                 2933 ns         1700 ns       348675
          bench_nanosleep1_slack1                 2654 ns         1479 ns       467420
          bench_pthread_cond_signal               1.39 ns         1.39 ns    483995101
          bench_assign                           0.868 ns        0.868 ns    821103909
          bench_sqrt                              1.69 ns         1.69 ns    422094139
          bench_sqrtrec                           4.06 ns         4.06 ns    174511095
          bench_nothing                          0.750 ns        0.750 ns    941204159
      
          AMD Ryzen 5 PRO 7545U Laptop:
          ----------------------------------------------------------------------------
          Benchmark                                  Time             CPU   Iterations
          ----------------------------------------------------------------------------
          bench_getuid                             106 ns          106 ns      6581746
          bench_getpid                             111 ns          111 ns      6271878
          bench_close                              116 ns          116 ns      5944154
          bench_syscall                           85.9 ns         85.9 ns      7317584
          bench_sched_yield                        315 ns          315 ns      2249333
          bench_clock_gettime                     17.6 ns         17.6 ns     39935693
          bench_clock_gettime_tai                 17.6 ns         17.6 ns     39920957
          bench_clock_gettime_monotonic           17.5 ns         17.5 ns     39962966
          bench_clock_gettime_monotonic_raw       17.5 ns         17.5 ns     39561163
          bench_nanosleep0                       52720 ns         3058 ns       100000
          bench_nanosleep0_slack1                13815 ns         2969 ns       244790
          bench_nanosleep1_slack1                13710 ns         2722 ns       254666
          bench_pthread_cond_signal               2.66 ns         2.66 ns    264735233
          bench_assign                           0.930 ns        0.930 ns    813279743
          bench_sqrt                              2.43 ns         2.43 ns    286953468
          bench_sqrtrec                           5.67 ns         5.67 ns    123889652
          bench_nothing                          0.812 ns        0.812 ns    860562208
      

      So, I've tested multiple times in multiple ways, and the results don't seem to match.
