Comment by cmrdporcupine

9 months ago

I'd want to see perf stats on branch prediction misses and L1 cache evictions alongside that though. CPU cycles on their own aren't enough.

My perf doesn't seem to provide a metric for L1 cache evictions (according to perf list).

Here are the results for 100000 rounds, though, from taskset 1 perf record -F10000 -e branch-misses -e cache-misses -e cache-references target/release/RustTokioBenchmark (a)sync; perf report --stat:

async

    Task 2 min roundtrip time: 532
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0,033 MB perf.data (117 samples) ]

    ...    
    branch-misses stats:
              SAMPLE events:         54
    cache-misses stats:
              SAMPLE events:         27
    cache-references stats:
              SAMPLE events:         36

sync

    Thread 2 min roundtrip time: 7096
    [ perf record: Woken up 5584 times to write data ]
    [ perf record: Captured and wrote 0,367 MB perf.data (7418 samples) ]

    ...
    branch-misses stats:
              SAMPLE events:       6577
    cache-misses stats:
              SAMPLE events:        159
    cache-references stats:
              SAMPLE events:        682
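
For readers without the benchmark source at hand, here is a minimal sketch of what the sync variant presumably measures: one message bounced between two OS threads, keeping the smallest roundtrip. It assumes std::sync::mpsc channels and nanoseconds taken from Instant; the helper name min_roundtrip_ns is made up, and the unit of the numbers above is not stated in the output, so this is not the code behind those results.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

/// Bounces a message between two OS threads `rounds` times and returns the
/// smallest observed roundtrip in nanoseconds (None if `rounds` is 0).
/// Hypothetical helper, not the benchmark that produced the numbers above.
fn min_roundtrip_ns(rounds: u32) -> Option<u128> {
    let (to_echo, echo_rx) = mpsc::channel::<u32>();
    let (to_main, main_rx) = mpsc::channel::<u32>();

    // Echo thread: send every message straight back until the sender is dropped.
    let echo = thread::spawn(move || {
        for msg in echo_rx {
            if to_main.send(msg).is_err() {
                break;
            }
        }
    });

    let mut best: Option<u128> = None;
    for i in 0..rounds {
        let start = Instant::now();
        to_echo.send(i).expect("echo thread exited early");
        let reply = main_rx.recv().expect("echo thread exited early");
        let elapsed = start.elapsed().as_nanos();
        assert_eq!(reply, i);
        // Keep the minimum over all rounds.
        best = Some(best.map_or(elapsed, |b| b.min(elapsed)));
    }

    drop(to_echo); // closes the channel so the echo thread's loop ends
    echo.join().expect("echo thread panicked");
    best
}
```

Calling min_roundtrip_ns(100_000) from main would mirror the 100000 rounds used above, though absolute numbers from this sketch are not comparable with the reported ones.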

  • Interesting. Thing is, all you're benchmarking is the cost of sending a message on tokio's channels vs mpsc's channels.

    It would be interesting to compare with crossbeam as well.

    But not sure this reflects anything like a real application workflow. In some ways this is the worst possible performance scenario, just two threads spinning and spinning at the fastest speed they can, dumping messages into a channel and pulling them out? It's a benchmark of the channels themselves and whatever locking/synchronization stuff they use.

    It's a benchmark of a "shared concurrent data" situation, with constant synchronization. What would be more interesting is to have longer running jobs doing some task inside themselves and only periodically (every few seconds, say) synchronizing.
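
    A rough sketch of that kind of workload, assuming std threads and std::sync::mpsc: each worker does thread-local work and only touches the channel once per batch. The function name batched_sum and the batching by item count (instead of by wall-clock seconds) are assumptions made to keep the example deterministic.

    ```rust
    use std::sync::mpsc;
    use std::thread;

    /// Each worker sums its own range locally and sends a partial result only
    /// once per `batch` items (hypothetical workload, batched by count).
    /// Returns (grand total, number of messages sent over the channel).
    fn batched_sum(workers: u64, per_worker: u64, batch: u64) -> (u64, u64) {
        assert!(batch > 0, "batch must be non-zero");
        let (tx, rx) = mpsc::channel::<u64>();

        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let tx = tx.clone();
                thread::spawn(move || {
                    let start = w * per_worker;
                    let mut local = 0u64;
                    let mut pending = 0u64;
                    for value in start..start + per_worker {
                        local += value; // unsynchronized, thread-local work
                        pending += 1;
                        if pending == batch {
                            // The only synchronization point in the loop.
                            tx.send(local).expect("receiver dropped");
                            local = 0;
                            pending = 0;
                        }
                    }
                    if pending > 0 {
                        // Flush the final, partially filled batch.
                        tx.send(local).expect("receiver dropped");
                    }
                })
            })
            .collect();
        drop(tx); // the receive loop ends once every worker's sender is gone

        let mut total = 0u64;
        let mut messages = 0u64;
        for partial in rx {
            total += partial;
            messages += 1;
        }
        for handle in handles {
            handle.join().expect("worker panicked");
        }
        (total, messages)
    }
    ```

    With batch = 1 this degenerates into the constant-synchronization case; raising batch cuts the message count while the total stays the same, which is the knob a more realistic benchmark would vary.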

    What are the tokio executor's default settings there? Multithreaded or not? I'd be curious whether tokio is actually using multiple threads here.

    • Actually I wasn't that interested in throughput, only in latency, measured as instructions executed between sending a message and receiving it, though the throughput is indeed also superior with tokio.

      For most applications this difference doesn't really matter, but maybe some applications do a lot of small things where it does matter? In those cases it might be an easy solution to switch from standard threads to tokio async and gain 10x speed, as the structure of the applications remains the same.

      > It's a benchmark of the channels themselves and whatever locking/synchronization stuff they use.

      Yeah, in retrospect a mutex benchmark might be better, though I don't expect a message channel implemented on top of a mutex to be noticeably slower. A mutex benchmark is probably easier to get wrong.
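
      To make "a message channel implemented on top of that" concrete, here is a minimal sketch of an unbounded queue built from a Mutex and a Condvar. MutexQueue and relay are hypothetical names, and this is not how std's or tokio's channels are actually implemented; it only shows that the channel adds little beyond the lock and the wakeup.

      ```rust
      use std::collections::VecDeque;
      use std::sync::{Arc, Condvar, Mutex};
      use std::thread;

      /// Hypothetical minimal unbounded queue built from a Mutex and a Condvar.
      struct MutexQueue<T> {
          items: Mutex<VecDeque<T>>,
          ready: Condvar,
      }

      impl<T> MutexQueue<T> {
          fn new() -> Self {
              Self {
                  items: Mutex::new(VecDeque::new()),
                  ready: Condvar::new(),
              }
          }

          /// Pushes a value under the lock, then wakes one waiting receiver.
          fn send(&self, value: T) {
              self.items.lock().unwrap().push_back(value);
              self.ready.notify_one();
          }

          /// Blocks until a value is available and returns it.
          fn recv(&self) -> T {
              let mut items = self.items.lock().unwrap();
              loop {
                  if let Some(value) = items.pop_front() {
                      return value;
                  }
                  // Releases the lock while waiting; loops to handle spurious wakeups.
                  items = self.ready.wait(items).unwrap();
              }
          }
      }

      /// Sends 0..count from a second thread and returns what was received.
      fn relay(count: u32) -> Vec<u32> {
          let queue: Arc<MutexQueue<u32>> = Arc::new(MutexQueue::new());
          let producer = {
              let queue = Arc::clone(&queue);
              thread::spawn(move || {
                  for i in 0..count {
                      queue.send(i);
                  }
              })
          };
          let received: Vec<u32> = (0..count).map(|_| queue.recv()).collect();
          producer.join().expect("producer panicked");
          received
      }
      ```

      Timing send/recv pairs on a queue like this next to the real channels would show how much of the measured cost is the lock and wakeup rather than the channel logic.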

      > What would be more interesting is to have longer running jobs doing some task inside themselves and only periodically (every few seconds, say) synchronizing.

      I don't quite see how this would give any different results. Of course, in that case the time it takes to transmit the message would be completely meaningless.

      > What are the tokio executor's default settings there? Multithreaded or not? I'd be curious whether tokio is actually using multiple threads here.

      It's using the multithreaded executor. I tried the benchmark with #[tokio::main(worker_threads = 1)] and with worker_threads = 2: with 1 the result was 529, with 2 it was 566.