Comment by _flux

9 months ago

> If you ever pull up a debugger and step through an async Rust/tokio codebase, you'll get a good sense for what the overhead here we're talking about is.

So I didn't quite do that, but the overhead was interesting to me anyway, and as I was unable to find existing benchmarks (surely they exist?), I instructed the computer to create one for me: https://github.com/eras/RustTokioBenchmark

On this wee laptop the numbers are 532 vs 6381 CPU cycles when sending a message (one way): 532 between two tokio async tasks, 6381 between two kernel threads via std::mpsc, when limited to one CPU. (It's limited to one CPU because rdtscp readings are not comparable between different CPUs; I suppose pinning both threads to their own CPUs and actually measuring the end-to-end delay would solve that, but this is what I have now.)

So this was eye-opening to me, as I expected tokio to be even faster! But still, it's 10x as fast as the thread-based method. A straight-up callback would of course be a lot faster still, but it would affect the way you structure your code.

Improvements to methodology accepted via pull requests :).

I'd want to see perf stats on branch prediction misses and L1 cache evictions alongside that though. CPU cycles on their own aren't enough.

  • It doesn't seem my perf provides a metric for L1 cache evictions (per perf list).

    Here are the results for 100000 rounds of taskset 1 perf record -F10000 -e branch-misses -e cache-misses -e cache-references target/release/RustTokioBenchmark (a)sync; perf report --stat, though:

    async

        Task 2 min roundtrip time: 532
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0,033 MB perf.data (117 samples) ]
    
        ...    
        branch-misses stats:
                  SAMPLE events:         54
        cache-misses stats:
                  SAMPLE events:         27
        cache-references stats:
                  SAMPLE events:         36
    

    sync

        Thread 2 min roundtrip time: 7096
        [ perf record: Woken up 5584 times to write data ]
        [ perf record: Captured and wrote 0,367 MB perf.data (7418 samples) ]
    
        ...
        branch-misses stats:
                  SAMPLE events:       6577
        cache-misses stats:
                  SAMPLE events:        159
        cache-references stats:
                  SAMPLE events:        682

    • Interesting. The thing is, all you're benchmarking is the cost of sending a message on tokio's channels vs std::mpsc's channels.

      It would be interesting to compare with crossbeam as well.

      But I'm not sure this reflects anything like a real application workload. In some ways this is the worst possible performance scenario: just two threads spinning as fast as they can, dumping messages into a channel and pulling them out. It's a benchmark of the channels themselves and whatever locking/synchronization machinery they use.

      It's a benchmark of a "shared concurrent data" situation with constant synchronization. What would be more interesting is longer-running jobs that do some work internally and only periodically (every few seconds, say) synchronize.

      What are the tokio executor's default settings there? Multithreaded or not? I'd be curious whether tokio is actually using multiple threads here.
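      For reference, #[tokio::main] defaults to the multi-threaded work-stealing runtime; the flavor can be picked explicitly via the runtime builder. A minimal sketch (assuming the tokio crate with the rt-multi-thread feature enabled; I don't know which flavor the benchmark repo uses):

      ```rust
      use tokio::runtime::Builder;

      fn main() {
          // The default for #[tokio::main]: a multi-threaded runtime.
          let mt = Builder::new_multi_thread()
              .worker_threads(2) // explicit here; defaults to one per CPU core
              .enable_all()
              .build()
              .unwrap();
          mt.block_on(async { println!("on the multi_thread runtime") });

          // Single-threaded alternative: all tasks run on the current thread.
          let ct = Builder::new_current_thread()
              .enable_all()
              .build()
              .unwrap();
          ct.block_on(async { println!("on the current_thread runtime") });
      }
      ```

      With taskset 1 even the multi-threaded runtime is pinned to one CPU, so the flavor would mostly change scheduling overhead rather than parallelism.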
