Comment by hedora

15 hours ago

I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).

Multithreaded async systems destroy that locality, so the allocator constantly has to run through the exception case: thread A allocates a buffer and goes async, the request wakes up on thread B, which frees the buffer and has to synchronize with A to give it back.
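To make that pattern concrete, here is a minimal sketch using only the Rust standard library. It is an illustration under stated assumptions, not a benchmark: plain OS threads and a channel stand in for an async task migrating between executor threads, the function name `cross_thread_free` is made up for this example, and the buffers go through whatever global allocator the program is built with (not necessarily tcmalloc).

```rust
use std::sync::mpsc;
use std::thread;

/// Allocates `count` buffers of `size` bytes on one thread ("A") and frees
/// every one of them on a different thread ("B").
/// Returns the total number of bytes freed on thread B.
/// (Hypothetical helper, named for this example only.)
fn cross_thread_free(count: usize, size: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Thread A: every allocation happens here.
    let producer = thread::spawn(move || {
        for _ in 0..count {
            let buf = vec![0u8; size];
            tx.send(buf).expect("receiver hung up");
        }
        // `tx` is dropped when this closure returns, which closes the channel.
    });

    // Thread B: takes ownership of each buffer and drops it, so the
    // deallocation runs on a different thread than the allocation.
    let consumer = thread::spawn(move || {
        let mut freed = 0;
        for buf in rx {
            freed += buf.len();
            drop(buf); // freed on B, allocated on A
        }
        freed
    });

    producer.join().expect("producer panicked");
    consumer.join().expect("consumer panicked")
}
```

Calling `cross_thread_free(1_000, 4096)` pushes a thousand 4 KiB buffers through the A-allocates, B-frees path. With a purely thread-caching allocator, each of those frees lands in a cache that did not hand out the memory, which is the exception case described above.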

Are you using async Rust or sync Rust?

Modern tcmalloc uses per-CPU caches via rseq [0]. We use async Rust with multithreaded tokio executors (sometimes multiple in the same application), so relatively high thread counts.

[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...

  • How do you control which CPU your task resumes on? If you don't, then it's still the same problem described above, no?

    • On the OS scheduler side, I'd imagine there's some stickiness that keeps tasks from jumping wildly between cores; I'd expect migration to be modelled as a nonzero cost. Complete speculation, though.

      On the tokio scheduler side, the executor is thread-per-core, and work stealing of in-progress tasks shouldn't be happening too much.

      For all thread pool threads, or threads unaffiliated with the executor, see the earlier speculation on OS scheduler behavior.
