Comment by truth_seeker

5 days ago

> False Sharing : "Pad for concurrent access: Separate goroutine data by cache lines"

This would be worth adding to Go's race detector as a warning to developers.

Most modern CPUs have 64-byte cache lines, but not all of them. Once you start adding performance optimizations such as padding to the cache line size, you're fundamentally optimizing for a particular processor architecture.

That's fine for most deployments, since the vast majority will go to x86_64 or arm64 these days. But Go supports PowerPC, SPARC, RISC-V, s390x... I don't know enough about them, but I wouldn't be surprised if they didn't all have 64-byte cache lines. I can understand how a language runtime that is designed for architecture independence has difficulty with that.

  • The big two, x86_64 and arm64, have 64-byte cache lines, so that's a reasonable assumption in practice. But I was surprised to discover that Apple's M-series laptops have 128-byte cache lines, and that's something a lot of people have and run, albeit not as a server.

  • Something like C++17's `std::hardware_destructive_interference_size` would be nice: a way to just say "align this variable to whatever the cache line size is on the architecture I'm building for".

    If you use these tricks to align everything to 64-byte boundaries, you'll see those speedups on most common systems but lose them on, e.g., Apple's ARM64 chips and POWER7, 8, and 9 chips (128-byte cache lines), s390x (256-byte cache lines), etc. Having some way of choosing the alignment at build time, based on the build target, would be optimal.

  • Seems like judicious use of build tags or architecture-specific file-name suffixes (e.g. `_amd64.go`) would allow for such optimizations, with a fallback to no optimization.