Comment by truth_seeker
5 days ago
> False Sharing : "Pad for concurrent access: Separate goroutine data by cache lines"
This would be worth adding to the Go race detector's mechanism so it can warn developers about it.
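For anyone who hasn't seen the trick the quoted advice describes, here's a minimal Go sketch of cache-line padding (my own example, not the article's, and it hard-codes the 64-byte line size the rest of this thread argues about): each worker's counter sits on its own line, so goroutines bumping neighbouring slots don't keep invalidating each other's lines.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Assumed line size; the rest of the thread is about why this isn't universal.
const cacheLineSize = 64

// paddedCounter fills out a full (assumed) cache line so adjacent array
// elements never share one.
type paddedCounter struct {
	n atomic.Int64
	_ [cacheLineSize - 8]byte // atomic.Int64 is 8 bytes; pad the rest of the line
}

func main() {
	var counters [8]paddedCounter // one slot per goroutine, no false sharing
	var wg sync.WaitGroup
	for id := range counters {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 1_000_000; i++ {
				counters[id].n.Add(1)
			}
		}(id)
	}
	wg.Wait()
	fmt.Println("slot 0:", counters[0].n.Load())
}
```

Without the padding field the eight counters would pack into one or two cache lines and every `Add` would contend; with it, each slot gets its own line on the (assumed) 64-byte architectures.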
Most modern processor architectures use 64-byte CPU cache lines, but not all of them. Once you start adding performance optimizations like tuning for cache line size, you're fundamentally optimizing for a particular processor architecture.
That's fine for most deployments, since the vast majority go to x86_64 or arm64 these days. But Go also supports PowerPC, SPARC, RISC-V, s390x... I don't know enough about them, but I wouldn't be surprised if they weren't all 64-byte cache lines. I can understand how a language runtime designed for architecture independence has difficulty with that.
The big two, x86_64 and arm64, have 64-byte cache lines, so that's a reasonable assumption in practice. But I was surprised to discover that Apple's M-series laptops have 128-byte cache lines, and that's something a lot of people have and run, albeit not as a server.
Something like C++17's `std::hardware_destructive_interference_size` would be nice; being able to just say "Align this variable to whatever the cache line size is on the architecture I'm building for".
If you use these tricks to align everything to 64-byte boundaries, you'll see those speedups on most common systems but lose them on e.g. Apple's arm64 chips and POWER7/8/9 (128-byte cache lines), s390x (256-byte cache lines), etc. Having some way of doing the alignment dynamically based on the build target would be optimal.
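Go already has a partial version of that, as far as I know: golang.org/x/sys/cpu exports a `CacheLinePad` type whose size is picked per GOARCH at build time (64 bytes on amd64, larger on the ppc64 and s390x ports, if I remember the package right), so embedding it gives you build-target-sized padding without hard-coding 64. Rough sketch:

```go
package shards

import "golang.org/x/sys/cpu"

// shard spaces each counter out by whatever pad size x/sys/cpu chose for the
// target GOARCH, so the same source works on amd64, ppc64, s390x, etc.
type shard struct {
	count uint64
	_     cpu.CacheLinePad
}

// One shard per worker; adjacent counters end up on different cache lines on
// every architecture the package knows about.
var perWorker [16]shard
```

It's padding rather than true alignment (the struct isn't aligned to a line boundary, just spaced more than a line apart), so it's not quite `hardware_destructive_interference_size`, but it covers the false-sharing case.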
Apple arm64 supposedly has 64-byte L1 cache line size and 128-byte L2? How does that work? Presumably the lines are independent in L1, but can different cores have exclusive access to adjacent lines? What's the point of narrower lines in L1?
It only landed in Clang last year: https://github.com/llvm/llvm-project/pull/89446
Seems like judicious use of build tags/file name suffixes would allow for such optimizations, with a fallback to no optimization.
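Something like this, perhaps, with one tiny file per architecture family in the same package (the file names are made up for illustration, and the non-64 sizes are just the figures quoted upthread):

```go
// cacheline_default.go
//go:build !ppc64 && !ppc64le && !s390x

package pad

const cacheLineSize = 64 // fallback: the common case

// cacheline_ppc64x.go
//go:build ppc64 || ppc64le

package pad

const cacheLineSize = 128 // POWER7/8/9, per the comment above

// cacheline_s390x.go
//go:build s390x

package pad

const cacheLineSize = 256
```

The padding code then sizes its filler arrays off `cacheLineSize`, and unknown architectures simply get the 64-byte default rather than no build at all.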
On which architecture are cache lines not 64 bytes? It's almost universal.
https://cpufun.substack.com/i/32474663/notable-differences
As noted in the other comments, Apple's M-series chips seem to use a 128-byte cache line. ARM doesn't mandate a particular cache line size for its licensees; 64 bytes just happens to be the consensus standard.