Comment by solatic
5 days ago
Most modern processor architectures use 64-byte CPU cache lines, but not all of them. Once you start applying performance optimizations like tuning for cache line size, you're fundamentally optimizing for a particular processor architecture.
That's fine for most deployments, since the vast majority of them will target x86_64 or arm64 these days. But Go supports PowerPC, Sparc, RISC-V, S390X... I don't know enough about them, but I wouldn't be surprised if they don't all use 64-byte cache lines. I can understand how a language runtime designed for architecture independence has difficulty with that.
The big two, x86_64 and arm64, have 64-byte cache lines, so that's a reasonable assumption in practice. But I was surprised to discover that Apple's M-series laptops have 128-byte cache lines, and that's something a lot of people have and run, albeit not as a server.
Something like C++17's `std::hardware_destructive_interference_size` would be nice: being able to just say "align this variable to whatever the cache line size is on the architecture I'm building for".
If you use these tricks to align everything to 64-byte boundaries, you'll see those speedups on most common systems but lose them on e.g. Apple's ARM64 chips, POWER7, 8, and 9 chips (128-byte cache lines), s390x (256-byte cache lines), etc. Having some way of selecting the alignment based on the build target would be optimal.
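For concreteness, here's roughly what the hard-coded version of the trick looks like in Go: a minimal false-sharing sketch that pads two atomic counters apart, assuming the 64-byte line size this thread is about.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// Two counters updated from different goroutines. Without the padding they
// would likely land on the same cache line, so writes from one core keep
// invalidating the other core's copy (false sharing). The 64-byte line size
// is hard-coded here, which is exactly the portability problem being discussed.
type counters struct {
	a atomic.Int64
	_ [64 - unsafe.Sizeof(atomic.Int64{})]byte // push b onto the next 64-byte line
	b atomic.Int64
}

func main() {
	var c counters
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for n := 0; n < 1_000_000; n++ {
				if id == 0 {
					c.a.Add(1)
				} else {
					c.b.Add(1)
				}
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(c.a.Load(), c.b.Load())
}
```

On a 128-byte-line machine the single 64-byte pad no longer guarantees the two counters sit on separate lines, which is the lost-speedup case described above.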
Apple arm64 supposedly has 64-byte L1 cache line size and 128-byte L2? How does that work? Presumably the lines are independent in L1, but can different cores have exclusive access to adjacent lines? What's the point of narrower lines in L1?
Maybe the point isn't narrower lines in L1 but wider lines in L2? Implicitly bringing more data into the L2 cache while letting the CPU pull smaller chunks of it into L1 to work on. Something like a forced prefetch? Honestly, no idea.
Only landed in clang last year: https://github.com/llvm/llvm-project/pull/89446
Seems like judicious use of build tags / file name suffixes would allow for such optimizations, with a fallback to no optimization; a sketch of that follows below.
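Something like this, for instance. File names and the per-target sizes are illustrative, not a claim about what every implementation actually uses:

```go
// cacheline_amd64.go
//go:build amd64

package cacheline

const Size = 64

// cacheline_darwin_arm64.go
//go:build darwin && arm64

package cacheline

// Apple's M-series chips reportedly use 128-byte lines (see the thread above).
const Size = 128

// cacheline_ppc64x.go
//go:build ppc64 || ppc64le

package cacheline

const Size = 128

// cacheline_s390x.go
//go:build s390x

package cacheline

const Size = 256

// cacheline_fallback.go
// Any target without its own file gets a conservative default.
//go:build !amd64 && !ppc64 && !ppc64le && !s390x && !(darwin && arm64)

package cacheline

const Size = 64

// cacheline.go (no build tag, compiled for every target)
package cacheline

// Pad can be placed between struct fields to push them onto separate lines.
type Pad struct{ _ [Size]byte }
```

If I remember right, golang.org/x/sys/cpu takes roughly this approach with its CacheLinePad type, so there's some precedent for it in the Go ecosystem.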
On which architecture are cache lines not 64 bytes? It's almost universal.
https://cpufun.substack.com/i/32474663/notable-differences
As noted in the other comments, Apple's M-series chips seem to use a 128-byte cache line. ARM doesn't mandate that its licensees use a particular cache line size: 64 bytes just happens to be the size most implementations have converged on.