Comment by scrubs

12 hours ago

"Another unique effect is L2 shared between 4 cores. This means that thread communications across those 4 cores has much lower latencies."

@dragontamer solid point. Consider an in-memory ring buffer shared between two threads. There's a huge difference in throughput and latency depending on whether the threads share an L2 (on the same core) or run on different cores, and it all comes down to the relative slowness of L3.
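To make the ring concrete, here's a minimal sketch of the kind of single-producer/single-consumer ring I have in mind. The names (`SpscRing`, `run_transfer`) are made up for illustration, the 64-byte cache line size is an assumption, and it is not taken from any particular library. It does no thread pinning, so by itself it won't show the L2 vs L3 gap; you'd have to pin the two threads yourself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical single-producer/single-consumer ring; illustrative only.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    // Producer side: returns false if the ring is full.
    bool try_push(const T& v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    // Consumer side: returns false if the ring is empty.
    bool try_pop(T& out) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return true;
    }
private:
    // Assumed 64-byte lines: keep the indices apart to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    alignas(64) T buf_[N];
};

// Pushes 0..count-1 from a producer thread and sums them on the caller.
std::uint64_t run_transfer(std::uint64_t count) {
    SpscRing<std::uint64_t, 1024> ring;
    std::uint64_t sum = 0;
    std::thread producer([&] {
        for (std::uint64_t i = 0; i < count; ++i)
            while (!ring.try_push(i)) { /* spin: ring is full */ }
    });
    std::uint64_t v = 0;
    for (std::uint64_t got = 0; got < count;) {
        if (ring.try_pop(v)) { sum += v; ++got; }
    }
    producer.join();
    assert(!ring.try_pop(v));  // everything pushed has been consumed
    return sum;
}
```

Every push and pop bounces the `head_`/`tail_` cache lines between the two threads, which is why the cache level they meet at matters so much.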

Are there other CPUs (Arm, Graviton?) that have similarly shared L2 caches?

Hyperthreading actually shares the L1 caches between two threads (after all, both threads run on the same core and use the same L1).

I believe IBM Power10's SMT4 and SMT8 cores also share their L1 caches (up to 8 threads on one core), and thus benefit from the same fast thread-to-thread communication.

But you're right in that this is a very obscure performance quirk. I'm honestly not aware of any practical code that takes advantage of this. E-cores are perhaps the most "natural" implementation of high-speed core-to-core communications though.
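If anyone wants to try exploiting this, the first step is finding out which logical CPUs actually share an L2. A rough sketch, assuming Linux and its sysfs cache topology files (`level` and `shared_cpu_list` under `/sys/devices/system/cpu/cpuN/cache/indexM/`); `parse_cpu_list` and `l2_siblings` are names I made up, and the cap of 8 cache indices is an arbitrary assumption.

```cpp
#include <cassert>
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Parses a Linux cpulist string such as "0-3,8" into a set of CPU ids.
std::set<int> parse_cpu_list(const std::string& s) {
    std::set<int> cpus;
    std::stringstream ss(s);
    std::string part;
    while (std::getline(ss, part, ',')) {
        if (part.empty()) continue;
        const auto dash = part.find('-');
        const int lo = std::stoi(part.substr(0, dash));
        const int hi = (dash == std::string::npos)
                           ? lo
                           : std::stoi(part.substr(dash + 1));
        for (int c = lo; c <= hi; ++c) cpus.insert(c);
    }
    return cpus;
}

// Returns the logical CPUs sharing an L2 with `cpu`, or an empty set
// if the sysfs files are missing (non-Linux, restricted container, ...).
std::set<int> l2_siblings(int cpu) {
    const std::string base =
        "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cache/";
    for (int idx = 0; idx < 8; ++idx) {  // 8 is an assumed upper bound
        const std::string dir = base + "index" + std::to_string(idx) + "/";
        std::ifstream level(dir + "level");
        int lvl = 0;
        if (!(level >> lvl) || lvl != 2) continue;  // not the L2 entry
        std::ifstream shared(dir + "shared_cpu_list");
        std::string line;
        std::getline(shared, line);
        return parse_cpu_list(line);
    }
    return {};
}
```

On a machine with clustered E-cores I'd expect `l2_siblings` to return the whole cluster for an E-core, and just the SMT siblings for a core with a private L2, but I haven't verified that across hardware.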