Comment by phire
11 hours ago
The short answer is that L3 is local to each CCD.
And that answer is good enough for most workloads. You should stop reading now.
_______________________
The complex answer is that there is some ability one CCD to pull cachelines from the other CCD. But I've never been able to find a solid answer for the limitations on this. I know it can pull a dirty cache line from the L1/L2 of another CCDs (this is the core-to-core latency test you often see in benchmarks, and there is an obvious cross-die latency hit).
But I'm not sure it can pull a clean cacheline from another CCD at all, or if those just get redirected to main memory (as the latency to main memory isn't that much higher than between CCDs). And even if it can pull a clean cacheline, I'm not sure it can pull them from another CCD's L3 (which is an eviction cache, so only holds clean cachelines).
The only way for a cacheline to get into a CCD's L3 is to be evicted from an L2 on that core, so if a dataset is active across both CCDs, it will end up duplicated across both L3s. Cachelines evicted from one L3 do NOT end up in another L3, so an idle CCD can't act as a pseudo L4.
I haven't seen anyone make a benchmark which would show the effect, if it exists.
No comments yet
Contribute on Hacker News ↗