Comment by lukax
9 hours ago
NUMA can cause really crappy performance. We deployed a Go-based LLM gateway in Kubernetes on a server with hundreds of CPU cores. We didn't explicitly set GOMAXPROCS, so the Go runtime scheduled goroutines across different CPUs; the gateway constantly used 200% CPU and GC was causing latency spikes. Then we set GOMAXPROCS to 8 and all the performance issues went away. Until recently Kubernetes didn't work well with NUMA.
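For anyone wanting to try the same thing, here is a minimal sketch of capping GOMAXPROCS at startup. It is not the gateway's actual code: the `GATEWAY_MAX_PROCS` variable name, the `capProcs` helper, and the default of 8 are assumptions made for illustration. Only `runtime.GOMAXPROCS` and `runtime.NumCPU` are real standard-library calls.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// capProcs returns the GOMAXPROCS value to apply. A request that is zero,
// negative, or larger than the number of available CPUs falls back to
// "available"; anything else is used as-is. (Hypothetical helper.)
func capProcs(requested, available int) int {
	if requested <= 0 || requested > available {
		return available
	}
	return requested
}

func main() {
	// GATEWAY_MAX_PROCS is a made-up variable name for this sketch;
	// 8 mirrors the value mentioned in the comment above.
	requested := 8
	if v, err := strconv.Atoi(os.Getenv("GATEWAY_MAX_PROCS")); err == nil {
		requested = v
	}

	n := capProcs(requested, runtime.NumCPU())

	// runtime.GOMAXPROCS(n) applies the new limit and returns the old one;
	// calling it with 0 only reads the current value.
	prev := runtime.GOMAXPROCS(n)
	fmt.Printf("GOMAXPROCS: %d -> %d (NumCPU=%d)\n",
		prev, runtime.GOMAXPROCS(0), runtime.NumCPU())
}
```

The same cap can be applied without any code change by setting the `GOMAXPROCS` environment variable (for example `GOMAXPROCS=8`) on the container. Note that this only limits how many threads run Go code in parallel; it does not pin the process to a NUMA node, which is a separate step.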
Heck, we saw crazy performance degradation with Redis when its memory usage exceeded a single NUMA node. Not much to be done about that at the k8s level when Redis is single-threaded. You have to be super conscious of the underlying hardware at that point.
> Kubernetes deployed on a server with hundreds of CPU cores
Was that a Power9 or some sort of IBM machine?
Not all NUMA is the same; ccNUMA from Intel is a different beast from the PPC version.
Is this on AMD? I wonder if it's all down to NUMA or to their CCD architecture etc. (well, these days Intel and everyone else also does it to some extent).
Intel suffers just as much when NUMA enters the picture, even prior to CCD-style architectures. That extra latency hop across to the other core to get at memory is absolutely crippling, especially in a hot loop. It requires very careful handling, while being a kind of invisible element: unless you know to look for it, nothing will draw your attention to it.
Hundreds of cores is likely two sockets and so you've got NUMA there.
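If you want to check rather than guess, Linux exposes the node layout under `/sys/devices/system/node`. Below is a rough sketch that lists each NUMA node and how many CPUs it holds. It assumes a Linux host with sysfs mounted, and `countCPUs` is a helper written for this example, not a library function.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// countCPUs returns how many CPUs a kernel cpulist string such as
// "0-3,8-11" describes. (Hypothetical helper for this sketch.)
func countCPUs(list string) (int, error) {
	list = strings.TrimSpace(list)
	if list == "" {
		return 0, nil
	}
	total := 0
	for _, part := range strings.Split(list, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return 0, err
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return 0, err
			}
		}
		if last < first {
			return 0, fmt.Errorf("bad range %q", part)
		}
		total += last - first + 1
	}
	return total, nil
}

func main() {
	// Each NUMA node appears as a nodeN directory with a cpulist file.
	paths, _ := filepath.Glob("/sys/devices/system/node/node[0-9]*/cpulist")
	if len(paths) == 0 {
		fmt.Println("no NUMA topology found in sysfs")
		return
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		n, err := countCPUs(string(data))
		if err != nil {
			continue
		}
		node := filepath.Base(filepath.Dir(p))
		fmt.Printf("%s: %d CPUs (%s)\n", node, n, strings.TrimSpace(string(data)))
	}
}
```

More than one node in the output means memory access cost depends on which CPU a thread lands on.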
Scaling to large core counts has a lot of gotchas.
There is one instance where the NUMA performance never disappoints: https://www.youtube.com/watch?v=Cqd1Gvq-RBY
There are in fact two instances: https://www.youtube.com/watch?v=ZBKm1MBsTbk