Comment by lukax

9 hours ago

NUMA can cause really crappy performance. We deployed a Go-based LLM gateway in Kubernetes deployed on a server with hundreds of CPU cores. We didn't explicitly set GOMAXPROCS, so the Go runtime scheduled goroutines across all the different CPUs; it constantly used 200% CPU, and GC was causing latency spikes. Then we set GOMAXPROCS=8 and all the performance issues went away. Until recently, Kubernetes didn't work well with NUMA.
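For anyone wanting to try the same fix from inside the program rather than through the environment variable, here is a rough sketch. It assumes a cap of 8 (the value mentioned above); `capProcs` and `podCPULimit` are made-up names for illustration, not something from the gateway itself. `runtime.NumCPU` and `runtime.GOMAXPROCS` are the standard library calls.

```go
package main

import (
	"fmt"
	"runtime"
)

// capProcs picks how many OS threads may run Go code at once: the smaller
// of the CPUs the runtime can see and a caller-supplied limit.
// A non-positive limit means "no cap". (Hypothetical helper.)
func capProcs(visibleCPUs, limit int) int {
	if limit <= 0 || limit > visibleCPUs {
		return visibleCPUs
	}
	return limit
}

func main() {
	// Assumption: 8 mirrors the GOMAXPROCS value from the comment above;
	// in practice it would come from the pod's CPU limit.
	const podCPULimit = 8

	n := capProcs(runtime.NumCPU(), podCPULimit)
	prev := runtime.GOMAXPROCS(n) // sets the new value, returns the old one
	fmt.Printf("GOMAXPROCS: %d -> %d (visible CPUs: %d)\n", prev, n, runtime.NumCPU())
}
```

Setting `GOMAXPROCS=8` in the container's environment should have the same effect without a code change. Note that this only limits parallelism; it does not pin the process to one NUMA node.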

8 comments

strifey 1 hour ago

Heck, we saw crazy performance degradation with redis when its memory usage exceeded a single NUMA block. Not much to be done about that at the k8s level when redis is single-threaded. Have to be super conscious of the underlying hardware at that point.
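To check how much memory each NUMA node actually has before sizing something like redis, one option on Linux is to read the per-node `meminfo` files under sysfs. This is only a sketch: it assumes the usual `/sys/devices/system/node/node*/meminfo` layout with lines shaped like `Node 0 MemTotal: 32768000 kB`, and `parseMemTotalKB` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMemTotalKB extracts the MemTotal value (in kB) from the contents of
// a per-node meminfo file. It reports false if no such line is found.
func parseMemTotalKB(meminfo string) (int64, bool) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Assumed shape: "Node 0 MemTotal: 32768000 kB"
		if len(f) >= 4 && f[2] == "MemTotal:" {
			kb, err := strconv.ParseInt(f[3], 10, 64)
			return kb, err == nil
		}
	}
	return 0, false
}

func main() {
	// On non-Linux systems (or without sysfs) the glob simply matches nothing.
	paths, _ := filepath.Glob("/sys/devices/system/node/node*/meminfo")
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		if kb, ok := parseMemTotalKB(string(data)); ok {
			fmt.Printf("%s: %d MiB\n", filepath.Base(filepath.Dir(p)), kb/1024)
		}
	}
}
```

If the instance's expected memory footprint is larger than the smallest node reported here, some of its memory will likely end up on a remote node.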

gopalv 3 hours ago

> Kubernetes deployed on a server with hundreds of CPU cores

Was that a Power9 or some sort of IBM machine?

Not all NUMA is the same; ccNUMA from Intel is a different beast from the PPC version of the same.

re-thc 9 hours ago

Is this on AMD? I wonder if it's all down to NUMA, or to their CCD architecture etc. (well, these days Intel and everyone else also do it to some extent).

Twirrim 7 hours ago

Intel suffers just as much when NUMA enters the picture, even prior to CCD style architecture. That extra latency hop across to the other core to get at memory is absolutely crippling, especially in a hot loop. It requires very careful handling, while being this kind of invisible element (unless you know to look for it, nothing will draw your attention to it).

toast0 9 hours ago

Hundreds of cores is likely two sockets and so you've got NUMA there.
Scaling to large core counts has a lot of gotchas.

CarRamrod 8 hours ago

There is one instance where the NUMA performance never disappoints: https://www.youtube.com/watch?v=Cqd1Gvq-RBY

drunkboxer 6 hours ago

There are in fact two instances: https://www.youtube.com/watch?v=ZBKm1MBsTbk