Comment by jeffbee
1 year ago
I would say that this has relatively little to do with Kubernetes in the end. The Kubelet just turns the knobs and pulls the levers that Linux offers. If you understand how Linux runs your program, then what K8s does will seem obvious.
A detail I would like to quibble with: GOMAXPROCS is not by default the number of CPUs "on the node" as the article states. It is the number of set bits in the task's CPU affinity mask at startup. That will not generally be the number of CPUs on the node, since the mask is shaped by the other tenants and their resource configurations. "Other tenants" includes the kubelet and whatever other system containers are present.
The problem with this default scheme is that GOMAXPROCS is latched in once at startup, but the actual CPU mask may change while the task is running. If you start 100 replicas of something on 100 different nodes, they may each end up with a different GOMAXPROCS, which will affect the capacity of each replica. So it is better to explicitly set GOMAXPROCS to something reasonable.
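For quota-limited workloads, "something reasonable" usually means deriving GOMAXPROCS from the cgroup CPU quota instead of the affinity mask. A minimal sketch of that idea, assuming cgroup v2 with the unified hierarchy mounted at /sys/fs/cgroup (uber-go/automaxprocs does this properly, including cgroup v1 layouts):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// quotaToProcs converts a cgroup v2 cpu.max line ("<quota> <period>" or
// "max <period>") into a GOMAXPROCS value, rounding the quota up and
// falling back to the given default when there is no limit.
func quotaToProcs(cpuMax string, fallback int) int {
	fields := strings.Fields(cpuMax)
	if len(fields) != 2 || fields[0] == "max" {
		return fallback // "max" means no quota is set
	}
	quota, qErr := strconv.Atoi(fields[0])
	period, pErr := strconv.Atoi(fields[1])
	if qErr != nil || pErr != nil || period <= 0 {
		return fallback
	}
	procs := (quota + period - 1) / period // round up
	if procs < 1 {
		procs = 1
	}
	return procs
}

func main() {
	// The path is an assumption: cgroup v2, unified hierarchy.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return // not under cgroup v2; keep the runtime default
	}
	runtime.GOMAXPROCS(quotaToProcs(string(data), runtime.NumCPU()))
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

Rounding the quota up (so a 2.5-CPU quota gives 3 procs) is a judgment call; rounding down trades a little throughput for less throttling.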
> I would say that this has relatively little to do with Kubernetes in the end.
It does. E.g., this issue does not exist with LXD. LXD mounts a custom procfs inside the container that exposes the correct values of system resources allotted to the container. K8s doesn't, probably because k8s started out as a way to run docker containers, and docker couldn't care less about doing things the right way.
See for yourself by running htop in an LXD container while dynamically changing the container's CPU and memory limits. Unlike k8s, there's no need to restart the container for the new limits to apply; they update live.
I think it kind of does have to do with Kubernetes, in that Kubernetes embeds assumptions in its design and UI about the existence of a kernel capability which is almost, but not quite, entirely unlike the cpu.max cgroup knob, and then tries to use cpu.max anyway. Leaving CPUs idle when threads are runnable is not normally a desirable thing for a scheduler to do, CPU usage is not measured in a "number of cores", and a concurrency limit is about the least energy-efficient way to pretend you have a slower chip than you really do.
There is a reason these particular users keep stepping on the same rake.
cpu.uclamp.max is a little closer to the mental model k8s is teaching people, but it violates the usage=n_cores model too, and most servers are using the performance governor anyway.
Or just update it at runtime every minute or something.
The Go runtime isn't really dynamic in that regard.
It has been from the first version: https://pkg.go.dev/runtime#GOMAXPROCS
You can tail some devices, can't you?