Comment by ekimekim

1 year ago

The other problem with not setting limits is that it's very easy to routinely use more than your requests, and you won't know you're misconfigured until the day you get a noisy neighbor and only receive what you asked for.

Monitoring helps, but requires some nuance. For example, your average CPU might look fine at 50%, but in truth you're using 200% for 500ms followed by 0% for 500ms, and when CPU is scarce your latency unexpectedly doubles.

While it doesn't eliminate the problem entirely (as you rightly point out), enforcing limits even when there's excess CPU available mostly ensures that your performance doesn't suddenly change due to outside factors, which IMO is more valuable than having higher performance most-but-not-all of the time.

> For example, your average CPU might look fine at 50%, but in truth you're using 200% for 500ms followed by 0% for 500ms, and when CPU is scarce your latency unexpectedly doubles.

That is exactly the behavior that cgroups' cpu.max has, except it'd have to be 50 ms instead of 500 with the default period.

The problem with cpu.max is that people want a "50%" CPU limit to make the kernel force-idle your threads in the same timeslice size you'd get with something else competing for the other 50% of the CPU, but that is not actually what cpu.max does. Perhaps that is what it should do, but unfortunately, the `echo $maxruntime_ns $period_ns >cpu.max` thing is UAPI. Although, I don't know if anyone would complain if one day the kernel started interpreting that as a rational fraction and ignoring the absolute values of the numbers.

This makes me really want to write a program that RDTSCs in a loop into an array, and then autocorr(diff()) the result. That'd probably expose all kinds of interesting things about scheduler timeslices, frequency scaling, and TSC granularity.

If you don't let people burst, you lose a key benefit of multi-tenancy: each workload has to be provisioned conservatively so it never throttles, and your nodes end up badly underutilized because that headroom can't be shared across workloads.

With autoscaling, if a workload is using more than its allocated CPU, more containers will be brought online to bring utilization back down, which returns the system to balance.