
Comment by jeffbee

2 days ago

The reason I hedged and said "... or some automatic system ..." is that they use a machine-learned forecast of each container's memory requirements, and that forecast becomes the container's soft limit when it starts. You can read about that at [1]. But the point I was getting at is that using less than the configured amount of memory does not allow more containers to be scheduled on a given machine, nor does it lower the economic chargeback: machines are scheduled, and operators are charged, by the configured limit, not by usage.

Giving memory back to the operating system is antithetical to the nature of caching allocators ("caching" is right there in the name of tcmalloc). The whole point of a caching allocator is that if you needed the memory once, you'll probably need it again, and most likely right now. Unless you configure them differently, the most these allocators will do is release memory to the system very, very slowly, and only when an entirely empty huge page (a contiguous region of several megabytes) surfaces. You can read how grudgingly the tcmalloc authors allow releasing at [2]. jemalloc was once fairly aggressive about releasing to the OS, but these days it is not; I think this reflects both its evolution to suit Meta's internal workloads and a better understanding of the costs of releasing memory from a huge-page-aware allocator.
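To make this concrete, both allocators expose knobs for how eagerly freed pages go back to the OS. A sketch of the relevant settings (the specific values here are arbitrary examples for illustration, not recommendations):

```shell
# jemalloc: how long dirty/muzzy pages may linger before being purged
# back to the OS, in milliseconds (0 = purge eagerly, -1 = never purge).
# The defaults keep pages around for ~10 seconds.
export MALLOC_CONF="dirty_decay_ms:10000,muzzy_decay_ms:0"

# gperftools tcmalloc: how aggressively free memory is released on
# free(); 0 = never release, larger values = more eager.
export TCMALLOC_RELEASE_RATE=1
```

The newer google/tcmalloc (the one documented at [2]) instead exposes this programmatically, e.g. via `tcmalloc::MallocExtension::SetBackgroundReleaseRate()` driven by a background thread, which is part of why releasing happens so slowly by default.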

1: https://dl.acm.org/doi/pdf/10.1145/3342195.3387524

2: https://github.com/google/tcmalloc/blob/master/docs/tuning.m...