In the containerized environments where these allocators were mainly developed, it is all but pointless to return memory to the kernel. You might as well keep everything your container is entitled to use, because the other containers can't use it anyway. Someone, or some automatic system, has already written down how much memory the container is going to use.
Returning no-longer-used anonymous memory is not without benefits, though.
Returning pages allows them to be used for disk cache. They can be zeroed in the background by the kernel, which may save time when they're next needed, or the zeroing can be skipped entirely if the kernel uses them as the destination of a full-page DMA write.
Also, returning no-longer-used pages gets you closer to a useful measure of memory actually in use. Measuring memory usage is difficult, of course, but making the numbers a little more accurate helps.
Zeroed pages also compress more efficiently because the compressor doesn’t actually need to process them.
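For concreteness, here is a minimal sketch (Linux-specific, and not taken from any of the allocators discussed) of the mechanism an allocator uses to hand anonymous pages back to the kernel without unmapping them:

  // Minimal sketch, assuming Linux and a page-aligned run. MADV_DONTNEED
  // drops the physical pages immediately; the next touch faults in fresh
  // zeroed memory. MADV_FREE (kernel >= 4.5) is lazier: the pages are only
  // reclaimed under memory pressure, so re-touching them soon stays cheap.
  #include <sys/mman.h>
  #include <cassert>
  #include <cstddef>

  int main() {
    constexpr std::size_t kLen = 2 * 1024 * 1024;  // one x86-64 huge page
    void* mem = mmap(nullptr, kLen, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);
    char* p = static_cast<char*>(mem);
    p[0] = 1;                                  // fault one page in
    if (madvise(p, kLen, MADV_DONTNEED) != 0)  // give the run back; RSS drops
      return 1;
    assert(p[0] == 0);  // a later read is served with zeros again
    return 0;
  }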
I know Google has good engineering, but I find this a bit implausible?
For most applications, especially request/response apps like web servers, truly right-sizing while accounting for spikes takes a lot of engineering effort: you have to work out how much allocation a single request needs, then ensure the number of concurrent requests never exceeds that budget, so you never risk OOMs.
I can see this being fine-tuned for extremely high-scale core services (load balancers, SDNs, file systems), where you probably want to allocate all your data structures at startup and never allocate anything afterwards, and where you probably have whole teams of engineers devoted to a single service. But not most apps?
Surely it's better for containers to share system memory, and rely on limits and resource-driven autoscaling to make the system resilient?
The reason I hedged and said "... or some automatic system ..." is that they use a machine-learned forecast of the memory requirements of every container and use that as the soft limit for the container when it starts. You can read about that at [1]. But what I was getting at is that using less than the configured amount of memory does not let more containers be scheduled on a given machine, nor does it lower the economic chargeback. Machines are scheduled, and operators are charged, by the configured limit, not by usage.
Giving memory back to the operating system is antithetical to the nature of caching allocators ("caching" is right there in the name of "tcmalloc"). The whole point of a caching allocator is that if you needed the memory once, you'll probably need it again, and most likely right now. At most what these allocators will do unless you configure them differently is to release memory to the system very, very slowly, and only if an entirely empty huge page — a contiguous area of several megabytes — surfaces. You can read how grudgingly the tcmalloc authors allow releasing at [2]. jemalloc was once pretty aggressive about releasing to the OS, but these days it is not. I think this reflects its evolution to suit Meta internal workloads, and increased understanding of the costs of releasing memory from a huge-page-aware allocator.
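For what it's worth, the new tcmalloc does expose knobs to override that reluctance. Here's a sketch based on the tuning doc at [2]; the names are from tcmalloc's malloc_extension.h as I remember it, so treat this as an approximation and check it against your vendored copy:

  #include <thread>
  #include "tcmalloc/malloc_extension.h"

  int main() {
    // Ask tcmalloc to trickle free memory back to the OS at roughly
    // 10 MiB/s (by default it effectively doesn't release at all).
    tcmalloc::MallocExtension::SetBackgroundReleaseRate(
        tcmalloc::MallocExtension::BytesPerSecond{10 << 20});

    // The rate only takes effect if something runs tcmalloc's background
    // actions; the call never returns, so give it a dedicated thread.
    std::thread(tcmalloc::MallocExtension::ProcessBackgroundActions).detach();

    // One-shot alternative: release up to 64 MiB of cached memory now.
    tcmalloc::MallocExtension::ReleaseMemoryToSystem(64 << 20);
    return 0;
  }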
[1] https://dl.acm.org/doi/pdf/10.1145/3342195.3387524
[2] https://github.com/google/tcmalloc/blob/master/docs/tuning.m...
glibc's allocator was not written with containerized environments in mind, and I think it's telling that a core feature of the newer tcmalloc Google open sourced is that it returns memory efficiently; clearly this matters even in containers. One reason is how kernels compress memory: pages released to the kernel are explicitly zeroed (unlike free pages held inside a user-space allocator), which aids compression even in a containerized workload, because those pages can simply be skipped as unused and the kernel can back lazy allocations with its shared reference zero page.
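To illustrate the skipping: as I understand it, zram checks for "same-filled" pages before compressing, and a page whose words are all identical (all-zero being the common case) is stored as a single word rather than compressed. A toy version of that test, not actual kernel code:

  #include <cstddef>
  #include <cstdint>

  // Toy version of a "same-filled page" check: if every 64-bit word of
  // the page equals the first one (all zeros in the common case), the
  // page can be recorded as that one word instead of being compressed.
  bool same_filled(const std::uint64_t* page,
                   std::size_t words = 4096 / sizeof(std::uint64_t)) {
    for (std::size_t i = 1; i < words; ++i)
      if (page[i] != page[0]) return false;
    return true;
  }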
Also, the kernel itself needs memory for lots of things, and it running short, or having to hunt for contiguous pages, is not good. Additionally, in a VM or container environment there are other containers and VMs running on the machine, so memory will eventually be percolated up to the hypervisor for rebalancing. None of this happens if the user-space allocator greedily hangs on to memory it doesn't need, and such an application is also a more likely target for the OOM killer.