Comment by menage

5 years ago

It's probably true that public use of such technology can be dated to LXC. But cgroups, and the initial (private) userspace implementations, definitely come from Google.

There were many efforts to get some kind of container technology into Linux, from the early 2000s (and probably earlier). VServer, OpenVZ, etc., were all trying to support full server virtualization.

Cgroups was different from most of them in that it wasn't trying to present a virtual server environment, just facilitate resource control. At Google, all the jobs we ran on Borg (the internal system that later inspired Kubernetes) knew that they were in a shared environment, so there was much less need to fake a private virtual environment. The low-level libraries linked into all Google binaries did a lot of coordination with Borg to make the sharing fairly painless. But we were trying to run tens of jobs on each of our increasingly large machines, and some of the batch jobs would inevitably end up hogging resources, which would hurt the important latency-sensitive jobs. CPU and memory isolation were the main things that we cared about, with network/disk as secondary elements.

For a while (in 2005-2007) in Borg we approximated this by using the kernel's "fake NUMA" support (originally intended for testing the NUMA code) to break up the system into a bunch of fake NUMA nodes, and using cpusets to reserve CPUs and memory chunks for important jobs. It was pretty ugly and rather coarse-grained, but it worked well, and it was the first widespread userspace system (running on millions of Google Linux servers) for controlling a cgroups-like mechanism.

Essentially, cgroups piggybacked on the existing cpusets mechanism/API, which had already been accepted into the kernel, and expanded it into a more generic way of creating a hierarchy of groups and mapping a process to a group. This was a lot simpler to get accepted than an entire virtualization system. Given that mapping, making the scheduling of other resources group-aware rather than just process-aware was much more straightforward. The same userspace support in Borg that had been used for controlling cpusets worked pretty much as-is with cgroups, since the basic API was the same (just more resources were supported).
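
To make the "hierarchy of groups plus a process-to-group mapping" idea concrete, here is a toy in-memory model in Python. It is only a sketch for illustration, not the kernel's interface or anything Borg ran: the names `Group`, `Hierarchy`, `attach` and `effective_limit`, the `memory_bytes` key, and the rule that the tightest ancestor limit wins are all assumptions made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class Group:
    """One node in the group tree (hypothetical model, not a kernel struct)."""
    name: str
    parent: Optional["Group"] = None
    children: Dict[str, "Group"] = field(default_factory=dict)
    # Per-resource settings; the keys are illustrative, e.g. "memory_bytes".
    limits: Dict[str, int] = field(default_factory=dict)

    def path(self) -> str:
        # Build a filesystem-like path by walking up to the root.
        if self.parent is None:
            return "/"
        prefix = self.parent.path().rstrip("/")
        return f"{prefix}/{self.name}"


class Hierarchy:
    """A tree of groups plus the process-to-group mapping."""

    def __init__(self) -> None:
        self.root = Group(name="")
        self._group_of: Dict[int, Group] = {}

    def mkdir(self, parent: Group, name: str) -> Group:
        # Create a child group under an existing one.
        child = Group(name=name, parent=parent)
        parent.children[name] = child
        return child

    def attach(self, pid: int, group: Group) -> None:
        # A process belongs to exactly one group; attaching moves it.
        self._group_of[pid] = group

    def group_of(self, pid: int) -> Group:
        # Processes that were never attached live in the root group.
        return self._group_of.get(pid, self.root)

    def effective_limit(self, pid: int, resource: str) -> Optional[int]:
        # A "group-aware" controller asks about the process's group, not the
        # process. Assumption for this sketch: the tightest ancestor wins.
        limit: Optional[int] = None
        node: Optional[Group] = self.group_of(pid)
        while node is not None:
            value = node.limits.get(resource)
            if value is not None:
                limit = value if limit is None else min(limit, value)
            node = node.parent
        return limit
```

For example, after creating a `batch` group under the root with a memory limit, creating an `indexer` group under it, and attaching a PID to `indexer`, `effective_limit` for that PID returns the `batch` limit whenever it is the tighter of the two. The point is only that once the mapping exists, any resource controller can key its decisions off the group.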