Comment by dgxyz
20 days ago
About 25 years here and 10 years embedded / EE before that.
The problem is that containers are made of images and those and kubernetes are incredibly stateful. They need to be stored. They need to be reachable. They need maintenance. And the control responsibility is inverted. You end up with a few problems which I think are not tenable.
Firstly, the state. Neither docker itself or etcd behind Kubernetes are particularly good at maintaining state consistently. Anyone who runs a large kubernetes cluster will know that once it's full of state, rebuilding it consistently in a DR scenario is HORRIBLE. It is not just a case of rolling in all your services. There's a lot of state like storage classes, roles, secrets etc which nothing works if you don't have in there. Unless you have a second cluster you can tear down and rebuild regularly, you have no idea if you can survive a control plane failure (we have had one of those as well).
Secondly, reachability. The container engine and kubernetes require the ability to reach out and get images. This is such a fucking awful idea from a security and reliability perspective it's unreal. I don't know how people even accept this. Typically your kubernetes cluster or container engine has the ability to just pull any old shit off docker hub. That also couples to you that service being up, available and not subject to the whims of whatever vendor figures they don't want to do their job any more (broadcom for example). To get around this you end up having to cache images which means more infrastructure to maintain. There is of course a whole secondary market for that...
Thirdly, maintainance. We have about 220 separate services. When there's a CVE, you have to rebuild, test and deploy ALL those containers. We can't just update an OS package and bounce services or push a new service binary out and roll it. It's a nightmare. It can take a month to get through this and believe me we have all the funky CD stuff.
And as mentioned above, control is inverted. I think it's utterly stupid on this basis that your container engine or cluster pulls containers in. When you deploy, the relationship should be a push because you can control that and mandate all of the above at once.
In the attempt to solve problems, we created worse ones. And no one is really happy.
I get your points but I'm not sure I agree. Kubernetes is a different kind of difficulty but I don't think its so different from handling VM fleets.
You can have 220 vms instead and need to update all of them too. They also are full of state and you will need some kind of automatic deployment (like ansible) to make it bearable, just like your k8s cluster. If you don't configure the network egress firewall, they can also both pull whatever images/binaries from docker hub/internet.
> To get around this you end up having to cache images which means more infrastructure to maintain
If you're not doing this for your VMs packages and your code packages, you have the same problem anyway.
> When there's a CVE
If there is a CVE in your code, you have to build all you binaries anyway. If it's in the system packages, you have to update all your VMs. Arguably, updating a single container and making a rolling deployment is faster than updating x VMs. In my experience updating VMs was harder and more error prone than updating a service description to bump a container version (you don't just update a few packages, sometimes you need to go from Centos 5 to Centos 7/8 or something and it also takes weeks to test and validate).
I mostly agree with you, with the exception that VMs are fully isolated from one another (modulo sharing a hypervisor), which is both good and bad.
If your K8s cluster (or etcd) shits the bed, everything dies. The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
As to updating VMs, again IME, typically you’d generate machine images with something like Packer + Ansible, and then roll them out with some other automation. Once that infrastructure is built, it’s quite easy, but there are far more examples today of doing this with K8s, so it’s likely easier to do that if you’re just starting out.
> If your K8s cluster (or etcd) shits the bed, everything dies.
When etcd and/or kubelet shits the bed, it shouldn't do anything other than halt scheduling tasks. The actual runtime might vary between setups, but typically containerd is the one actually handling the individual pod processes.
Of course, you can also run Kubernetes pods in a VM if you want to, there have always been a few different options for this. I think right now the leading option is Kata Containers.
Does using Kata Containers improve isolation? Very likely: you have an entire guest kernel for each domain. Of course, the entire isolation domain is subject to hardware bugs, but I think people do generally regard hardware security boundaries somewhat higher than Linux kernel security boundaries.
But, does using Kata Containers improve reliability? I'd bet not, no. In theory it would help mitigate reliability issues caused by kernel bugs, but frankly that's a bit contrived as most of us never or extremely infrequently experience the type of bug that mitigates. In practice, what happens is that the point of failure switches from being a container runtime like containerd to a VMM like qemu or Firecracker.
> The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
The way I see it, mature code is less likely to have surprise showstopper bugs. However, if we're talking qemu + KVM, that's a code base that is also rather old, old enough that it comes from a very different time and place for security practices. I'm not saying qemu is bad, obviously it isn't, but I do believe that many working in high-assurance environments have decided that qemu's age and attack surface is large enough to have become a liability, hence why Firecracker and Cloud Hypervisor exist.
I think the main advantage of a VMM remains the isolation of having an entire separate guest kernel. Though, you don't need an entire Linux VM with complete PC emulation to get that; micro VMs with minimal PC emulation (mostly paravirtualization) will suffice, or possibly even something entirely different, like the way gVisor is a VMM but the "guest kernel" is entirely userland and entirely memory safe.
I think his point is that instead of hundreds of containers, you can just have a small handful of massive servers and let the multitasking OS deal with it